So, thank you all for coming. I hope that you're all tuned in and ready to deal with the world of unique identifiers and Excel challenges and the like coming out of Gary's talk. We're going to hang out together for pretty much the next day, and we're going to try to work our way through a lot of issues around gene regulation. I was listening to some of you describe your own lists and your projects, and there are some things I'm not covering here that I may try to add in later this afternoon, because I heard there were some people with exome lists and similar challenges, and we face those types of issues too. So I'll introduce you to a couple of tools that we have for finding our way to interesting sets of genes within exome analysis; I'll sneak that in in the afternoon as an extra piece.

What we're going to try to do today and tomorrow is alternate between me talking at you and you trying to do some stuff, flipping back and forth over the course of the time. The timing I'm sure won't be precise to the schedule, so if I finish a little early, we'll jump into the lab work a little early; if you're clearly saturated with the laboratory work, we'll jump back into the talks as the flow demands. But I think it's reasonably close to on schedule, and I'm multi-armed with multiple laser pointers, so we should be good.

Okay, so I'm a researcher who's particularly interested in gene regulation and transcription factors. In the context of this workshop, my goal is not that you become the master of all things in regulatory sequence analysis, but that you'll have a pretty good understanding of the types of information that are out there, how you can go and get some of that information, and how you can do a little bit of manipulation of it. Some of those things are general-purpose tools, so you'll have some things that will be useful to you no matter what you're doing, and some of them are a little more specialized. As you move into the pathway components of the class, you're going to see some of the same themes coming up again, so you'll see the utility of the general-purpose methodology.

I'll also note that the slide pack you have has a few typos and a few things that I've swapped out in the presentation, and I'll continue to refine the presentation as I go along, so there may be a few additional changes from your paper deck. There's really only one added slide so far, so the numbering on your slides shouldn't be too far off, but you'll see a few small changes. It's all on the wiki, by the way, so if you're following on your screen, it's on the wiki for you to grab.

Okay, so over the course of this flow, we're going to do an overview of transcription. That's going to be ultra-fast: you're not going to get the Biochemistry 405 version, you're going to get Biology 291. You're going to get some information about the prediction of transcription factor binding sites, because it's core to everything else that comes along, so you'll need that understanding as a basis for some of the things that happen thereafter. Then we're going to explore the detection of novel motifs, which is pattern discovery: given a bunch of sequences where you don't really know what the pattern is, can you recover the transcription factor binding sites that are in there?
By the time you get to that, you should have a pretty good idea where a lot of the foundational components you're using in regulatory sequence analysis are coming from. The next phase after that is the interrogation of sets of co-expressed genes: you've got your gene list, and you want to identify whether there might be one transcription factor, or multiple transcription factors, acting on those genes, and we'll talk about the methods that are used for doing that. For each of these we'll do lab sections where we go through and try some of the tools that use these methods, so that you can get your hands on them. If you're superstars, you can plow through the one that I give you, and I've got a few other tools on the list for you to go and try as well. The integrated assignment tonight will unite some of the parts from these: you have a chance to take the canned version this morning and work your way through it with a lot of support, and then tonight you'll have a chance to do it on your own for another set. And if you have your own gene list, of course, you're welcome to try some of these things on it. Then tomorrow we're going to drive into regulatory networks, which essentially means looking at combinations of factors acting together, instead of one factor at a time, which is the focus of today. Is that good? OK. I'm a big believer in hands up and questions, so if there are things you've heard that you don't understand, or you don't see how one thing relates to another, please ask, because that's the great benefit of having a workshop in this style, and you'll be much happier if you're talking.

OK. So let's talk about transcription. I'm going to be focused pretty much on human and mouse. The good thing is that transcription is pretty much universal, and the methods here are largely universal. Some of the ChIP-seq things get a little bit different when you're working in bacteria, but by and large the bulk of this holds throughout multicellular organisms.

OK. So this is the view of transcription factors that much of bioinformatics is based on. There's a protein, it binds to DNA, it casts out this magical mystery signal, it brings the polymerase machinery to the transcription start site, and we can now proceed with making RNA. We all know that that's too simplistic, that we have a much more complex and rich world of chromatin structure and gene regulation control mechanisms. But by and large, the first box that I'm going to talk about today is focused on this concept of a protein sticking to the DNA.

But let's put ourselves in the right frame of mind about this. Here's some terminology. Everybody uses this slightly differently, so there's no perfection in what I'm telling you here, but I'm going to tell you how I use some of the terminology, so that when you hear me saying things you'll know where it's coming from. And I'm also going to try to educate you a little bit about transcription. I feel like I'm blocking your view over here; you're good? OK. So here's our DNA. Here's a gene that's situated along the wall here. The gene is a multi-exon gene; there are more exons cascading down the wall. At the start of the gene is what we'll call the transcription start region. Some people refer to that as a transcription start site.
You'll see it oftentimes in the literature defined as a transcription start site with a specific coordinate. One of the things we've learned over the last few years about regulatory sequences is that it's almost never a single transcription start position; it's a region where transcripts initiate. There's a small subset of genes which have a particularly well-defined transcription start site, but in almost all cases, it's a messy smudge of transcription start positions that are there together. For nomenclature purposes, that makes a bit of a disaster, because a lot of things in the literature describe positions relative to some start position, usually without defining which start position they're looking at. They'll give you some magic number like minus 79 or plus 265, but you don't know what plus 1 was, and depending on which transcript you look at, you'll see different things. So if you're going to refer to a transcription start in a paper, please refer to a specific coordinate on a specific assembly so that we all know what you're talking about. I refer to it usually as a transcription start region; occasionally I'll lapse and say transcription start site.

The transcription start region has around it what we'll call the core promoter sequence. The core promoter sequence is the region roughly from about minus 100 to plus 100, and it's really the sequence that's involved in positioning the polymerase complex to allow transcription to initiate. In some genes there's a TATA box; about 30% of transcription start regions have a well-defined TATA box, and the rest of them don't. Some have a downstream promoter element, which I'm not showing here, at about plus 30 relative to the transcription start. These core promoter sequences are what people classically mean when they talk about promoter regions. A lot of the literature will refer to promoters and think about them much more broadly: they'll say something like "the promoter region" and mean a few thousand base pairs on either side of the transcription start region. I tend to refer to those as proximal regulatory regions, because there's relatively little to distinguish a proximal regulatory region from a distal regulatory region, meaning one that's far away. There's a fuzzy line between proximal and distal; it's just a frame of mind about what you consider proximal and what you consider distal. Usually proximal means it's right up against the core promoter region.

The proximal and distal regulatory regions have transcription factor binding sites within them that are involved in promoting the transcription of the gene. These distal regulatory regions can be located very far away. When we look at analysis tools, oftentimes we focus on sequences that are relatively close to a gene, but the distal regulatory regions can be upstream; they can be in the introns; they can be in the non-coding exons; there are even a few cases where they overlap into the coding sequences, though that's pretty rare. They can be downstream. They can be hundreds of thousands of base pairs, even millions of base pairs, away from the gene, because in the three-dimensional structure of the nucleus, those things are not necessarily very far away in real space.
There are even a few cases in the literature where people suggest that there are regulatory sequences on other chromosomes acting on genes, because in three dimensions you can, in theory, have another chromosome close by to a promoter and acting as a cis-regulatory sequence, in a kind of weird definition of cis. So determining what is a regulatory region is hard. Determining what gene a regulatory region is acting upon is even harder, because regions can skip over genes, again because of the three-dimensional folding that goes on in the nucleus.

From this slide, you should have a few terms: core promoter regions; regulatory regions, distal and proximal; transcription factor binding sites, or TFBS; and orientation. In general, the core promoter region is thought to be more of an oriented sequence, so that its primary effect is in a certain direction. There are all sorts of promoter regions that are bi-directional, and in fact there's plenty of evidence now that when a polymerase complex is recruited to a region, a subset of the time it's gonna be recruited and spin a transcript off in the wrong direction. So you get a lot of transcripts going all sorts of places when you get deep into RNA-seq data, because you'll see that the polymerase isn't that perfect a machine and it will be producing things. Whether that's biologically relevant or not is unclear, so there's a lot of discussion in the literature around what it means that you're producing transcripts willy-nilly. So the promoter region is generally thought to have a directionality to it. For the regulatory regions, the distal and somewhat proximal ones, most of the evidence suggests they can be turned around and are fairly orientation-independent. That's said with some caveats: there are some orientation characteristics to many of them, but by and large they should function in either direction.

Okay, so that's the simplistic view, where the DNA is a naked sequence you're looking at on a line on a page. But as we all know, this regulation occurs within the complex three-dimensional structure of chromatin, and there are many layers of regulation going on on top of the transcription factors acting on the DNA. So when we talk about the core promoter region and this setup of the polymerase complex, we know that these transcription factors act through multiple mechanisms. They act through co-activator complexes, particularly through histone-modifying systems that open up the region and make it more accessible, or maintain the region as more accessible. There are additional factors, like the Mediator complex, which are involved in bridging the interactions to the polymerase complex. And then within the system we can start thinking in a very detailed fashion about chromatin structure: how the DNA is wrapped around the histones, and the modifications that can be made onto the histones that promote their packing or unpacking to allow the regulatory system to come into play.

And so when you go and look at the data available to you, you find that there are rich sources of data that have been generated over the last few years. Those projects, particularly the ENCODE Project but a couple of other relevant ones as well, have done a very nice job of revealing regulatory regions, particularly in a few very well-studied cell types.
Outside those few very well-studied cell types, it's a little less clear, because each of these is context dependent: it depends on the type of cell, the developmental state, the physiological conditions, and the like. So what types of laboratory data are available? This isn't a complete list, and you'll think of other ones as you go along; I've added a few to the page on the screen compared to what you have in your notes.

Defining where the promoter regions are located has largely been resolved by a combination of two related methods. RNA-seq has done a brilliant job of identifying where transcripts are being produced, and CAGE data in particular has revealed most human and mouse promoter regions at this point. You might ask, what's CAGE data? CAGE (cap analysis of gene expression) is a technology that's oriented towards capped RNA, so mature RNA products. It's been developed largely by the RIKEN Institute in Japan, where they run a series of projects called the FANTOM projects. Online, in the UCSC Genome Browser, you can access some of the older FANTOM data sets, which map out these promoter regions. There's a new FANTOM project that will be published later this year, and in that project they use a much higher-throughput methodology for their CAGE analysis. They profile over 1,000 cell lines, cells, and tissues to get a breadth of promoter regions across human and mouse systems. So all of a sudden we're gonna go from a decent, pretty good understanding of where promoters are to what I would view as an almost complete understanding, because we're gonna have all of that information. It's also quantitative, so it's giving you expression data about where each of the promoters is active across those cell types. That data will be published probably in September or October, and it will be pretty transforming in terms of how we think about transcription starts. We also use epigenetic marks, these histone modifications made covalently onto the histones at a variety of positions, and there's been a fair bit done with ChIP-seq on the polymerase complex itself. But by and large, the CAGE data, I think, is gonna be the principal method and the principal data set for defining where the promoters are going forward.

Okay, so then how do you define the regulatory regions, either distal or proximal? Where are these active, open regions where the DNA is being bound? There have been a few sets of data generated over time, and they are all providing useful information at this point. Coactivator ChIP-seq has been used a few times: this is where you take those coactivator proteins, like p300, and do ChIP-seq experiments on them. There have been a relatively small number of papers, but what's been shown is that they're exquisite for locating where these regulatory enhancers are and how active they are. So where you have coactivator ChIP-seq data, it will do a pretty good job of defining where the active regulatory sequences are in the system. The problem is that there are about 11 coactivators in the human system, and people are only running decent ChIP on about two of them. For the other nine or so, the antibodies are not quite good enough and not quite getting clean data, so those nine are still somewhat invisible.
The epigenetic marks, looking at histone 3 lysine marks at the fourth position (H3K4), the 27th position (H3K27), and the other positions (I'm not gonna give you an epigenetics lecture on every different mark that's possible), have also been shown to correlate highly with regulatory regions. This has been a primary product of the ENCODE Project, which I'll talk about in a minute, just to give you a sense of where some of the data is coming from. More recently (it's been running for a while now, but there have been a few new generations of data sets) there's DNase I hypersensitivity. DNase I hypersensitivity is an old biological assay: if you mix DNase I with chromatin, where can it get access to nick the DNA? What a few groups have been doing, particularly John Stamatoyannopoulos's group at the University of Washington, is showing that you can do exquisitely deep sequencing on a DNase I hypersensitivity assay, and not only identify those open and accessible regions, but actually footprint where the transcription factors are sitting. You can actually read off where the individual proteins are protecting, when you go deep enough in the DNase I hypersensitivity systems.

And interestingly, as a sort of parallel product of the FANTOM project and a few others, it has now been shown that you actually get small amounts of RNA being generated out of these active enhancer sequences. These are not genes, but what the FANTOM project saw, when they did their extraordinarily deep sequencing (which they did not do on all their samples, but on a smaller subset), was that you get a small amount of RNA being generated, going about evenly divided on either side of enhancer sequences. That is now revealing tens of thousands of enhancer sequences quite reliably and exquisitely. The benefit is that because it's RNA-seq data, you actually know in which expression context those enhancers are active: where you see the data, you know that enhancer was active. It is, however, cost prohibitive to some extent, because you're looking to go about 10 times deeper than you do for normal RNA-seq. As the technology moves along, you can get there.

Okay, so these are our experimental methods for defining where promoters and regulatory regions are situated. My own passion is transcription factors, so I really care about these individual proteins that stick to the DNA. There are about 1,500 of them in the human genome. We only have deep data for about 120 or so of them; we have some data on about 250 to 300 of them; and for about a thousand of them, we really just don't know very much. We can have some sense of their binding specificities, and that's about it. Most of the data that we get on transcription factor binding sites is now generated out of ChIP-seq. So, individual... what is it? FAIRE-seq, yeah. That was developed principally out of Jason Lieb's group in North Carolina, and it has been a very nice way of identifying things. In my experience, looking at it, there's a very, very high correlation between it and the DNase hypersensitivity data, so I think they're both getting at the same general concept of open regions, but the FAIRE data sets have been very, very high quality data sets for a while. So thank you for pointing that out. Other ones that I missed that you love? Okay.
So, back to the transcription factors. Most of the data that we now have is from ChIP-seq on transcription factor binding. When you look at all the detailed laboratory analyses done over the years, they've revealed something on the order of a few thousand regulatory sequences, tested classically by putting them in front of promoters in reporter systems, making mutations in them, and doing gel shifts to test binding. These are detailed, time-consuming projects. So there have been a few thousand regulatory regions defined by the old classical methods, and then any given ChIP-seq experiment will generate about five times more than all that had been done in individual studies over time. So ChIP-seq is the way to go. It is limited to those cases where you have a decent antibody for the transcription factor you're interested in, and that antibody limitation is quite restricting, because not every antibody works for chromatin IP, and not every transcription factor has been given enough attention to develop a good antibody.

Okay, so this is an extra slide that I tucked in here because I wanted to talk to you a little bit about ChIP-seq, mostly to cover some terminology that's gonna come up in the lab we're gonna do in a little while. This is a picture of the UCSC Genome Browser, with tracks from a few different experiments. In the tracks that are here, we're seeing some histone modification marks. This is an H3K4 monomethylation mark in this track. Normally it would be a layered track with multiple cell types, but on this particular track I've restricted it so that we're only looking at cells from one particular cell line. What you see plotted along the genome is an indication of the signal strength coming off the ChIP-seq data; essentially a depth of data compiled in the high-throughput sequencing of the chromatin IP reaction.

Does everybody know about chromatin IP and what those reactions are? Would anybody like a description of it? Okay, so let me do a quick description of chromatin IP. I can draw it for you later, but I'll do it verbally for a quick second. For a chromatin immunoprecipitation reaction, you take a cell or sample that you're interested in; you're going to try to see what proteins are sticking to what pieces of DNA. What you initially do is cross-link, in some way, the protein-DNA mixture together, so you get a covalent attachment between the proteins that are there and the DNA. Then you shear the DNA in some manner; there are a few different methods for doing that, but basically you're gonna cut the DNA up into smaller pieces. So now what you end up with is smaller pieces of DNA, in some places covalently attached to a protein. You then take an antibody that recognizes that protein specifically, and that antibody sticks to the protein that's stuck to the DNA. You wash away all the other stuff that the antibody didn't stick to, and now you have a complex: your antibody, attached to your protein, attached to your DNA. Then you reverse the cross-link, so you break the covalent linkages; you recover the DNA that's there; and you take that into a high-throughput sequencing machine and see what pieces of DNA came down.
Now, when you do that from a bulk sample (it's not a single cell, though there are some efforts to get to single cell; it's a bulk mixture of cells), you take all the DNA sequence data that you compiled and you map it onto the genome, which I'm not gonna get into right now; I'm happy to talk about it with anyone who wants to later on. Then you essentially look to see what the weight of evidence is: how many reads do you have coming down at any given point?

The scoring and evaluation of that can take into account a few different ways to correct for background. In a ChIP-seq experiment, there is an extreme bias: any ChIP-seq experiment with no antibody involved at all, just recovering DNA that shears, is gonna give you regions that are prone to be at promoter regions. So you're gonna get open chromatin, in a very biased manner, out of a ChIP-seq experiment. In the early days of ChIP-seq, you'd see these wonderful papers where someone would say, "my ChIP-seq experiment worked, because most of my reads are around my transcription start sites." But what they didn't tell you was that if they did the same experiment with no antibody, they would get the same result.

Okay, so what you then do is try to make some correction for background. So much of informatics is built on this idea of a foreground versus a background, and you're gonna hear that over and over and over again over the next three days. You take your foreground and you ask, how many reads am I seeing in my foreground? You compare that to some background and ask, how much am I seeing in my background? Am I shocked by what I'm seeing, or does it look pretty much like the background? There are now some ways to computationally simulate the background, or to take in other data sets that give some better measure of it. So depending on the peak calling tool that you use (the software for doing this test), some will use an experimental background and some will use a more computationally derived background. In either case, you ultimately emerge with, at each position in the genome, some indication of how strong the signal is in your foreground against your background. That's essentially what you're looking at in these plots when you go to the genome browser, or any other tool that lets you look at ChIP-seq data: a measure of how strong the signal is versus the background.

What you'll see when you're looking at ChIP-seq data for transcription factors (this one is a particular histone modification, which is similar) is that there's a lot of low-level noise. Across the genome you see a lot of stuff just coming through, because this chromatin immunoprecipitation experiment is kind of messy; you're gonna get a lot of weird stuff coming down, so you have a base level of noise. And then in some places you're gonna see a lot of stuff going on: a lot more reads coming down than you would have expected by your background chance. So we've talked about reads, which are the individual sequences that came out of your reaction. What you'll see then is that there are these blobs where there's stronger evidence of something happening, and those are called peaks. Generally, in the terminology, the peak is the whole region in that blob.
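Before we get into the anatomy of a peak, here is the foreground-versus-background idea in minimal code form. Everything in this sketch is illustrative: the read counts, the single window, and the simple Poisson background model are assumptions for the example, not how any particular peak caller actually works.

```python
from math import exp, factorial

def poisson_sf(k, lam):
    # P(X >= k) under a Poisson background with mean lam,
    # summed directly (fine for the small counts used here).
    return 1.0 - sum(exp(-lam) * lam**i / factorial(i) for i in range(k))

# Hypothetical numbers for one genomic window:
foreground_reads = 48   # reads observed in the ChIP sample
background_rate = 6.5   # reads expected under the background model

p = poisson_sf(foreground_reads, background_rate)
print(f"P(>= {foreground_reads} reads | background) = {p:.3g}")
# A tiny value says "I am shocked relative to background";
# a large one says the window looks like background noise.
```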
So there's a peak here, and it has a start and an end. You apply your peak calling software, and depending on the software you use, you'll get slightly different variants, but you'll get a start and an end position. Now, within that peak, not all positions are treated equal: somewhere within that peak there's gonna be a maximum position, and that's called the peak max. That's actually a fairly useful piece of information to have, as I'll show you a little bit later on. So we've now got a start, an end, and a peak max. The peak max need not be in the middle of the peak; in fact, it's almost never in the middle, it's somewhere off from the middle. A lot of software will assume that when you give it peak coordinates, it should treat the middle as the peak max. So you need to know, when you're using software to analyze this type of data, what it's using and what it's doing, because you may need to give it the peak max separately, or you may need to manipulate your coordinates to use the peak max as the center. There are some issues that come into play in how you use that peak max position.

There are an enormous number of peak callers. Bioinformaticians flock like insects to data: they'll see a new type of data come along, and then a hundred groups around the world will flock to that data type and develop their own unique tool and their own unique method. So these things proliferate like crazy, and then the community decides which of them are satisfying; the rest die off, and you're left with a few less.

Question here? It's not always clear, so you have to read the documentation. Usually, not always but usually, the score that you'll see will be associated with the peak max. Sometimes you'll see something that's more of an overall average of the peak, so it is inconsistent how they handle it. Depending on the software you use, they'll have different attributes. Some things that you might call as two peaks by eye might be joined together because they're not far enough apart in space; the software might just join a bunch of nearby things together. And just because you have a start and an end doesn't mean you have a nice Gaussian-type shape; you might very well have something that looks more multi-peaked. So does that give you a sense of what ChIP-seq is and what the data looks like?

This type of data is available plentifully online, and it's big, as Gary was mentioning earlier. These are enormous data sets, and they're not always easy to manipulate. They're terribly inconvenient for classroom-based instruction, as you're gonna find out in the next little while, but we'll do our best. There are a few places to go online to gather large amounts of the data. The UCSC Genome Browser is a particularly convenient format: they keep large amounts of this data available as downloadable tables that you can pull out. The ENCODE Project, which has a specialized repository within the UCSC system, has the bulk of the data that's been generated and released to the public up till now. This is, by and large, focused on chromatin immunoprecipitation-based profiling: looking at epigenetics, looking at about 100 transcription factors, across up to 50 cell types, I believe, in the biggest case. So you can go there and pull down large amounts of data.
There's also (not shown with a link here) the modENCODE Project; for those of you who work with model organisms, a few of the model organisms have been profiled in a similar manner. All sorts of papers will also have their own repositories, where they'll have some specialized set online that you can go to and pull out information on their system. The Gene Expression Omnibus, or GEO, which many of you may know about, has expression data, but it also has chromatin immunoprecipitation data. So you can grab a lot of data straight out of GEO, which probably has more of the independently published data, as opposed to the large bulk data sets. For transcription factor ChIP-seq data primarily, and a few other types, you can go to these places. There are also a couple of compiled data sets: PAZAR is a project run out of my lab where we try to compile transcription factor ChIP data into a common place, and there's a group at Johns Hopkins that runs one called hmChIP, from which you can pull out some of the ChIP-seq data. If you're interested in other model organisms, during the course of the afternoon we can go online and take a look to see where the best repositories are for you; we can track down most of them.

Okay. So that was a brief overview of transcription and some of the key types of data. Now we're going to go into the informatics pieces, where we talk about how to use some of this data, and we're going to focus on transcription factor binding sites. Some of the methodology also relates to microRNA-type target analysis, so you're going to see a generality to it.

Okay. So our job is to teach a computer to find a transcription factor binding site. This is work that dates back to the 1980s, so it is not a new domain. I'm going to walk you through the core approaches that are still the popular methods in use today, and then I'm going to tell you that it's all washing away, that over the next five years we're not going to see these types of things so much anymore, and I'll tell you why. But for the next five years, this is probably the dominant approach you're going to encounter.

Okay, so old-days style. Some poor graduate student slogged through a gene. They mapped out a regulatory sequence by taking pieces of the gene and testing whether they would drive expression in the context of interest. Through mutagenesis and deletion analysis, they eventually mapped out a single transcription factor binding site that their protein sticks to, and they might even know what the protein is. And then they got their PhD and graduated in 1979; that's what you had to do for a PhD around 1980. In the early 80s, you could do it by getting a bunch of binding sites, either by testing large numbers of things or by using some higher-throughput assay. That was largely a SELEX assay, where you mix a random pool of DNA with the protein you're interested in, see what DNA sticks to the protein you're studying, wash away the stuff that doesn't stick, repeat a few times, and get a purified set. In that way you could get a bunch of binding sites. And then computationally you could align all those binding sites together; and it wasn't always done computationally.
Sometimes the poor graduate students would sit there and manipulate the alignments by hand. Ultimately, you get some sort of alignment of binding sites, and then what you can do is count how many times you see each nucleotide at each position. You may occasionally see a consensus sequence; this is an IUPAC degeneracy-code-based DNA consensus sequence. That's pretty much washed away: people don't use consensus sequences much anymore, because they don't quantitatively reflect the data, and you can do better. So most of the time, what you'll do is just count every nucleotide at every position and record that in a matrix. That matrix is called a position frequency matrix, or PFM, and it corresponds to the positions of this alignment. So in the first position, the first column of the alignment, you're gonna see 14 As, three Cs, four Gs, and zero Ts.

Okay, so that matrix is the bread and butter of transcription factor analysis, and it's what you're gonna be using, in some form, for the rest of the day: you'll be generating them, you'll be applying them, you'll be using them in bulk to study sets of genes. Looking at them directly is terrible, so we have a better way to look at them: sequence logos. A sequence logo is essentially a measure of the information content: how strong is the pattern at each column of this matrix? In DNA, which has four possible outcomes, you have two bits of information. Each yes-no question is a bit of information: is it a purine, yes or no? If yes, then you can ask, is it an A? By asking two yes-or-no questions, you define which base is at a position. So in an information content logo plot for DNA, the maximum possible height at a position is two bits, and you'll see that some positions are well-informed and some positions are relatively uninformed. What does this reflect? By and large, it reflects where you have direct physical contacts between the transcription factor and the DNA. Where you're getting a physical connection, an amino acid touching a base, you're gonna see very strong information content. Where the protein comes away from the DNA and doesn't have a direct interaction with the base, it doesn't really matter what base you put there. So you can imagine, from the nature of the protein sticking to the DNA, that there are gonna be contact points that are high-information positions, and there are gonna be removed points that are not important and are variable.

Now, there are all sorts of caveats about the use of matrix models of transcription factor binding. They are very good for systems that follow the consensus, which covers about 95% of transcription factors behaving nicely. About 5% of transcription factors do not behave nicely, and I'll tell you about those in a couple of moments.

Computationally (ignore the brackets here on the slide), we don't use the frequency matrix as the computing tool. We use what's called a position-specific scoring matrix, or position weight matrix. That is a system that converts the frequencies to a weighted score: you weight the frequency that's observed against the background probability of seeing that frequency. By and large, most uses of matrix models assume that the genome is 25% A, 25% C, 25% G, and 25% T.
If you're working in an organism that is massively outside that range, then you would probably want to retune your matrix models to use the actual background frequency, but most of the time it's assumed to be 25%. And it's a log-converted scale, which allows for computationally efficient tools, because we can add logs to get a score, as opposed to dealing with more computationally intensive multiplication steps to get a total probability. So this matrix here goes through this conversion and gets converted to this matrix over here.

Now, you'll notice this S value over here: that's called a pseudocount in the system. The reasons given for having a pseudocount vary. Some people say it's a weight for the confidence in the pattern, and some people say it's because if you take a log of zero, you have a problem, so you're gonna have to stick some value in there. The way that pseudocount is assigned varies from tool to tool, but usually it's something like one over the number of sequences contributing to the frequency matrix. So for instance, here you'd add a pseudocount of 0.2 to each of those zeros, which reflects that you're not absolutely confident those zeros are real. If you had a thousand sequences, you'd add one over a thousand, which says, okay, the zero probably really is zero.

Okay, so now you have this position-specific scoring matrix, often abbreviated PSSM and pronounced "possum"; you'll hear "possum" mentioned a few times over the next 24 hours. Now, given any DNA sequence, we can assign a score to it, simply by taking the corresponding cells from the matrix for the given nucleotides. Do we have any questions so far? Okay, so you sum up the scores and you get a total score for the sequence.

Here are a couple of comments about those scores. This is a matrix for SP1, the much-studied transcription factor. Here's the sequence that we're scoring: we take the corresponding cells, we add them up, and that gives us some sort of absolute score. Absolute scores vary depending on the matrix: they depend on the width of the matrix and on the number of sequences contributing to it. So the absolute score, in and of itself, is specific to the matrix and doesn't have any meaning if you're trying to generalize across a database of matrices. What you'll often, though not always, see in the tools is the use of relative scores. A relative score essentially places the score on a spectrum, with 0% being the minimum possible score and 100% being the maximum, so it's a statement of where you fall within the range of possible scores. The nice thing about relative scores is that you can apply them to any matrix, so you'll see some tools that use them. There's also an increasing use of empirical p-value scores. Instead of taking the relative score, they take a pool of sequence of some sort (and they have to define what that pool is) and generate the distribution of scores. What you're gonna see, for almost all PSSMs, is that the distribution is an extreme value distribution: it looks a little bit like a Gaussian, but with a long tail to the right.
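Pulling the matrix mechanics together, here is a minimal sketch: counting aligned sites into a position frequency matrix, converting to a PSSM with a pseudocount and the uniform 25% background, then scoring a window and expressing the result as a relative score. The five example sites are made up, and real tools differ in their exact pseudocount schemes.

```python
import math

# Hypothetical aligned binding sites (all the same width).
sites = ["AAGTCA", "AAGTTA", "AGGTCA", "AAGTCA", "TAGTCA"]
width, n = len(sites[0]), len(sites)
bases = "ACGT"

# Position frequency matrix: count each base at each column.
pfm = {b: [sum(1 for s in sites if s[i] == b) for i in range(width)]
       for b in bases}

# PSSM: log2 of (corrected frequency / background), with a
# pseudocount of 1/n as discussed above and a 0.25 background.
pseudo = 1.0 / n
pssm = {b: [math.log2(((pfm[b][i] + pseudo) / (n + 4 * pseudo)) / 0.25)
            for i in range(width)]
        for b in bases}

def score(seq):
    # Sum the corresponding cell for each nucleotide (log scores add).
    return sum(pssm[b][i] for i, b in enumerate(seq))

# Relative score: where this window falls between the worst and
# best possible scores for this matrix, as a percentage.
best = sum(max(pssm[b][i] for b in bases) for i in range(width))
worst = sum(min(pssm[b][i] for b in bases) for i in range(width))
s = score("AAGTCA")
print(f"score={s:.2f}, relative={(s - worst) / (best - worst):.1%}")
```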
What they'll then do with that score distribution is take some threshold based on it and determine an empirical p-value, which is essentially the amount of sequence, the area under the curve, that you're allowing to the right of the threshold. So those are p-values. I'm not terribly fond of the p-value representation, because it makes you think there's some sort of significance to it, and I'm gonna tell you in a moment why I think you need to be careful with the interpretation of both of these.

Okay, there are several databases of these profiles, and that's where it becomes interesting to you, because all of a sudden you can take a gene list and look to see whether there are sets of these profiles showing enrichment on your gene list. There are the old-timers: TRANSFAC, and JASPAR, which is one that my group makes. There are a few others that have come out over the last few years that are quite good. There's one called SwissRegulon, made in Switzerland, which is good quality. There's a group in Russia that's been doing a very nice job lately with one called HOCOMOCO; HOCOMOCO basically goes to several of the databases, pulls out all the profiles they can get, and combines them together. And then there's another type of data, available through the UniPROBE database, which is a protein binding microarray format; I'm not gonna get into the details, but if you're interested, I can talk about it. So you can go and find databases of these profiles. There are a few hundred high-quality profiles available, and for the protein binding arrays you can go up to larger sets; those have some slightly different characteristics from the other profiles.

Okay, so let's go through the good, the bad, and the ugly. The good: when you test these predicted sequences in vitro, generally the protein will stick. So if you call something a binding site using one of these tools and you test whether the protein will stick to it, it will. And it's been shown repeatedly that there's a strong correlation between the score and the binding energy to the factor: as you look at the strongest sites, the binding energy at the upper end is correlated with the score itself. That sounds pretty good. The bad: Fickett and others have shown that these profiles make predictions all over the place. When you take one of these tools and apply it to a sequence, you're gonna make predictions all the time. And it's not hard to see why. You're looking at a relatively small pattern, with relatively few positions that are highly informed, and you're scanning a multi-billion base pair genome, double-stranded at that. How often are you gonna hit it by chance? The reality is, depending on your thresholds, for each profile you're gonna see predictions anywhere from one every 500 base pairs to one every few thousand base pairs. So when you analyze your gene, and you might take a 10,000 base pair gene, even in the best case you're probably gonna make four or five binding site predictions for any given transcription factor, and for some transcription factors you might make 100 predictions.
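You can see that prediction rate for yourself with a toy scan. The sketch below uses a completely made-up 8-column PSSM (standing in for a real profile from a database) and scans one strand of random sequence at an 80% relative-score threshold; with these invented numbers it typically lands in the ballpark just described, on the order of one hit every couple of thousand bases.

```python
import math, random

random.seed(2)
bases = "ACGT"

# A made-up 8-column PSSM (log2 odds vs. a uniform background):
# 85% preference for one base per column, 5% for each other base.
pssm = [{b: math.log2((0.85 if b == motif_base else 0.05) / 0.25)
         for b in bases}
        for motif_base in "TGACTCAG"]

best = sum(max(col.values()) for col in pssm)
worst = sum(min(col.values()) for col in pssm)
threshold = worst + 0.80 * (best - worst)   # an 80% relative score

# Scan 100 kb of random sequence, counting windows above threshold.
genome = "".join(random.choice(bases) for _ in range(100_000))
width = len(pssm)
hits = sum(1 for i in range(len(genome) - width + 1)
           if sum(col[genome[i + c]] for c, col in enumerate(pssm))
              >= threshold)
print(f"{hits} hits in 100 kb, about one per "
      f"{100_000 // max(hits, 1)} bp of random sequence")
```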
And that leads to the ugly, which is that if you scan a piece of DNA with these things, you're gonna see predictions all over the place. So we have a specificity problem. And we know from the good part that the specificity problem isn't really inherent to the matrix itself, because the matrix reflects what the protein will stick to. What we understand now, thinking about it in reverse, is that in order for a protein to stick to a piece of DNA, it's gotta be able to get there. If that piece of DNA is locked away, sequestered in chromatin structure where the proteins can't get in, it's not gonna stick. If it's in an open and accessible region, then the protein has a chance to get there and hang around. So when we use these types of methods, we're working in the dark unless we combine them with additional layers of information. That's pretty important.

Okay, this is my futility conjecture, which says that binding site predictions are almost always wrong. There have been all sorts of papers written over the years, still a few coming through today, where people say, okay, I'm gonna take some bulk set of transcription factors, scan some bulk set of genomes, and make some prodigious claim about how important something is, and the reality is they're reflecting a bunch of junk, some other property entirely.

Okay, there's a conundrum with this data, and it's that, counter to intuition, the ratio of true positives to predictions fails to improve at stringent thresholds. What does that mean? There's a natural tendency, any time you're using a tool, particularly an informatics tool, to say: if I take a more stringent cutoff, I'm gonna get better results; I'll get fewer predictions, but they'll be better. For transcription factor binding sites, that isn't really true beyond a certain point. What you see is that, for a while, you do get an improving positive predictive value, meaning the proportion of predictions that are correct, but there's a point beyond which you no longer get an improvement. Why is that? It's because of two things. One is that transcription factors don't need the perfect binding site to stick: they can stick to a sequence with a less-than-perfect binding site; it just means they're binding with less energy. That doesn't make it any less biologically relevant that the transcription factor is getting there. In fact, there's literature showing that sometimes these binding sites have been tuned to get the transcription factor there at the right amount, at the right levels. So functional sites may very well be evolutionarily selected to be less than optimal and still be functional. That's one key piece.

There's one other piece to think about, and it relates to the nature of how transcription factors work. There's a tendency to think, going back to our very first slide, that transcription factors come from outer space, land on the genome at exactly the right spot, and decide whether they stick there or not. That's not how it works. So how does it work? I had a science illustrator come into my lab a couple of years ago, and she made this really cool video. If you're on YouTube at all, you can go and look up the Stroma videos. I'll show it later, just for fun.
But basically, the way transcription factors work is that they load onto the DNA in a non-specific fashion, interacting with the helical backbone of the DNA. They then slide along the DNA, essentially along the backbone, engaging the DNA with non-specific binding energy. When they get to a point where it's convenient for them to bind, they have the potential to interact with the bases inside the helix, stabilize, and be there for a while.

What you're seeing here is a paper by Maerkl and Quake. What they did was predict the binding energy using matrix models, basically; that's the x-axis. And then they measured it. This was an exquisite experiment: they actually measured the binding energy for each of these different sequences using a microfluidic system they had developed. What you see is that there's a correlation between the predicted binding score and the measured binding score up to a point, and then all of a sudden it flattens out. What you're seeing there (they didn't state this in their paper, but I'll tell you) is the non-specific energy. At some point the binding site is not good enough, and you're engaging the DNA with non-specific energy. So your matrix models are improving, the positive predictive value is improving, through this stretch, because you're transitioning from a non-specific interaction to a specific interaction. But across here, it's really about what's the functionally most desirable sequence, what the genome is doing, what the sequence variability is, what mutations are accumulating; all sorts of other issues about where a site falls within that data. Okay, so now you have a better understanding of where that data is coming from.

Just to scare you a little bit: there's a very cool researcher in Sweden named Johan Elf, who has been visualizing transcription factors at the single-molecule level in living cells to see how they interact with DNA, and also with naked DNA. What he's shown is that they essentially don't stick. Despite the way we think about transcription factors, with this whole view of them sticking: in the cell, they stick for microseconds. So the question becomes, what is this whole view of transcription factors that we have, when the evidence says they stick for microseconds at these spots? There's a whole new wave of research being initiated to ask, okay, are they really just coming in and maintaining some sort of epigenetic state on the region, so that you have a continuing flow of transcription factors over the region?

Question? No, that's just sticking to the backbone versus sticking inside the helix; that's all you're seeing here. In those studies, they engineer the specific target sites of the factors. Now, who knows whether multi-protein complexes might give some additional stabilization, whether multiple proteins might stick together longer. But by and large, when you get to the biophysical crowd, the trend right now is to think about these things as very brief visitors to a location.

Okay, so what have we learned in this section? Matrices reflect in vitro binding properties pretty well. Suitable binding sites occur far too frequently to reflect in vivo function.
And bioinformatics methods that use position-specific matrices for binding site studies are gonna have to bring in additional information.

Okay, I'm gonna flip through these slides ultra-fast, because it's a slightly older methodology, but it conveys how you can use other information to filter. Phylogenetic footprinting is a conservation-based approach: some regulatory sequences are conserved over evolution, so if we look at the conservation patterns of a gene, we can see that coding sequences are well conserved, and we see evidence that certain regulatory sequences are conserved too. The data right now suggest that maybe a third of regulatory sequences have a strong evolutionary component, and about two thirds seem to be highly variable across species. It depends on the factors and on the context, but by and large only a portion of them are conserved. This just shows you the futility plot again, and says that if you filter to focus on those regions where there's a strong pattern of conservation, you can greatly reduce the number of predictions and focus your attention a bit; you can get rid of about 90% of the predictions using that type of methodology. You can use phastCons scores for this, and there are tools online for doing this type of analysis; there are some links here to some of them, and some additional tools for related things. The same concept as phylogenetic footprinting can be used with epigenetics data: there are additional tools that let you filter based on DNase hypersensitivity open regions, ChIP'd regions, and the like, to focus your attention on the key regions.

Now (sorry for that fast one, but that's a quick foray) I wanna take a few moments to introduce the discovery of patterns, and then we're gonna dive into a lab where you're actually gonna do this. So perk up and pay attention, because you're gonna need it in a minute to do the lab exercise, and then we'll get underway.

De novo discovery of transcription factor binding sites: you have some ChIP-seq experiment where you generated your high-throughput sequencing reaction, you pulled out the regions bound by your protein, and now you wanna know, what is the pattern that this protein sticks to? Can I generate a matrix for my protein? So: given a set of sequences, find the pattern that's enriched within those sequences. In large part, there are two primary types of tools that do this: string-based tools and profile-based tools. Depending on the species you work on, you may find more tools of one kind than the other. I'm biased toward profile-based tools, but string-based tools have been proven to be similarly effective. Basically, in a string-based tool, you're looking for overrepresented oligomers: you look at every possible combination of letters, determine the enrichment of each string of letters in the foreground versus the background, and assign a p-value to that pattern. In a profile-based system, rather than looking at each possible string of letters, you're looking at a matrix-based model, which is a quantitative representation. So the difference here is really whether you look at your data through a word or through a matrix. I'm gonna go into each of these in some more detail.
Okay, and then we have the issue of assessing what's the right pattern. So let's take a look at string-based methods. These are the oldest, and they've been given renewed strength by advances in memory and compute power. Basically, you can take a string of a certain number of characters and test every possible string; and you can now do it not just with As, Cs, Gs, and Ts, but also with IUPAC degeneracy codes, so you can use Rs and Ws and the like, to ask which patterns show up more in the foreground than the background. For instance: how likely is it to find X occurrences of a word in a set of sequences, given the background? It's the same foreground-background concept we talked about before. So you say: here are my sequences; let me count how many times I've seen each pattern.

The first thing you need is some sense of background. For the background, you say, okay, here's the type of sequence I'm looking at. For instance, if you're working in yeast, you might take all the promoters in the yeast genome and ask, what's the background frequency of each possible word? So you'd say: across all yeast promoters, I find TTTTTTTT 57,788 times. Now, if you find TTTTTTTT in large numbers in your ChIP-seq data set, you may not be so impressed, because it occurred 57,000 times in yeast promoters anyway. Whereas if you find an equal number of AAACCTTT in your ChIP-seq data, well, you know that in the background it was only there 456 times. So finding a large number of those in your data set is gonna be more meaningful than finding an equally large number of the first word, because you know its background is high. I see a couple of confused looks. Essentially, it's a lookup-table process: for each pattern (AAAA, AAAC, AAAG, AAAT, and so on), how many times do we find that pattern in the background? Then you do the same thing with your data set: how many times did I find it? And then you calculate a p-value: you have a measure of the variance of the pattern counts, and you calculate a Z-score, essentially how many standard deviations you are away from the background frequency of the pattern in your data set.

Okay, why am I not a big fan of string-based methods? We lose the quantitation. Transcription factors show strong biases for certain patterns, and we lose that characteristic when we use string-based methods. If you use an IUPAC code like a W, you're saying there's 50% A and 50% T; but for the real transcription factor, you may see that it's 95% A and 5% T. So you have a stronger pattern to look for when you work with quantitative approaches. That said, string-based methods have been extremely popular in the study of microRNAs and RNA-binding proteins: most of the tools used for microRNA analysis are string-based. So if that's your world, string-based methods are probably what you're facing. There are a couple of links to those here.

Okay, so let's look at a matrix method, in particular probabilistic methods, and then I'll show you the expectation maximization variant that you're gonna use in the exercise in a couple of moments.
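Before the matrix methods, here is the string-based counting from a moment ago in minimal code form. The toy foreground and background sequences are invented, and the simple binomial variance is a simplifying assumption; real tools use more careful background models.

```python
import math

def count_word(seqs, word):
    # Count overlapping occurrences of `word` across a set of sequences.
    k = len(word)
    return sum(sum(1 for i in range(len(s) - k + 1) if s[i:i + k] == word)
               for s in seqs)

def word_zscore(word, foreground, background):
    # Background rate per scannable position, binomial expectation and
    # standard deviation in the foreground, then a Z-score: how many
    # standard deviations the observed count is away from chance.
    k = len(word)
    fg_pos = sum(len(s) - k + 1 for s in foreground)
    bg_pos = sum(len(s) - k + 1 for s in background)
    p = count_word(background, word) / bg_pos
    expected = p * fg_pos
    sd = math.sqrt(fg_pos * p * (1 - p))
    return (count_word(foreground, word) - expected) / sd

# Hypothetical ChIP peak sequences versus promoter background.
fg = ["TTAAGTCAGG", "CCAAGTCATT", "GGAAGTCAAA"]
bg = ["ACGTACGTAC", "TTTTTTTTTT", "AAGTCATTTT", "GCGCGCGCGC"]
print(f"z(AAGTCA) = {word_zscore('AAGTCA', fg, bg):.2f}")
```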
Okay, so with the matrix-based methods, we want to find a local alignment of a set of sites that maximizes information content, or some other related measure, in a reasonable time. Almost all of these methods are either expectation maximization or Gibbs sampling, and I'm going to walk you through a Gibbs sampling version and then tell you how EM is a simplification of it. In general, the profile-based methods can look for longer patterns than string-based methods, because each character you add to a string multiplies the compute and memory requirement by four. Another useful piece is that you can place certain influences on the process, which you're not going to use here. So here's the problem we'll work through. Imagine you have a set of sequences that have come from your ChIP-seq experiment, and you're trying to discover the pattern within them. You're tracking the sites you've called within the sequences, the locations of those sites, and the sequences themselves; that's the data set the method works with.

In a Gibbs sampling method, the first thing we do is guess where all the sites are in the sequences. That sounds silly, but we just guess; there's no truth to it at all. We say one site here, one site here, one site here, placed randomly. Then we build a model based on the sites we guessed, and the model is mostly going to be junk. Then we take one sequence, throw out the guess for it, and score that sequence with the model we've created, so now we have matrix scores along that sequence, like we talked about before. That junk model is going to be noisy: there will be peaks in several places and not a very strong pattern. Then we choose a new site within the sequence we're working on. You can choose the site probabilistically, based on the area under the curve of the peaks, or you can choose it EM-style, where you just take the best peak. If you're doing a full EM system, rather than guessing the initial positions, you can take some base string, the most abundant string, as your initial guess. Now, what happens is that most of the time you wander in the forest: you're bumping into nothing and not getting anywhere. But occasionally you pick up a site that's a real site, one that's enriched in the sequence set, and that little bit of bias in the matrix makes it more probable that on the next sequence you work on, you'll pick something similar. And once two sites in the sequence set look like it, the matrix gets more specialized. So while it wanders around for a long time without getting anywhere, every so often it catches a couple of real sites, and then it zooms right in on the pattern it's looking for.
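Here is a bare-bones Gibbs motif sampler that follows the loop just described. It assumes one site per sequence and a fixed motif width; real implementations add background models, phase shifts, and many random restarts, so treat this purely as an illustration.

```python
# A bare-bones Gibbs sampler for motif discovery (illustrative only).
import random

BASES = "ACGT"

def profile_from(sites, w, pseudo=0.5):
    """Column-wise base probabilities from the currently guessed sites."""
    prof = [{b: pseudo for b in BASES} for _ in range(w)]
    for site in sites:
        for j, b in enumerate(site):
            prof[j][b] += 1
    for col in prof:
        total = sum(col.values())
        for b in BASES:
            col[b] /= total
    return prof

def sample_position(seq, prof, w):
    """Choose a new site start, weighted by how well each window fits the profile."""
    weights = []
    for i in range(len(seq) - w + 1):
        p = 1.0
        for j in range(w):
            p *= prof[j][seq[i + j]]
        weights.append(p)
    return random.choices(range(len(weights)), weights=weights)[0]

def gibbs(seqs, w, iters=2000):
    pos = [random.randrange(len(s) - w + 1) for s in seqs]    # random initial guesses
    for _ in range(iters):
        k = random.randrange(len(seqs))                       # hold one sequence out
        others = [s[p:p + w] for i, (s, p) in enumerate(zip(seqs, pos)) if i != k]
        pos[k] = sample_position(seqs[k], profile_from(others, w), w)
    return [s[p:p + w] for s, p in zip(seqs, pos)]

random.seed(1)
seqs = ["ACGTTGACGTCATTTACG", "TTTGACGTCAGGCAAAAT", "CCCCTGACGTCATTGACC"]
print(gibbs(seqs, w=8))    # with luck, converges on the shared TGACGTCA box
```

The "wandering in the forest" behavior is visible here: most iterations shuffle junk, but once a real site enters the profile, the sampler locks on quickly.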
And so when you run these probabilistic methods, what you'll see is that you launch them 10,000 times: you start them 10,000 times, run each for a relatively small number of cycles, and each run stabilizes on a different pattern; you keep the strongest patterns it stabilizes on. With expectation maximization you lose that random step, because you choose the best initial seeds, and it gives you the same result every time. Some people prefer that, but it doesn't generate the variability through which weaker patterns come through. So here it is once more: take a bunch of sequences, randomly paste binding sites onto them, generate a matrix model from those binding sites, remove one sequence from the set, erase the binding site that was used there, score the sequence, choose a position within it to take the binding site for that sequence from, return it to the pool, and do it over and over and over again. I think you have a sense of how that's working. These types of methods are guaranteed to return the optimal pattern if repeated sufficiently often. Of course, we may not repeat them sufficiently often, because we don't want to spend that much time on them, but by and large they do very well. You run them many thousands of times to avoid local minima, and we are constrained: if the transcription factor binding site, or whatever pattern you're looking for, is not strongly enriched in the data set, you're not going to get it. If your data are lousy and noisy, you're not going to get a magical pattern back. If your data are clean, almost any of these methods will work. If your data are somewhere in between, you'll have to optimize the process to get the best results.

Now, you've all got this in your handout, so it's not a secret, but the usual way to test how these methods are doing is to add noise; I've put a little sketch of that spike-in idea below. Here you can take a binding site for a fairly specific type of transcription factor, test it with increasing amounts of flanking sequence, and see how far you can go before you lose the ability to recover the pattern you started with. The unfortunate reality is that by the time you add about 500 base pairs of flanking sequence, your capacity to pull out the pattern is pretty far gone. So the discovery of transcription factor binding site patterns is not hugely tolerant of noise; you need your data sets to be relatively clean. When we were doing ChIP-chip experiments, it was a mess, because ChIP-chip studies didn't focus the data very well and you'd end up with very large regions. ChIP-seq does a much nicer job, and so while some peaks are very large, we get a relatively small high-confidence zone around them, which is quite good. I'll show you a figure a little later to convey that, after you've gone through the MEME exercise. Okay, so how do you improve sensitivity? Better background models help. You can use conservation if you so choose. There are some advanced methods using combinations, which I'll mention tomorrow in our systems approach. You can constrain the analysis type, but the primary focus now is ChIP-seq data sets, and so almost everything you see is going to be on ChIP-seq data sets.
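Here is a small harness for the spike-in benchmark just described: plant one real site in increasing amounts of random flanking sequence and ask whether a discovery run still finds it. `discover` stands for any motif finder (for example, the `gibbs()` sketch above); the site and flank lengths are made up for illustration.

```python
# Spike-in benchmark sketch: recovery of a planted site as flanking noise grows.
import random

def spike_in(site, flank, n_seqs=20):
    """Build n_seqs sequences, each a real site buried in random flanks."""
    make_flank = lambda: "".join(random.choice("ACGT") for _ in range(flank))
    return [make_flank() + site + make_flank() for _ in range(n_seqs)]

def recovery_rate(found_sites, true_site):
    """Fraction of reported sites that match the planted pattern."""
    return sum(s == true_site for s in found_sites) / len(found_sites)

random.seed(7)
for flank in (10, 50, 250, 500):
    seqs = spike_in("TGACGTCA", flank)
    # found = discover(seqs, w=8)             # plug in your motif finder here
    # print(flank, recovery_rate(found, "TGACGTCA"))
    print(flank, "->", len(seqs[0]), "bp per sequence")   # noise grows with flank
```

By the 500 bp case, each planted 8-mer is hiding in over a kilobase of random sequence, which is roughly where recovery falls apart in the slide.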
Okay, so our focus for the next period of time is motif over-representation. In the morning section we talked about motif discovery, where we didn't know anything about our motif: we just had a bunch of sequence and wanted to see what new patterns came out of it. But as the study of regulatory sequences in the genome moves along, we increasingly have an idea of what most of the transcription factors are, and a better and better idea of what their binding sites look like. In ten years' time we really shouldn't have to discover patterns from scratch; we should be able to just use known patterns for our analysis. So a variety of tools have been created that, instead of looking for new patterns, ask whether known patterns are enriched. We're going to take a look into that world: inferring regulating transcription factors for sets of co-expressed genes. The input here can be gene names, which we'll work with, so you can take your gene list and plug it in. Or you may choose to restrict your attention to certain regions of genes: for instance, you have epigenetic ChIP data, DNase hypersensitivity, FANTOM enhancers, or other data types that lead you to focus on specific regions. Most of what we'll talk about is gene-name focused, and then we'll do an exercise with some of the other inputs as well.

So in this context, we have some co-expressed genes and some sort of background. Just like before, there's a foreground and a background, and we're asking whether there's something different about the genes in the foreground compared to the genes in the background. What you're hoping to see is that for some known motif, we find lots of predicted binding sites in the foreground genes and not so many in the background. Now, as I told you earlier today, it's not going to look like this, right? These binding site models make predictions all over the place, so you're going to have an awful lot of noise in this process, and it's going to be much more a matter of statistical enrichment than the obvious visual enrichment you're looking at on the screen. Okay, you're going to hear a lot about GO term over-representation analysis tomorrow and the next day, and in those conversations you'll be given an introduction to the underlying statistics used for these types of analyses. I'll mention them briefly here, but you'll get them tomorrow in more depth than you probably desire. When we do this, there are really two distinct statistics in play; there's a third I'll mention for the sequence-based methods, but for the gene-name-based methods there are two major ones. The first simply asks whether there are more genes with the pattern in the foreground than in the background, and that works when you have relatively clean data, a relatively strong motif, and you're focusing on relatively short pieces of sequence next to the genes.
But many factors are not that specific, and if you're looking at slightly longer sequences you'll get chance hits on large numbers of them. So the other statistic we use is a measure of enrichment: how many total binding sites are in the foreground set as opposed to the background set, converted to a rate per base pair, so that longer sequences in one set don't outweigh the other. What does that do for you? Strong patterns might pop up best with one measure, and weaker patterns with the other. For the number of genes we use a Fisher exact test, essentially a classic two-by-two table test based on the hypergeometric distribution, and for the number of occurrences we calculate a Z-score in standard deviation units; there's a minimal sketch of both statistics at the end of this part.

Okay, so I'm going to describe the oPOSSUM tool, the tool my lab created; not because it's the best tool in the universe (which it is), but because it's the tool I understand better than anything else. That said, when I tried it a few minutes ago the oPOSSUM web server wasn't working and the lab staff were trying to bring it up, so we may end up using PscanChIP, another tool of a similar nature, and you'll also find these types of enrichment analysis tools available on other systems. They all generalize to the same concept; it's just a matter of which interface you like better. So what does it do? It takes as input your set of genes, goes to the Ensembl database, and pulls out the sequences. It then scans those sequences; the base oPOSSUM system is a phylogenetic footprinting system, so we focus on conserved regions. We align the human and mouse sequences, or in this case actually use the available multiple-sequence alignments, calculate a conservation score, and focus on the conserved regions. We scan with a database of transcription factor binding profiles, taking all the JASPAR matrices we can get, and ask how many times we see binding sites and where we see them. Then we calculate a statistic of significance and return a ranked list of potentially mediating transcription factors. The results you get will differ depending on which statistic you use; each statistic favors a certain type of transcription factor more than another, so in general it's good to look at your regulation a couple of different ways rather than expecting a single ultimate ranked list from the tool. This is an analysis with a couple of reference gene sets: a set of skeletal-muscle-specific genes, where the top hits are classic skeletal muscle transcription factors, and a set of hepatocyte-specific genes, where the top hits are classic hepatocyte transcription factors. So when you give it a perfectly clean gene list, it does a beautiful job. Like most tools, when you give it noise it gets worse and worse as it goes along, so it's really a matter of how far you can go and still get things you like.
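Here is the promised minimal sketch of the two statistics, assuming you already have per-gene hit counts and total sequence lengths for foreground and background. SciPy's Fisher exact test stands in for the hypergeometric machinery; the Poisson-style Z-score is a simplification of what real tools compute.

```python
# Two enrichment statistics: gene-level Fisher test and site-rate Z-score.
from math import sqrt
from scipy.stats import fisher_exact

def gene_level_fisher(fg_hits, bg_hits):
    """Fisher test on how many genes have at least one site (2x2 table)."""
    table = [[sum(h > 0 for h in fg_hits), sum(h == 0 for h in fg_hits)],
             [sum(h > 0 for h in bg_hits), sum(h == 0 for h in bg_hits)]]
    return fisher_exact(table, alternative="greater")[1]

def site_rate_z(fg_hits, fg_bp, bg_hits, bg_bp):
    """Z-score on the per-base-pair rate of sites, foreground vs background."""
    rate_bg = sum(bg_hits) / bg_bp                  # expected sites per bp
    expected = rate_bg * fg_bp
    sd = sqrt(expected) or 1.0                      # Poisson-style deviation
    return (sum(fg_hits) - expected) / sd

fg = [3, 1, 2, 0, 4]        # sites per foreground gene
bg = [0, 1, 0, 0, 0, 1, 0]  # sites per background gene
print(gene_level_fisher(fg, bg))            # proportion-of-genes statistic
print(site_rate_z(fg, 5000, bg, 7000))      # occurrence-rate statistic
```

Notice that one gene with many sites can inflate the Z-score without moving the Fisher test, which is exactly the bias discussed in the next part.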
We've been representing this more and more using these types of plots, where you see the two different metrics along the two axes, so you can plot your results with the Fisher P-value along one axis and the Z-score along the other. In this case the optimal positions are up toward the left. Now, what we ran here were three different data sets, so it's a little confusing: the green one was related to NF-κB regulation, and you see NF-κB profiles coming out; the pink one was related to hepatocyte regulation, and you see hepatocyte regulator profiles; and the blue ones were related to muscle, and you see muscle-related profiles. So there are three different results on the same plot. How would you interpret a point that's significant on only one axis? You might see a Z-score that is striking without a strong Fisher test result if one gene gives you a saturating number of binding sites for a profile: for instance, one of your sequences just has the binding site over and over and over again. The total count in your data set would be very large, but the profile would not show up as enriched across all the genes you're looking at. Those types of biases can push you into that corner. The opposite case you don't see so much, and I'm not sure where it would come from; I'd have to think about it, but it might be an extremely high information content profile, meaning it makes almost no predictions, so there wouldn't be many sites, but the fact that any one gene had one might be enough to skew the results. We'd have to try it out and see if we could skew the data to get there.

Okay, so the oPOSSUM server that we'll use in a little while, hopefully, has a bunch of different flavors to it, and we're going to try a very simple version of it. Because it relies on a bunch of pre-computed things, you choose the organism you're working on, since there's a database behind it, and then there's a series of options in terms of types of analysis. We're going to focus today on single site analysis, which means we look at one pattern at a time; each pattern is treated in isolation. Anchored combination site analysis means you say there's one factor you already know is interesting, and you look in the vicinity of the binding sites for that factor to see if you can find anything else. For instance, in the sequence-based version you might have ChIP-seq data where you already know which transcription factor the antibody was against, and you want to see if a binding site for a different transcription factor shows up near your bound regions. The TFBS cluster analysis and the anchored cluster analysis are both based on searching for sets of transcription factors that show up as enriched together, so you don't necessarily have to start with one you're already interested in. Both of those are computationally extraordinarily slow, and if we run them we'll crash the system completely today, so we'll focus on the first two. Is the sequence-based analysis for all other organisms?
Yeah, so sequence-based means that you provide the sequence, so it's agnostic about organisms; it doesn't care what organism you have, you just give it sequence and it runs on it. And what about the matrices for the transcription factors? You're constrained by the databases of transcription factors. Within the system here we're using the JASPAR database of profiles, and in that database human and mouse are basically a shared set, so we treat the human and mouse factors from the same database. There's a small set of fly profiles and an even smaller set of worm profiles. Does it run against all the transcription factors, or just one? For human and mouse it goes against the vertebrate transcription factors; for the sequence-based analysis you define which subset you want to work with. We'll try the sequence-based analysis in the integrated assignment today. Good questions.

Now, we talked a little about backgrounds earlier, in motif discovery, and I'll tell you that the background matters just as much when you're looking at motif enrichment. What you're seeing here is the same foreground data set with three different background data sets. Plotted along the Y axis is the Z-score, one of those two statistics, and along the X axis is the GC composition of the profile: did the logo have a lot of GC or a lot of AT in it? This panel has an elevated GC content in the background, and when you have a lot of GC in the background, GC patterns don't seem as important, but AT patterns become extremely important. When you have the converse, a background with low GC content overall, you see the same characteristic in reverse. So what you really want is a matched background, so that the background you're using is consistent with the foreground you're using. That has been facilitated by some tools in the system, so there's now an option to generate matched backgrounds, and you can take the matched backgrounds out of the oPOSSUM system and use them elsewhere if you want them. Most tools don't do anything to correct for the background. So imagine you ran this analysis with this set as your foreground and a background that looked like this. This was a ChIP-seq experiment against a transcription factor called NFE2L2, also known as NRF2, and the best-scoring profiles were this one, this one, and this one, because they were AT-rich profiles against a GC-rich background; the profile we're actually looking for gets buried fourth in the list. Likewise, you can see a skew here, but in this case the NRF2 profile was sufficiently far above the rest that it would still come out as heavily enriched against the background. Now, what's key here is that there's a secondary factor that also acts on a subset of NRF2 binding sites, called AP-1; it's a common factor known as c-Jun, the c-Jun/c-Fos complex. And what you see is that if you run the AP-1 motif against a high-GC background, it's so far down the list that you're never going to get there.
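Here is a sketch of one simple way to build a GC-matched background: bin candidate background sequences by GC content and, for each foreground sequence, draw a background sequence from the same bin. The bin width and the sequences are illustrative; oPOSSUM's actual matching procedure may differ.

```python
# GC-matched background selection by binning (illustrative).
import random

def gc(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)

def matched_background(foreground, candidates, bin_width=0.05):
    bins = {}
    for c in candidates:                              # index candidates by GC bin
        bins.setdefault(int(gc(c) / bin_width), []).append(c)
    background = []
    for f in foreground:                              # one matched draw per fg sequence
        pool = bins.get(int(gc(f) / bin_width))
        if pool:
            background.append(random.choice(pool))
    return background

random.seed(3)
fg = ["GCGCGCATGC", "ATATATGCAT"]                      # one GC-rich, one AT-rich
cands = ["GGGCCCGCGC", "AATTATATTA", "ATGCATATTA", "GCGGCCGCAT"]
print(matched_background(fg, cands))
```

With a matched background, an AT-rich profile no longer looks enriched just because the background happened to be GC-rich, which is the NRF2/AP-1 problem on the slide.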
When you run against this background, it's still pretty far down the list, and you might or might not read far enough down to give it any attention. But after you get the background straightened out, balanced, and corrected, you'll see it emerge high enough above the noise to pull it out. This type of problem is common in the analysis of enrichment of regulatory sequences, and the reason is that we have a great disparity in CpG composition around gene promoters. Because of that diversity, certain sets of genes will have very different characteristics than other genes. For instance, how many of you have heard of CpG islands? Does anyone want to tell me what a CpG island is? Good. So CpG dinucleotides are targets of methylation systems. What happens to CpG dinucleotides in most of the genome? Yes: in most of the genome, outside of immediate promoter regions, CpG dinucleotides are methylated. In a subset of promoter regions they're protected and carry no methylation, which keeps them open and accessible. Now, over evolutionary time, over vast periods, methylated CpG dinucleotides have a tendency to mutate, so you'll get a C converting to a T. What happens is that you selectively lose CpG dinucleotides across most of the genome: you're eliminating them over evolutionary time. In promoter regions that are active, you have a tendency to protect your CpG dinucleotides, so CpGs appear there at a higher frequency than in the rest of the genome. The biggest common misperception is that these regions actually gained CpG dinucleotides; what you really have is a reduced tendency to mutate CpG dinucleotides away in promoter regions. I'll put a tiny sketch of the classic CpG observed-to-expected calculation below. Now, genes that are narrowly expressed late in differentiation, genes that turn on as tissues mature, along with the regulatory sequences used as tissues mature, tend not to have such strong CpG island characteristics. The reason is that the genome as we see it is established in the germ cells, which are set aside in roughly the first few divisions of the developing embryo. So sequences that turn on late in the process are methylated in the early phases and accumulate mutations as fast as the rest of the genome. That's a long-winded way of saying that when you get into promoter analysis, CpG dinucleotides really impact human and mouse studies, because the genes most of us care about are involved in some tissue, or some response, or something where they turn on at a specific time later on. So we have much greater diversity in CpG content than we might expect.

Okay, a challenge comes with this. Just because a transcription factor name shows up attached to a profile in your list, that carries no real information about whether that particular transcription factor is acting on your set of genes. The reason is that, with some exceptions, most transcription factors in the same structural class bind very similar sequences. So if you see a pattern like TGACTCA, it could be Jun, which is an AP-1 transcription factor; it could be JunB, or JunD, or Fra, or Fos; some of the ATFs bind there; and so on and so forth.
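Here is the promised sketch of the classic CpG observed/expected ratio: in most of the genome the ratio is well below 1 because methylated CpGs mutate away, while in protected promoter regions it stays closer to 1. The sequences are toy examples.

```python
# CpG observed/expected ratio, the standard CpG island statistic.
def cpg_obs_exp(seq):
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    observed = seq.count("CG")                      # CpG dinucleotides actually seen
    expected = (c * g) / n if c and g else 0        # expected if C and G were independent
    return observed / expected if expected else 0.0

print(round(cpg_obs_exp("CGCGATCGCGTACG"), 2))      # CpG-retaining, island-like
print(round(cpg_obs_exp("CATGCATGCATGCA"), 2))      # similar GC content, CpGs depleted
```

The key point is the second line: a sequence can have plenty of C and G overall yet almost no CG dinucleotides, which is what most of the mutating genome looks like.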
They're all leucine zipper class transcription factors, they all have very similar binding motifs, and the fact that the motif was enriched doesn't tell you which member of the group is the one potentially acting. So you have to abstract your list to the context you're looking at. If you run your enrichment analysis, get your results back, and see this profile at the top, the move is not to run off and stick it in your paper as the factor acting on your system. It's to say: okay, that family of factors could have a role here, and now I have to figure out which factor is likely to be active in the cells I'm looking at and could plausibly be mediating my response. So you return to your expression data and your knowledge of the system and see which of the candidate factors are actually there. There's a massive exception to this, which is the zinc fingers: essentially each has its own unique binding characteristic, so if you hit on a zinc finger transcription factor, you have a reasonable chance that the factor named on the profile is the one you're seeing. This is just an example from the ATF family of transcription factors: seven profiles for distinct members of the family. If you reverse complement this one, you'll notice it looks an awful lot like all the rest, and this is the common essence of the ATFs, merging them all together to show what's shared. There's been some work on building classifications of transcription factors and organizing them into hierarchies, so people can go back, work their way through the transcription factor sets, and see which factors group together; that's a report on all the ATF-related factors in the system. So if you hit on an ATF profile, you really need to do some extra work to figure out which member is doing the mediating.

Okay, so what have we covered so far? There are tools to help interrogate the meaning of observed clusters of co-expressed genes. I didn't say this before, but I'll tell you now: this is not perfection, and if you have noisy, lousy data, you're not going to overcome it with this type of analysis. What's noisy, lousy data? It's saying: I'm going to throw a growth factor on my system and in 14 days I'm going to come back and see which genes are expressed. There you're looking not at a primary response but at secondary, tertiary, and quaternary responses layered on top of each other. When you're looking for regulatory cues and regulatory signatures, you generally want to look at relatively short periods of time after some sort of action. If you're looking at differentiation, you'd like as homogeneous a system as you can get, so you're comparing things that are more or less equivalent to each other. If you're looking at an activated response, where you're treating your cells with something, you'd like relatively short time periods; what that means depends on the system, of course, and I've seen everything from an hour to 16 or 24 hours be successful in these types of analyses. But once you go out multiple days, you generally have too much noise in your system to pull things back.
Okay, so we're transitioning into regulatory networks. Regulatory networks is a gigantic field, and we're not going to get to the most advanced methodology in this class, because mostly it requires programming access to do it. What we're really going to focus on here are two pieces. One is looking at how sets of factors act together, so we're going to look to see whether we can find multiple factors contributing to a regulatory control system. The other is asking what the commonalities are among genes subject to the same regulatory program: how do we take a set of genes sharing a regulatory signature (in this case sharing a ChIP domain, though the same methodology applies to co-expression, as you'll see later in the day) and find what connects them? So we're going to seek insights into networks through the analysis of regulatory sequences. I've mentioned the two major concepts we're going to poke at, but once again: cooperativity of transcription factors, meaning multiple transcription factors acting together, and identifying biological networks that may be associated with genomic locations. We'll try to cover the basics in this section and the applied side in the lab steps.

A few follow-ups from yesterday, to cover a couple of details. There were a few questions related to the scoring statistics; you're going to get these again from Quaid in terms of the different statistics used, but loosely, the Z-score you were using yesterday compares the rate of occurrence of a binding site in the target set to the rate of occurrence in the background genes, and the Fisher score compares the proportion of genes containing a binding site in the target set to the proportion in the background set. So one is about how many binding sites there are, and the other is about how many genes have binding sites. Question: sometimes, if your background set doesn't contain one of your target-set genes, the single site analysis gives you an infinite Z-score or something that doesn't make sense; is it worth rerunning the analysis if you're trying to be comprehensive? There are two ways you end up in that situation. One is an extreme profile, and those are very hard to make sense of in these types of tools, meaning a wide, strong pattern that you only see once every 50,000 base pairs or so. The other contributor is a very small set of sequences being analyzed. In general, you can try rerunning it: you could lower your threshold slightly and see what happens with those ones. But my general statement would be that you're probably not going to get reliable data out of it, so I'd probably just skip them.

Okay, there were also some domain-specific requests for information, from people interested in doing regulatory analysis in a couple of different species. There was a question on insects, and there's a tool I quite like for flies called i-cisTarget. It's a package developed by Stein Aerts in Belgium; it's software you download, but it runs very easily.
It's nicely maintained, it lets you do a lot of the same sorts of things we were doing online, but for fly analysis, and it's just a very convenient tool to use. The other request was an interest in bacterial work. RegulonDB is probably the resource to be aware of out there; it seems to be the richest one for the microbial world. For regulatory analysis, the real workhorse around RegulonDB is a package called RSAT, the Regulatory Sequence Analysis Tools. RSAT can be applied to any species, but it incorporates the information in RegulonDB for binding profiles. So whether you're on bacteria or fly, there are a couple of resources for you, and if you didn't ask me yesterday and you have a particular domain that was underrepresented, make sure you ask me today and I'll look it up and put some information up for you on the system.

There was also some conversation yesterday around transcription factor grouping: this idea that you get a signature pattern that says, okay, it looks like it might be a forkhead transcription factor, but then what are the forkhead transcription factors? There are a few places to go for more information about transcription factors. The most relevant for that particular problem is a system called DBD, for DNA Binding Domains, which is available at transcriptionfactor.org. That's an effort to curate all the DNA-binding transcription factors into subclasses. It was originally developed for flies but has since been extended to essentially all species with an Ensembl genome. Its maintenance comes and goes; I believe it's run by Sarah Teichmann in Cambridge, and I think she's working on a new release, so I'm not sure exactly where it stands, but it's coming along. TFe is a wiki project (this one I run) with individual little review articles written by experts in the field. I noticed the system was down this morning because we're transitioning servers, so it'll be up again shortly if you want to log in. And then Factorbook is the ENCODE-related project, with the profiles and information generated out of the ENCODE project, where you can find a bit more. So that's three additional resources for transcription factor information.

Okay, one of the things emerging out of the ENCODE project, and really where the state of the field is right now, is how we incorporate all these layers of information into the same analysis tool. Ideally you'd like to say: I have epigenetics data, I have DNA accessibility data, I have DNA-binding protein data, I have co-activator data, I have polymerase complex data, I have conservation data; can I bring all of those pieces together and get an interpretation of which regions of the genome are open, accessible, and likely to function as regulatory regions? It is early days for those types of tools. They are not friendly; these are not polished web interfaces. They depend on the data that's available, and different tools will have incorporated different pieces along the way. So we're not going to do a lab on these, because this is the bleeding edge and things are in the early stages.
But if you're interested in those types of things, these are the two better tools out there for incorporating this information. They are related; in fact, some of the authors appear on both papers. The general idea is: can you predict active regulatory regions in a given cell or tissue based on integrated analysis of diverse genome-scale data? The two tools I've highlighted here are ChromHMM and Segway, and they are ultimately about segmenting a genome into different classification groups. They'll say: here's a whole genome sequence, and I'm going to assume there's a series of states within it. A state might be a promoter region; a state might be a coding exon; it might be an intron; it might be a distal enhancer; and so on and so forth. You'd like to classify everything into those different states for a given tissue. So both of these are segmentation-type tools, or classifiers, in that way. They principally differ in the underlying statistical methodology: ChromHMM, as the name suggests, uses a hidden Markov model for doing its work, and the Segway system uses something called a dynamic Bayesian network. You probably don't need to know the details of the underlying methods, but ultimately they're trying to deal with the fact that you have incomplete data. Sometimes you have information in one region and not in others, and the big challenge is to say: this is the information I have at this particular spot in the genome, so how do I bring it together? You can go and look at those tools if you want to; they're downloadable, and in theory you can run and train them. I haven't heard of anybody outside those groups successfully installing and running them yet, mostly, I think, because they're so new and people are still finding their way to them. But I think these are the two best ones out there right now if you want to bring a bunch of data together and really push the envelope.

Question: the ChromHMM chromatin-state maps for a bunch of ENCODE cell lines are available as a track in the genome browser? Correct, I think that's in the browser tracks, and they also have attached reference pages with the pre-computed segmentations for a few different sets. Because they're ENCODE-based, the richest sets are the tier one cells generated in the ENCODE project: K562 cells, the lymphoblastoid cell line that was used extensively, and a third set; I'm not sure offhand whether the third one is HeLa. They do a few of the other cell lines as well, but those three are where you really have abundant data. It's very human-centric, so other species will come as we move along. If you're interested, though, you can read about them just to get a sense of where things will be in about three years' time, because these will eventually mature into tools where you say: give me a genome, classify it into segments, and give me the segments involved in hair follicle cells or the like. Okay, so I've mentioned that these are segmentation-style systems with subclasses.
You're trying to use data properties to subset the genome. Neither system really works from training data in the classic machine learning style, where you'd say: here's a bunch of promoters, learn to find promoters. Mostly these tools segment and classify the genome into groups, then look at the groups and ask what their characteristics are and whether they correlate highly with known features. For instance, they'll say class 23 has these segments in the genome, and all the promoters I know about fall into class 23, so I'm going to call class 23 my promoter class. The attachment of meaning to the classes usually occurs after the training of the methods. They require specific data; some are more flexible than others, but basically there are certain classes of data they're tuned for and ready to take in. They were largely generated on ENCODE projects, and it's still a work in progress until it becomes widely usable outside ENCODE. So that's the outer edge of the space, and it's what you may see coming along in the next couple of years.

Okay, now one of the things we're going to do in our exercises today is try to find more meaning in our ChIP regions, so we can take a little more out of them. One of the tools we'll look at in a lab section today asks whether we can infer pathways, networks, and gene processes from ChIP-seq data. The tool we'll use in that segment is called GREAT. It's a package developed at Stanford in Gill Bejerano's lab, and it takes as input a BED file (a tiny example of building one is below): you give it the coordinate locations of your regions of interest. Based on the criteria and parameters you select, it determines which genes are proximal to those regions; then, based on the proximal genes, it retrieves their annotations and tries to say which pathways and networks are associated with those locations. It's particularly good for people with epigenetic or transcription factor data highlighting relatively small subsets of the genome. It runs well, and it has nicely incorporated an enormous number of data sets and tools in its analysis: once it has the gene set, it funnels through an awful lot of different data sources to give you a series of reports. We'll do that as a lab exercise in a few minutes so you have a chance to take a look and see how it works. Anybody with ChIP data is well served to be aware of it and give it a shot. It takes a BED file as input and gives as output multiple enrichment measures. These are just a couple of screenshots (kind of fuzzy today) to give you a sense of the sorts of things it does: a report on where your regions are relative to transcription starts, how many genes are found in their proximity, and from there the enrichment analysis results for the network pieces. Okay, the other piece that we're going to do today, which we'll actually do first, so I should have swapped the order, is to look at transcription factor interactions.
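Since GREAT just wants a BED file of peak coordinates, here is the promised tiny example of preparing one. The peaks are invented; substitute your own (chrom, start, end) tuples from your peak caller's output.

```python
# Write ChIP-seq peak coordinates to a BED file for upload to GREAT.
peaks = [("chr1", 1_234_500, 1_234_900),
         ("chr2", 88_210_100, 88_210_450),
         ("chr17", 40_350_000, 40_350_600)]

with open("my_peaks.bed", "w") as out:
    for idx, (chrom, start, end) in enumerate(peaks, 1):
        # BED is 0-based, half-open: chrom, start, end, then an optional name column
        out.write(f"{chrom}\t{start}\t{end}\tpeak_{idx}\n")
```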
So we'd like to understand how sets of transcription factors act together in a system. In the human genome we have roughly 1,500 transcription factors, and far more than 1,500 contexts in which we want to have expression. The way the cell generates those different contexts is through interactive combinations of different transcription factors, as well as modification of the proteins in certain ways. Understanding the TF-TF interactions lets us take what might be a large set of target regions and focus on a smaller subset sharing additional characteristics. It's a way to take a big set and break it down into smaller sets where you can look for meaning that might be deeper. The oPOSSUM work you did yesterday really looks at co-occurrence, in the anchored oPOSSUM analysis. But there's actually increasing evidence that there can be spacing rules much tighter than the factors simply being somewhat near each other. In several contexts, people have observed a fixed physical relationship between where one site is and where another site is, which indicates some sort of direct physical interaction between the proteins. The best tool out there right now, in my opinion, for discovering those kinds of direct physical interaction distances is called SpaMo, and it's a tool within the MEME Suite, which you'll be familiar with from yesterday's exercise. It has a relatively decent interface, although I had to work with the developers yesterday because the motif handling piece was broken; it's now fixed. Thank goodness they're in Australia, so we get some extra hours in overnight. What SpaMo lets you do is look for precise spatial patterns between binding sites. You give it your primary factor, the factor you're primarily interested in (we'll use the same STAT1 data sets we used yesterday), and it goes through a whole database of binding profiles and reports the most statistically significant relationships it finds. The report gives you the two factors for which it found a significant score, up in the upper right, and then a nice visual plot showing where the co-occurrences sit in physical distance: it masks out, in the middle, the binding site for your primary factor, and then plots where the secondary factor's sites fall. Now, one thing you'll notice, particularly with STAT1, is that STAT1 has a palindromic site, so yesterday you saw a sort of double-peaked characteristic in the CentriMo plots. That's just because you're one base pair off, depending on whether you look at the forward or the reverse strand, in where the edges of the sites fall, so you occasionally see things that essentially overlap on the reverse strand. You'll have a chance to look at that on the system and get a sense of how to interpret it.
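To make the spacing idea behind SpaMo concrete, here is a sketch: given the center of the primary motif in each sequence and the positions of secondary-motif matches, tally how far the secondary sites sit from the primary. A sharp spike at one distance suggests a physical interaction. The coordinates are invented, and this ignores strand, which real SpaMo reports track.

```python
# Spacing histogram between a primary and a secondary motif (illustrative).
from collections import Counter

def spacing_histogram(primary_centers, secondary_hits, max_dist=150):
    """Counter mapping signed distance (secondary - primary) to occurrences."""
    hist = Counter()
    for seq_id, center in primary_centers.items():
        for pos in secondary_hits.get(seq_id, []):
            d = pos - center
            if abs(d) <= max_dist and d != 0:
                hist[d] += 1
    return hist

primary = {"peak1": 250, "peak2": 198, "peak3": 305}
secondary = {"peak1": [261, 40], "peak2": [209], "peak3": [316, 500]}
print(spacing_histogram(primary, secondary).most_common(3))   # +11 shows up 3 times
```

A uniform smear of distances means the two factors just co-occur loosely; a spike like the +11 here is the signature of a fixed spacing rule.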
I just want to tell you a few things about what's coming in the future, and give you a closing thought on what we've done and gotten out of the past day. Some things will transition over the next few years, so you know where things are now and where they're going. The weight matrix models we've been using here, these fixed matrix models, are still dominant; you see them in all the tools right now. But with the richness of ChIP-seq data, we suddenly have much better ways of looking at transcription factor binding sites, and those models are gradually going to be replaced by one of two classes. They'll either be energy models, where we think about the binding of the factors in terms of their total binding energy, or they'll be HMM models, which we use for protein analysis right now but not so much for transcription factor binding sites. The reason is that as we get richer data sets, we're finding that there is sometimes variable spacing between half-sites, and little edge characteristics that may relate to the bending of the DNA; you're getting much richer characteristics of the binding sites, and these weight matrices can't capture them. So right now the field is researching and working toward that next generation of tools. The main tools will probably stay PWM-based for the next three years or so as that transition happens, but it will come. There will be more integration of the diverse data types: I mentioned ChromHMM and Segway in the presentation, and those types of tools are being developed in massive numbers right now, so friendlier tools that take these data sets into account will emerge over the next 18 to 24 months, and you'll see a new generation of tools that deal with that. I mentioned the FANTOM project yesterday and the fact that later this year we're going to see the release of this massive promoter activity data, this sort of deep RNA-seq, with an extra layer of enhancer functionality coming into play. It's going to be another complementary data source to all the sorts of things we've seen to date, and it's going to have a pretty big impact. And there's a large community of people focused on the three-dimensional structure of the nucleus: how do we take chromatin conformation and the like into account in regulatory sequence analysis? That domain is growing very rapidly, and we're going to see an awful lot coming in the next few years. Some of the big challenges ahead: we're going to have to understand how all these different transcription factors work. Most of the data so far covers a few hundred of them, we've got 1,500 in the human system, and all sorts of species haven't been touched yet, so we have a lot of work to do. With whole genome sequencing, there's big interest right now in genetic variation in transcription factor binding sites, so there's going to be a lot of work on determining meaning for regulatory sequence changes, as opposed to exon sequence changes. And there are the challenges of integration, and of the transition from one model to the next.
OK, so what are the big highlights? If you're a little overwhelmed and wondering what you really got out of this section: you learned a bit about transcription factor binding profiles, so you should have a better idea of what those matrices look like, how they're generated (they're represented by an alignment of a bunch of sites), and that they're pretty good at predicting whether the protein will stick to a piece of DNA in vitro. But we also recognize that they're constrained, because those models take no account of chromatin: they have no idea what's accessible, so you have to combine them with other approaches to get at what's functional in a given cell. You tried pattern discovery: you took sets of sequences, put them into MEME, and recovered motifs from them, so if you have a ChIP-seq data set, or in some cases a gene list, you can go to MEME and discover a new pattern that's overrepresented within it. You have a loose sense from the presentation of how that works, and by and large you know it's looking at enrichment. You tried oPOSSUM and saw that, with databases of known patterns, we can measure their enrichment either in a set of genes given gene IDs or in a set of regions given sequences, which lets you take a set of genes and try to connect them to potential regulatory partners. You went further and looked at how you might study relationships between transcription factors: extending from a set of genes to identifying groups of transcription factors that can act together. And in the final segment, you saw that you could use ChIP-seq data for a set of genes to infer and identify potential functional relationships through the GREAT analysis. So I thank you all for your time, your attention, and your willingness to explore together. You asked great questions, many of you brought nice resources to mind, and I appreciate your engaged interest. I wish you very well for the rest of the day; I think you're going to have a great time with Quaid, Lincoln, and Gary. Thank you all.