Today I'm going to talk to you about how to use various tools to understand gene regulation, particularly gene regulation within a data set that you have. By the end of this workshop, you should be able to understand the challenges in predicting transcription factor binding and be able to identify binding sites for known transcription factors. In the lab, we will delve deeper into discovering transcription factor binding motifs in genomic regions with MEME-ChIP and other tools. I have a brief overview here, which should be in your notes, so let's just dive in. If you have any questions in the middle of the lecture, feel free to ask.
I'm going to start with an introduction to how transcription works in eukaryotic cells. Here's a really, really simple model: there is a particular transcription factor binding site in the promoter of a gene. Here the binding site looks like CCA, GGG, TAT, and there's a transcription factor that is shaped to interact with that particular site. That means there is a good binding energy between the transcription factor and the binding site, so the transcription factor is bound there. The transcription factor then recruits RNA polymerase II, which comes along and transcribes the DNA downstream of the binding site, beginning at the transcription start site.
There's actually a lot more to it than that very simple model. You can have a number of transcription factor binding sites within a proximal regulatory region, or promoter. There are also sites in distal regulatory regions, both upstream and downstream of the transcription start site. So there are a lot of different components to the taxonomy of transcription initiation. And remember, all we're talking about here is regulating the rate at which transcription is initiated.
Of course, gene regulation in general has many downstream components that can also be regulated, such as mRNA splicing, decay, other post-transcriptional regulation, and even post-translational regulation. The picture has also become more complicated because the textbook model of transcription factors influencing transcription at a promoter is slowly giving way to a more complex model. In that model the genome is folded around itself in three dimensions, and distal enhancer regions can have important effects when they are nearby in three dimensions, even if they are not very close along the one-dimensional coordinate of the genome. All of these things can affect how much transcription occurs at a particular place.
One nice thing is that we have now developed methods for measuring or observing these various elements that can affect transcription factor binding. We have RNA-seq, which I'm sure all of you are familiar with, but there are also techniques that specifically measure the five-prime ends of newly transcribed RNA, like CAGE, and techniques like GRO-seq. Then there are techniques for figuring out where transcription factors are bound and where histones have been modified in particular ways. We can do that with ChIP-seq, and we can do ChIP-seq at promoters, at transcription factor binding sites, and also at some of these distal regulatory regions. All of this can give us a more complete picture of what is actually going on in the regulation of gene transcription.
A lot of this data has been collected by large projects: the ENCODE project, which has literally thousands of data sets of these types, and the Roadmap Epigenomics project, which has many more. Those are both focused on human, with some mouse, and in the case of ENCODE there's also modENCODE for fly and worm data.
And then there's a lot more data that you can find in, say, GEO, from various smaller-scale experiments people have done with these techniques. So the take-home for this first part of the lecture is that transcription can be very complex, and I just briefly touched upon the fact that there are now many measurements or observations of its various complexities. In the next part, we're going to go back to a simpler model of how transcription factor binding works, and we're going to see if we can teach a computer how to find transcription factor binding sites. Before I go on, any questions or comments on the first part? OK.
As I showed you earlier, a transcription factor has a cognate DNA sequence that it likes to bind. In this case, we might represent the transcription factor's binding motif, the site it likes to bind, with this string: AGTTATGA. The thing about transcription factors is that their recognition code is degenerate. A sequence doesn't have to be exactly like the binding site displayed here for the transcription factor to recognize it. Binding is often not an on-or-off matter for a transcription factor. There will be lots of different sequences that the transcription factor will bind, some with higher energy and some with lower energy. Here you can see a number of individual binding sites for this imaginary transcription factor. You can see the degeneracy because the first position changes quite a bit, while the fourth position is always T, and so on.
You can represent this degeneracy with a sort of wildcard DNA alphabet, called the IUPAC ambiguous DNA alphabet, where you replace certain letters with other letters that represent their wildcard nature. For example, this T is always T, so the code says T.
This T gets replaced by a W, which stands for weak and means A or T. If you look in the fifth column, you'll see that it's almost always T, except down here where it's A. There are other codes: V means A or C or G, D means A or G or T, R means A or G, and so on. So there's a whole string of letters for people like me to memorize, and for you not to, because we have better methods for doing this. This method is a nice start, but it doesn't represent all of the data you have in a set of binding sites. For example, in this fifth column, I think every single row is a T except for the last row, which is A. A lot of information is thrown out if you just say this is A or T. In reality, we want to say that it's almost always T, and that the further you get away from T, the less likely the transcription factor is to bind.
So finally, I'll introduce a matrix model that we can use to understand how this transcription factor interacts with its motif. This is called a position frequency matrix. We have one column in this matrix for every column in our aligned binding sites, and a row for each letter in the alphabet. Then we simply count the number of times each symbol occurs in that column. So again, in the fourth column there are 21 Ts, because it's always T. In the fifth column there are 20 Ts and 1 A, and so on. You can use this to get a slightly more nuanced understanding of what the motif is.
This can also be transformed into a sequence logo, which I'm sure many of you have seen before. The sequence logo is a graphical representation of a matrix model for transcription factor binding. Let's look at the fourth column again: there's a T here. The T is taller than the other symbols because the height gives you the amount of information that you essentially get from seeing that particular symbol there.
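The counting step behind a position frequency matrix, and the column height used in a sequence logo, can be sketched in a few lines of Python. This is only an illustration: the four aligned sites are made up (they are not the 21 sites on the slide), the function names are hypothetical, and the information content assumes a uniform background with no small-sample correction.

```python
import math

ALPHABET = "ACGT"

def position_frequency_matrix(sites):
    """Count how often each base occurs in each column of aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    pfm = {base: [0] * width for base in ALPHABET}
    for site in sites:
        for column, base in enumerate(site.upper()):
            pfm[base][column] += 1
    return pfm

def column_information(pfm, column):
    """Information content of one column in bits (uniform background assumed)."""
    total = sum(pfm[base][column] for base in ALPHABET)
    bits = 2.0  # maximum for a four-letter alphabet
    for base in ALPHABET:
        p = pfm[base][column] / total
        if p > 0:
            bits += p * math.log2(p)
    return bits

# Made-up aligned binding sites, for illustration only.
sites = ["AGTTATGA", "CGTTATGA", "TGTTTTGA", "AGTTAAGA"]
pfm = position_frequency_matrix(sites)
for column in range(len(sites[0])):
    counts = {base: pfm[base][column] for base in ALPHABET}
    print(column + 1, counts, round(column_information(pfm, column), 2))
```

A column that is always the same base carries the full 2 bits, and a column where the bases vary carries less, which is why an invariant T is drawn taller in the logo.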
In the next position of the logo there's still a tall T, but it's a little less tall, because one out of every 21 times this transcription factor is known to bind even when there's an A there. So if you've ever seen one of these logos in an article or a presentation, it's really a representation of an underlying matrix much like this one. Any questions so far?
So that's one matrix model. There are actually a variety of matrix models, and the one used most frequently is called a position-specific scoring matrix (PSSM) or position weight matrix (PWM). There are a variety of formulations for a PWM or PSSM, so unfortunately just saying the name doesn't tell you exactly what it is, but generally it is something like this. We start with a position frequency matrix and transform it into something that can be used more universally and that includes a model of the background frequencies of the nucleotides within the genome we're interested in.
I apologize for the formatting of the formula on this slide; it's just a Mac-to-PC font issue, and it's no big deal. What we have here is f(b, i), which is the number from the position frequency matrix: b is a base, A, C, G, or T, and i is a column, here 1, 2, 3, 4, or 5. We take that and add s(N), which is a function of the sample size used to produce the PFM. Intuitively, suppose you do an experiment to figure out a transcription factor's binding motif and you get six samples, and from that you infer that this position is always A, and that the fourth position is C 80% of the time and T 20% of the time.
There's less information there than if you did this 100 times. If you did it 100 times, you might be able to tell whether this is really 80% and 20%, or whether it's really more like 99% and the rest is an occasional error. The s(N) term that you add here allows you to account for that when creating the position-specific scoring matrix. The second thing you do is divide by the nucleotide frequencies in the genome, so that a particular nucleotide is scored by how surprising it is relative to what you would normally expect in the genome. If you're working in a genome that's very AT-rich, say, and you see an A or a T in your motif, that will be a lot less surprising than seeing a C or G, so a C or G supplies you with more information.
To summarize: we take the PFM count, add the sample size factor, divide by the background nucleotide frequency, and then take the log of all of this. The reason we take the log is that it makes all of the math a lot easier. What this model essentially represents is a probability, and we would have to multiply probabilities together across columns. Instead, we can take the log of each probability and add them. Computers are a lot better at adding than multiplying. This may seem surprising, but it is still true, even in 2015: it is a lot faster to add a bunch of numbers than to multiply them, and you'll get fewer numerical errors.
When we do this, we end up with a PSSM like this, where everything is log-transformed. You can see, for example, that in the first column, where the count was 5 for A, there's a positive score of 1.6 for A and a negative score for any other symbol, and so on. If you want to score TGCTG, you just look up the appropriate cells: for T in column 1 it's -1.7, for G in column 2 it's 1.0, and so on.
You just add all of those numbers together, and you get a probability that this sequence matches the motif, or really a score. It's not really a probability because it's been log-transformed. Here the score is 0.9. So how can we apply this to the genome? I'll show you a quick example. Yes, please?
I was just asked to clarify what this formula actually is, since there are some typographical issues on the slide. It's the count from the PFM, and you add the sample size correction. You divide all of that by the background probability of the nucleotide in question, and then you take the log of all of it. I'm sorry about the way this showed up; it always happens in the PC-Mac conversion.
OK, let's take the SP1 transcription factor as an example. Here's a sequence logo. When I said earlier that the sequence logo is a representation of a PFM, I should add that a sequence logo is often more likely to represent a transformed PWM or PSSM instead. That's usually what you'll find. This is the SP1 motif from the JASPAR database, which is a database of transcription factor binding motifs. Here's what the sequence logo looks like, and here's what the actual numbers in the matrix look like. You can see the matrix has 10 columns, just like the logo. We can score any particular sequence in the genome in the same way I showed you on the last slide. Say we want to score a motif starting at this position, GGGGGGGGCGGT. We look up the appropriate cell in the matrix for each position and add them all up. Then we get an absolute score, a raw score, which here is 13. Does anyone know what that means? I don't know what 13 means.
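The conversion from counts to log scores, and the add-up-the-cells scoring, can be sketched as follows. Several things here are assumptions rather than what was on the slide: the five-column PFM is illustrative, the background is uniform (0.25 per base), and the sample size term s(N) is taken to be a pseudocount of sqrt(N) spread over the bases in proportion to the background, which is one common formulation. With these choices the scores come out close to the numbers quoted in the lecture (1.6, -1.7, 1.0, and about 0.9 for TGCTG).

```python
import math

ALPHABET = "ACGT"

def pfm_to_pssm(pfm, background=None, pseudocount=None):
    """Turn a count matrix into log2-odds scores against a background model."""
    n_sites = sum(pfm[base][0] for base in ALPHABET)
    if background is None:
        background = {base: 0.25 for base in ALPHABET}  # assumed uniform
    if pseudocount is None:
        pseudocount = math.sqrt(n_sites)  # assumed form of s(N)
    width = len(pfm["A"])
    pssm = {base: [] for base in ALPHABET}
    for column in range(width):
        for base in ALPHABET:
            corrected = pfm[base][column] + pseudocount * background[base]
            probability = corrected / (n_sites + pseudocount)
            pssm[base].append(math.log2(probability / background[base]))
    return pssm

def score(pssm, sequence):
    """Raw score: sum of the matrix cells selected by each base in the sequence."""
    return sum(pssm[base][column] for column, base in enumerate(sequence.upper()))

def scan(pssm, sequence):
    """Score every window of motif width along one strand of a longer sequence."""
    width = len(pssm["A"])
    return [(start, score(pssm, sequence[start:start + width]))
            for start in range(len(sequence) - width + 1)]

# Illustrative PFM built from five hypothetical sites.
example_pfm = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}
example_pssm = pfm_to_pssm(example_pfm)
print(round(example_pssm["A"][0], 1), round(example_pssm["T"][0], 1))
print(round(score(example_pssm, "TGCTG"), 1))
```

Working in log space means a window's score is a sum of five table lookups, which is what makes scanning a whole genome cheap.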
So it's important to take any raw score like this and convert it into something that you can use more intuitively yourself and also compare against other motifs. We will convert it to a relative score. To do that, we find the maximum score that you could get from this matrix. If we hit the maximum score in every column, we get 15.2. And if we got the worst possible score in every column, the minimum score would be -10.3. Then the relative score is simply our raw score from before minus the minimum score, divided by the maximum score minus the minimum score. This is a relative measure of how close your sequence is to the best possible site for this transcription factor. In this case it's 93%, which is not bad.
The other thing we can do is convert this to a p-value. We come up with a frequency distribution of scores for every other possible match to this motif, and we compare the area to the right of our score with the area under the entire curve. That gives us a p-value. I imagine that this one has an OK p-value that might not stand up too well after you do some multiple hypothesis testing correction. Are there any questions about this? Much of the next part will rely on it.
I mentioned JASPAR. JASPAR is a database that you can use to get lots and lots of transcription factor binding profiles. You can go to jaspar.genereg.net. It has lots of sequence logos, matrices to download, and all sorts of other things like that.
The data comes mostly from published papers, and the authors usually got their data through in vitro methods like SELEX or protein binding microarrays, although sometimes the motifs are learned from ChIP-seq data.
All right, so we have this model for looking at transcription factor binding motifs. How well does it work? The good news is that a number of tests have shown that it works really well. Tronche and colleagues tested 50 predicted transcription factor binding sites with an in vitro test and found that 96% of them were bound. And Gary Stormo found that the information theory approach to PWMs that I told you about, where you treat each column by looking at how much surprise you get out of it, actually correlates very well with the binding energy you would expect from physical chemistry models. So these models should work really well.
The bad news is that other people have found that a method based on position-specific scoring matrices would predict a transcription factor binding site every 500 base pairs in human DNA sequence. And here's the ugly: if you scan, say, the actin gene with a bunch of profiles, you will find a lot of binding sites. The whole gene will be covered by predicted transcription factor binding sites, certainly not just the promoter. So there's a question of how meaningful this actually is, and that brings us to the so-called futility conjecture, which is that the transcription factor binding site predictions from these methods across the whole genome are almost always wrong. That is a bit disappointing, but we have methods that will help us with it. Yes?
[Audience question, partly inaudible, about the models typically being tested against in vitro data.] So are you asking whether the validation comes from in vitro studies?
Well, essentially what I showed you here is that the validation works very well in in vitro studies. And you can assume that they did appropriate negative controls, right? [Audience member: I wouldn't assume that; that's my point.] OK, well, you can read the paper and make your own judgment. But I think the problem here is not due to bad laboratory technique. It's due to the fact that there is a lot that is not accounted for by an in vitro assay. The sort of complexity I was telling you about before is going to be the main problem here.
So our method gives us a score. One of the first things people thought to do when they encountered this problem was to increase the threshold used to decide whether something is a real transcription factor binding site or not. As it turns out, that doesn't help very much. The factors that are causing this model to fail in vivo are not related to inherent properties of the model, so you can't just increase the threshold and expect a better result. The problem is that binding sites in vivo are, again, defined by some of that complexity I talked about earlier. The genome is in a three-dimensional structure. It is also wrapped around itself in a lower-level chromatin structure. And there are only going to be certain regions where transcription factors can bind. We'll talk a little more later about ways to get around this.
So what did we learn in the second part? Position-specific scoring matrices can reflect the in vitro binding properties of transcription factors, but they don't work so well in vivo. We have to add additional information to overcome that, which we'll go over in a later section. Any other questions on this part?
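To recap the scoring steps from this part, here is a minimal sketch of the raw score, the relative score, and an empirical p-value. The five-column matrix is hypothetical (it is not the SP1 matrix from JASPAR), and the p-value assumes every possible k-mer is equally likely, which is a simplification of a real genomic background.

```python
from itertools import product

ALPHABET = "ACGT"

# Hypothetical log-odds matrix, for illustration only.
PSSM = {
    "A": [1.6, -1.7, -0.2, -1.7, -1.7],
    "C": [-1.7, 0.5, 0.5, 1.3, -1.7],
    "G": [-1.7, 1.0, -0.2, -1.7, 1.3],
    "T": [-1.7, -1.7, -0.2, -0.2, -0.2],
}
WIDTH = len(PSSM["A"])

def raw_score(sequence):
    """Sum the matrix cell for each base of a motif-width sequence."""
    return sum(PSSM[base][column] for column, base in enumerate(sequence))

def max_score():
    """Best achievable score: the highest cell in every column."""
    return sum(max(PSSM[base][column] for base in ALPHABET) for column in range(WIDTH))

def min_score():
    """Worst achievable score: the lowest cell in every column."""
    return sum(min(PSSM[base][column] for base in ALPHABET) for column in range(WIDTH))

def relative_score(sequence):
    """(raw - min) / (max - min), between 0 and 1."""
    return (raw_score(sequence) - min_score()) / (max_score() - min_score())

def empirical_p_value(sequence):
    """Fraction of all possible k-mers scoring at least as high as this one."""
    observed = raw_score(sequence)
    all_scores = [raw_score("".join(kmer)) for kmer in product(ALPHABET, repeat=WIDTH)]
    return sum(s >= observed for s in all_scores) / len(all_scores)

print(round(relative_score("TGCTG"), 3), empirical_p_value("TGCTG"))
```

Enumerating every k-mer only works for short motifs; this is meant to show the idea of comparing one score against the distribution of all possible scores, not how any particular tool computes it.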
Before we go on and talk about that, I want to discuss where these transcription factor binding site models actually come from. In the beginning, I showed you a list of aligned sites; I think they were 21 9-mers. It was a really simple matter to count every column within that group of 21 9-mers and turn that into a PFM. In reality, no one is going to hand you a bunch of 9-mers and say, here are your transcription factor binding sites. If we want to discover motifs in the first place, we have to figure out which subsequences within the potentially bound sequences are actually responsible for binding. This is called the motif discovery problem.
You are given a number of sequences that you know, or have evidence, are bound by some transcription factor. Really this is a more general problem, so you can use it for proteins as well: a set of sequences that interact with some other sequence recognition factor. You want to find a motif, which is a set of subsequences that they have in common, modulo that degenerate nature I told you about before. You want to find how many motifs there are, the width of each motif, and the locations of these motifs. This is a surprisingly difficult problem because the input sequences can be really long. If you were just looking for short exact matches shared by all of these sequences, it would be relatively easy. But you have to allow all of the nuance that you have in a position-specific scoring matrix or sequence logo, where things may be only slightly similar.
Here's an example of how you might run into this problem in the lab. Let's say we're given a set of promoters from co-regulated genes. So here are a bunch of genes that were identified by RNA-seq in a bunch of different conditions.
You find that these genes are always correlated in their expression: when one is on, the other five are on, and vice versa. So we suspect that there's some transcription factor responsible for this behavior, and we want to figure out what it is. It binds to positions that are unknown to us, and they could be on either DNA strand. Let's look at this with hindsight to understand the problem; we don't know this from the start. Let's say the binding sites are here, here, here, here, and here. One additional complexity is that a site might be on the plus strand, in which case it reads AAGTCA, or it might be on the minus strand, in which case what you see on the plus strand is the reverse complement of AAGTCA, which is TGACTT. We'll assume that the binding motif can be described by a position-specific scoring matrix. I usually say position weight matrix; it makes it less likely that I will trip over my own tongue. So our problem, which is not very simple, is to discover the motif and the sites given just the sequences, without any of the other information I showed you before.
It's really hard to do this exhaustively. There's not enough compute power to try out every possible position weight matrix you could think of. So we use an alternating approach, which is something that is often done in machine learning and other sorts of predictive modeling. We start by simply guessing an initial weight matrix, believe it or not. We will have some sort of algorithm to make it an educated guess, but essentially we start with a guess. Then we use that guessed weight matrix to predict individual instances of the binding site. Then we take what has been predicted there, maybe with some dropping out, as I'll show you in a minute.
From those predicted sites we estimate a new weight matrix, and then we repeat the process over and over again. So you essentially start with some guess and refine it until you can't make it any better using this method. You can set this up so that the quality of the matrix always increases through one run of the alternating approach. The problem is that you always have to worry about getting stuck in a local maximum. You can't count on finding the global maximum, but from some particular starting point you can find something close to the local maximum. Any questions? Yes?
The question is whether this is based on knowing motifs beforehand. No, this assumes that you know nothing except the sequences, so you're going to discover the motif totally de novo. The guess does not come from previous information like that.
I'll show you one particular way of doing this to make it a little more concrete. Let's say we want to discover a motif in these five sequences. We start with the random guess. We don't make the PWM totally out of thin air. It is better to pick a number of sites at random, take the sequences at those sites, and create a position weight matrix from them. I'm going to tell you about one very simple technique called the Gibbs sampler. In steps two and three, instead of just refining the matrix directly, we drop out one of the sequences and define the position weight matrix based on the remaining sequences. Then we pick a new site in the dropped sequence based on the probability distribution from the previous step.
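The drop-one-out loop just described can be sketched as a small site sampler. This is a simplified illustration, not how any particular published tool is implemented: it assumes exactly one site per sequence, searches the plus strand only, uses a uniform background and a pseudocount of 1, and all function names and example sequences are made up.

```python
import random

ALPHABET = "ACGT"

def profile_from_sites(sites, pseudocount=1.0):
    """Per-column base probabilities estimated from the current sites."""
    width = len(sites[0])
    profile = []
    for column in range(width):
        counts = {base: pseudocount for base in ALPHABET}
        for site in sites:
            counts[site[column]] += 1
        total = sum(counts.values())
        profile.append({base: counts[base] / total for base in ALPHABET})
    return profile

def window_weight(profile, window, background=0.25):
    """Likelihood ratio of a window under the motif profile versus background."""
    weight = 1.0
    for column, base in enumerate(window):
        weight *= profile[column][base] / background
    return weight

def gibbs_sampler(sequences, width, iterations=200, seed=0):
    """Return one sampled site start position per sequence."""
    rng = random.Random(seed)
    # Step 1: random initial guess, one site per sequence.
    positions = [rng.randrange(len(seq) - width + 1) for seq in sequences]
    for _ in range(iterations):
        for held_out, seq in enumerate(sequences):
            # Step 2: build the matrix from every sequence except the held-out one.
            other_sites = [s[p:p + width]
                           for j, (s, p) in enumerate(zip(sequences, positions))
                           if j != held_out]
            profile = profile_from_sites(other_sites)
            # Step 3: sample a new site in the held-out sequence,
            # with probability proportional to its likelihood ratio.
            starts = list(range(len(seq) - width + 1))
            weights = [window_weight(profile, seq[s:s + width]) for s in starts]
            positions[held_out] = rng.choices(starts, weights=weights, k=1)[0]
    return positions

# Made-up sequences with a planted AAGTCA site in each.
sequences = ["CGTAAGTCATTGC", "TTAAGTCAGGCAT", "GCCGTAAGTCACG",
             "ATGCAAGTCATCC", "GGAAGTCACTTAG"]
positions = gibbs_sampler(sequences, width=6)
print([seq[p:p + 6] for seq, p in zip(sequences, positions)])
```

Because each new site is sampled rather than chosen greedily, the sampler can move away from a poor starting guess, but different seeds can still end at different answers, so in practice one would run it from several starting points.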
This should reduce problems from overfitting, because you're leaving out a little of your data in each step and repeating the process. You'll go through this in a cycle, leaving out each one of the sequences in turn and learning from the rest. This diagram shows how it works in a little more detail; it will be in your notes, and I won't go into it further. There are a variety of other techniques like the Gibbs sampler. The most commonly used tool for this sort of problem is called MEME, which uses something called expectation maximization instead. That is a little more difficult to explain, but you can think of it as conceptually similar.
At the end of this, you will end up with a position weight matrix. The question then is, what do you do with it? Here's where data from a database, or previous understanding of motifs, can come in. We've discovered this query motif de novo, and we can use a tool called Tomtom to search for matches to it in a motif database like JASPAR. It will give you a nice little report showing that our query motif is a really good match for motif 795 in JASPAR, which is also called the optimal motif. So that's one way to go about figuring out what is responsible for the coordinated transcriptional regulation of a set of genes. Yes, question?
Sorry, I should repeat the question for the recording. The question is, why would you discover this de novo and then compare your de novo model against the database? Why not just search using the models that are already in the database? In reality, I usually do things both ways. The problem with just using what's in the database is that you bias yourself towards what's already known. If there's something new, you won't discover it.
[Audience question: how many are actually in this database?] The database has on the order of hundreds of transcription factors, and in humans there are estimated to be 1,400 to 1,900 transcription factors. So there are a lot that are still missing from these sorts of databases. I know there are various people working on ways to exhaustively get motifs for all the transcription factors that we can, but unfortunately some of them are more difficult to get than others. Before I go on, are there any other questions?
I'm going to talk a little more now about the sort of insight you can get from sets of transcription factor binding sites. This is the first approach we'll talk about for getting around the futility conjecture I mentioned earlier: we can use more information than just whether particular binding sites occur or not.
Let's reintroduce the problem. Say we have a series of microarray experiments, and we can get some sets of co-regulated genes and also some negative controls. So here's a set of co-expressed genes and some negative controls. If we look for a particular transcription factor motif, indicated here by the sequence logo or position-specific scoring matrix, we can find instances of it in the co-expressed genes, and maybe we won't find it much in the negative controls. Essentially, we want to see whether there is enrichment for a particular motif within the co-expressed genes that is missing from the negative controls. We can do this with a technique very similar to what we would use for finding GO term enrichment. There are a couple of different ways that we can look for this.
One is that, in the positive or foreground set, we can look for more genes that have a binding site for a particular transcription factor than in the background genes. Or we can look for more total sites. The second is often a better approach because, as I mentioned, binding sites for a transcription factor will often occur all throughout the genome. Even in a set of supposed negative controls, you're still going to have motif instances for any particular transcription factor. What we're looking for is a concentration or enrichment of these binding sites at, say, the promoter of some gene or the enhancers connected to it. That gives us additional evidence that the transcription factor is important in that gene's regulation.
There are two tests for these two ways of assessing enrichment. One is Fisher's exact test, which looks at the number of genes, as I showed you on the left side. The other is the binomial test, which looks at the number of occurrences. Thankfully, you don't have to re-implement this yourself in R, say. There are tools that do this for you. There is a website called oPOSSUM, which lets you take a set of co-expressed genes. You can feed it a gene list, or you can feed it a set of sequences. If you feed it a gene list, it will retrieve the sequences from Ensembl. It will optionally narrow down the sequences to only those regions that are well conserved evolutionarily between this species and some related species. So you might look for sites that are conserved between human and mouse, between mouse and rat, between human and dog, or something like that. I'll describe this a little more in a second. Then it will detect transcription factor binding sites using the same sort of scoring methodology we talked about earlier.
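For intuition about the two enrichment tests mentioned above, here is a small pure-Python sketch. It is not necessarily how oPOSSUM computes its statistics: the function names and counts are hypothetical, the Fisher test is the one-sided (enrichment) version, and the binomial test assumes each scanned position is an independent trial with a site rate estimated from the background set.

```python
from math import comb, exp, lgamma, log

def fisher_exact_greater(fg_hits, fg_total, bg_hits, bg_total):
    """One-sided Fisher's exact test on gene counts.

    Probability of seeing at least fg_hits foreground genes with a site,
    given the overall number of genes with a site (hypergeometric tail).
    """
    total = fg_total + bg_total
    hits = fg_hits + bg_hits
    denominator = comb(total, fg_total)
    upper = min(fg_total, hits)
    tail = sum(comb(hits, k) * comb(total - hits, fg_total - k)
               for k in range(fg_hits, upper + 1))
    return tail / denominator

def binomial_greater(observed_sites, trials, background_rate):
    """One-sided binomial test on site counts, with 0 < background_rate < 1.

    Probability of at least observed_sites sites in `trials` positions
    when each position is a site with probability background_rate.
    """
    tail = 0.0
    for k in range(observed_sites, trials + 1):
        # Work in log space so large binomial coefficients do not overflow.
        log_pmf = (lgamma(trials + 1) - lgamma(k + 1) - lgamma(trials - k + 1)
                   + k * log(background_rate)
                   + (trials - k) * log(1.0 - background_rate))
        tail += exp(log_pmf)
    return tail

# Hypothetical counts: 12 of 20 co-expressed genes have a site,
# compared with 30 of 200 background genes.
print(fisher_exact_greater(12, 20, 30, 200))
# Hypothetical counts: 25 sites in 5,000 foreground positions,
# with a background rate of 2 sites per 1,000 positions.
print(binomial_greater(25, 5000, 0.002))
```

The first test asks whether more genes than expected carry the motif at all; the second asks whether the foreground carries more motif instances than the background rate would predict.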
oPOSSUM then comes up with a measure of statistical significance for each motif, and you get some putative transcription factors.
One thing I want to talk about real quickly is this idea of phylogenetic footprinting. Around the time when we were collecting lots of genomes from different organisms, this was really in vogue. It was seen as the best way out of the futility conjecture: just look at those regions that are really well conserved. If evolution has conserved them, they probably confer some selective advantage, and they are probably, quote unquote, real. I think this has fallen a little out of favor in the intervening years, now that we've started to develop ChIP-seq data and all of the other data sets I told you about at the beginning of this lecture. What those data revealed is that conservation filtering makes things way too stringent: there are a lot of real binding sites that you cannot identify with methods like phylogenetic footprinting. So I would be cautious in its use. Personally, I never use phylogenetic footprinting as a pre-filter. Looking for evolutionary conservation is something I would do afterwards, when attempting to describe the implications of particular sites.
All right, let's look at a set of results we could get from running oPOSSUM. We have the microarray experiment I showed you before, with some muscle-specific genes and some liver-specific genes, and we identified these various transcription factors in these gene sets. SRF has a really good binomial test Z-score, and so do MEF2, c-Myb, Myf, et cetera. As it turns out, these factors, especially the ones with an M at the beginning of their names, are known to be muscle-specific.
So this shows you that you can identify valid transcription factor enrichment within these sorts of data sets, and you can do the same sort of thing with these liver-specific genes. These are known liver-specific genes. And the other ones may or may not actually exist there. So you also see some amount of false positives. These have not been experimentally verified. So in general, this is something that will give you some information that you would use to plan future studies. So if you did something like this, you wouldn't necessarily want to just publish in your paper, oh, yeah, this thing is full of SRF binding sites. You might want to confirm that further, potentially with chromatin immunoprecipitation or some other method. So think of it as something that gives you additional understanding but is not necessarily the final word, due to those futility problems I told you about before. So oPOSSUM, as I said, you can find on the web. I think there are links from the wiki. There are a variety of different options here if you want to try this. I would start in the first row, and you can pick your species here. The species options are there because, if you have a gene list, they make it really easy for you to just put in the gene names. If your species is not one of these five, you can use a sequence-based version instead and just supply the FASTA files for your putative promoter regions or whatever, which you can extract from Galaxy or from somewhere else. Are there any questions about this part? Yes? Yeah, so the question is how well would this work for a single gene rather than a set of co-regulated genes? So you can identify putative binding sites in, say, the promoter of a single gene, but most of them, back to that futility conjecture again, will not necessarily be meaningful. So in this method, we've relied on the statistical power we get from looking for enrichment over many different data points. That's what doing this sort of analysis gets us, so you have much more confidence. 
Within a particular gene, you wouldn't use this. You would use something called FIMO, which is part of the MEME Suite, and we'll use FIMO in the lab, but you would want to be really, really cautious, much more cautious, about interpreting results from a single gene. Better for a single gene, if you're working in an organism where there is good epigenomic data available, like ChIP-seq data or DNase-seq data, is to use that data to interpret what's going on in your favorite gene. And you can also sort of combine those things. So if you know that, say, there's a DNase hypersensitive region upstream of your gene, you might want to focus your motif search just on that, and that would be another thing that would increase the likelihood that a hit is real. Just one more question. Yeah. Does it allow tissue specificity to come in, in a useful way, to restrict the search to relevant factors and leave out ones not expressed in your tissue, say liver-specific factors when you work with muscle tissue? It would be good to rule out the ones that are in the wrong tissue. Yeah, so the question is whether databases like JASPAR include information known about cell type specificity of motifs. It's not there in a structured format. So usually if you go to JASPAR and you're looking at one transcription factor, besides the PWM, there will be links to the papers where it was described, and those papers will tell you about it. There will also be links to the gene identifiers for the gene that expresses that transcription factor protein. And from there, you could infer this sort of thing. But again, personally, I think it's a better approach to do something like this without biasing yourself towards the expression of one particular transcription factor, and then use that sort of information post hoc to understand what's going on, than to use it as a pre-filter, because you're going to miss things that way. 
And an important thing to realize is that there's a lot we don't understand. So say we have some particular transcription factor that so far has only been seen expressed in liver. That doesn't mean that if you're working on some sort of muscle cancer it won't show up there, or that it's invalid. So it's important not to reinforce assumptions that are made by the happenstance of where we've identified things so far. Yes. On meaningfulness for one gene, you mentioned two databases. Are we going to look into those links? I'm going to talk about it a little more. So there are links at the beginning of the lecture, which you should have a copy of, to the ENCODE project and the Roadmap Epigenomics project. And they also have excellent tutorial publications online. There's a lot of data in there, so it's not necessarily so easy, sometimes, just clicking on different things if you don't know what you're doing yet, but they do have documentation and tutorials that make it easier to understand that stuff. Yeah, I could probably talk about those particular websites for two hours or so, and I have in the past, although not here at CPW. So let's talk about one other wrinkle here, which is that we'll often have families of transcription factors with transcription factor binding sites that are indistinguishable. So here, for example, we have the ETS family. There are a variety of different ETS proteins and they all seem to bind to very similar DNA sequences. So how do you pick which one, if you're in this particular set of circumstances? There is software called top gene which can be used to understand this particular problem further, but most of the time this isn't something that you will run into. And as mentioned on the slide, zinc finger proteins, even though they're all in the zinc finger family, have a broad diversity in the motifs that they recognize. 
So any questions on this part of the lecture before I go further? All right, so now in this final part I'm going to talk about some other techniques that you can use to bring in some of those other data sets and to understand how transcription factors influence gene regulation, and maybe not necessarily right at a transcription start site. So one thing that I'll mention here is a technique called Segway. Segway is software that has come out of my lab, and it allows you to integrate lots of different data sets, ChIP-seq data sets, and boil them down to a simple annotation of a genome at every base of the genome. So Segway will, say, take the human genome and it will assign some sort of activity to every part of the human genome in some cell type. Okay, so it'll say this is a promoter, this is a transcribed gene, this is an enhancer, this is a repressed region, this is regulatory, this is heterochromatin, and so on. And so if you're looking at, say, ENCODE data using something like Segway, if you're using human ENCODE, you can get to Segway right from any major genome browser. So it's built into the UCSC and Ensembl genome browsers. You can just turn on the right track and then it will tell you generally what's going on in a particular region, and you can use that to narrow down the regions that you're interested in analyzing further. So you limit yourself more to a transcribed region rather than, say, a repressed region. There's another really cool tool which is called GREAT, which you can use if you've had some experiments, say, that give you a number of sequences or regions that aren't necessarily tied to a gene. So here is the problem GREAT solves. If you have a gene list, it's relatively easy to do GO enrichment or similar sorts of functional enrichment. If you don't have a gene list, if you instead have a set of sequences or a set of regions in the genome, which may occur within intergenic regions, you can use something like GREAT instead. 
So it'll take as input a BED file with all of those regions, and it'll give you output that looks like this, and if you drill down further it will tell you, say, which pathways seem to be enriched within the particular regions you input. So most of the way this works is by taking those regions and associating them, for example, with the nearest gene. So it does do something similar to a gene-list analysis, but it solves a lot of problems and has appropriate statistical treatment that you would not necessarily get if you decided to, say, write a script that just found the nearest gene for any particular set of regions. So most of what I've talked about up until now has been looking at particular single transcription factors, but often transcription factors work in concert. They will cooperate with each other, and you will often find two transcription factors in a pattern together. This is another sort of thing that gets around the futility conjecture. You get the additional information that's supplied by having two transcription factors that seem to be really close to each other, and there's a tool called SpaMo, which is part of the MEME Suite, that allows you to do this sort of dual transcription factor analysis. You can use it to find whether there are two transcription factors that are near each other and spaced in a very particular way. So finding two sites that are almost always, say, eight or nine base pairs away from each other, and you can see something like that in these histograms here, is a very strong signal that this is a real interaction. So there are a lot of challenges in this field of understanding transcription factor binding and making predictive models of transcription factor binding ahead. So the first is the holes in these databases. Like I said, there are maybe 1,900 transcription factors. So first, people aren't even really sure of that number, right? So I usually say between 1,400 and 1,900 transcription factors. 
We haven't validated all of them. We certainly don't have in vitro data on what all of them bind to. We don't have in vivo ChIP-seq data on how that actually plays out in reality. So that will be a big challenge, one that I was hoping would be more solved by now, but getting the last bit of a project like this done can actually be a little stickier. Understanding how genetic variation in transcription factor binding sites affects how effective those binding sites are, I think, will be a really important challenge in the years ahead. A lot of the people who are doing, say, human genetics have focused a lot on exome sequencing, which often ignores the variation that you would find in non-exonic regions that affect gene regulation. And it's likely a lot of these are important in defining various phenotypes or diseases. Integrating different data sources, such as ChIP-seq data or DNA methylation data, will be an important challenge as we go ahead. And finally, these models that I showed you, these position weight matrices, are really convenient, but even at the in vitro level, even if you're working with naked DNA, there are probably better models that we could use. So people are thinking about shifting to energy models. There's work in understanding how changing the DNA sequence actually affects the shape of the DNA helix. So not just what bases are sticking out to the minor groove, but actually different series of bases will cause torsion on the DNA double helix in different ways or cause twisting in various ways. And people are starting to understand that more and more. But mainly, as somewhat of a biased person, and I say I'm biased because this is what I study in my lab, I think a lot of this complexity is really coming from the interactions between lots of different factors and lots of different things that can be measured in the epigenome. 
And we're going to need to understand how all of these things tie together a lot better to make really good predictive models or to improve them further. So I'm going to briefly summarize this lecture. So first is that we have models for transcription factor binding motif discovery and analysis which work really well in vitro, but in vivo, finding one of these sites with these methods alone has very little relationship to some sort of in vivo function. But using additional methods on top of that, such as looking for clusters of transcription factor binding sites, can be helpful. Conservation can be very helpful, although it'll also give you some amount of false negatives, and looking at interactions with epigenomic data can also be very helpful. And I think there will be a lot of work there in the future.
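The point about clusters of binding sites, and in particular two factors whose sites sit a fixed distance apart, can be sketched with a simple spacing histogram. This is only an illustration of the idea and is not SpaMo's algorithm, which applies its own statistical test. The function name, the input layout (a dictionary from sequence ID to site start positions), and the toy positions are all assumptions.

```python
from collections import Counter


def spacing_histogram(sites_a, sites_b, max_gap=30):
    """Count offsets (b - a) between sites of two motifs on the same sequence.

    sites_a and sites_b map a sequence ID to a list of site start positions.
    Only pairs within max_gap base pairs of each other are counted.
    """
    counts = Counter()
    for seq_id, positions_a in sites_a.items():
        for a in positions_a:
            for b in sites_b.get(seq_id, []):
                offset = b - a
                if offset != 0 and abs(offset) <= max_gap:
                    counts[offset] += 1
    return counts


# Toy site positions, invented for illustration only.
motif_a_sites = {"seq1": [100, 400], "seq2": [50], "seq3": [210]}
motif_b_sites = {"seq1": [108, 415], "seq2": [58], "seq3": [219, 300]}

histogram = spacing_histogram(motif_a_sites, motif_b_sites)
for offset, count in sorted(histogram.items()):
    print(f"offset {offset:+d} bp: {'#' * count}")
```

If the two factors really do cooperate, you would expect one or two offsets to stand out sharply over many sequences. A flat spread of offsets suggests the sites are near each other only by chance.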