Hi everyone, I'm Michael Hoffman. I'm a scientist at Princess Margaret Cancer Centre in Toronto and assistant professor of medical biophysics and computer science there. So today I'm going to talk about gene regulation and motif analysis, and what that means I'll tell you in a minute. At the end of today's module you should understand the challenges in predicting transcription factor binding, especially in predicting transcription factor binding from DNA sequence. You should be able to identify binding sites for known transcription factors, and you should be able to discover transcription factor binding motifs in genomic regions, such as those from ChIP-seq peaks or promoters, using iRegulon and Cytoscape. That is what will happen in the second part of this morning: Veronique has prepared a lab that demonstrates some of the techniques shown here using an interface called iRegulon, which uses some of the Cytoscape stuff you guys have learned earlier this week. So there are a lot of different little parts here. We're going to learn about eukaryotic transcription, and we're going to learn about several different computational tools for predicting transcription factor binding: where they work, where they don't work, what's next in the field, and so on. So I'll start here with our introduction to eukaryotic transcription. Here is a really oversimplified version of transcription in a eukaryote. You have some DNA here, and the first step is that you have some transcription factor which recognizes some motif. The motif is in green here, so that green represents a transcription factor binding site. The transcription factor recruits RNA polymerase II, and then RNA polymerase II comes along and produces RNA. So, how much fun did I have? Not very much, because I did it a few years ago and then I lost the animation, so this is all me redoing something I did several years ago.
It was not so fun, no, but I will say this is the unveiling of the new animation. For the past few years I've had a note to myself, "find animation", and it's because I care about you guys so much that you actually get the real new animation. So that's a very simple version. Do you all work on eukaryotes, or are there any people here who mainly work on bacteria? Anyone? Do people work on yeast, metazoans, plants? Is everyone here an animal person? Any yeast people? Yes? Okay, Francis is a yeast person. This works kind of the same in yeast. I've talked to a few of you guys; most of you have some biology background, a few of you don't, but as you probably know, it's actually a bit more complicated than this. There are a lot of other factors, and this is where things start to vary between yeast and humans and other animals and so on. There are a lot of other things that affect whether you have transcription or not. So here's a transcription start site. Not only will it be affected by some transcription factor binding site, there are going to be lots of transcription factor binding sites. There will be transcription factor binding sites near the start of the gene and also distal from the gene, and you have to figure out how those work together. There's splicing and alternative transcripts that can affect gene expression as well, and that can even affect gene transcription before things are separated off of pre-mRNAs. So there's a lot of complication.
If you zoom out a bit, you can see it's even more complex than what was shown in that previous model, because all of this is happening in 3D. On the one hand that makes it easier to figure out how these distal regions affect transcription: it's because they can actually be quite close in three dimensions. But figuring out how this works adds an additional level of complexity to modeling this and to taking measurements of it. And of course all of this is in the context of chromatin structure, so DNA is wrapped around nucleosomes, and nucleosomes have different modifications of their own, and so on and so forth. So a lot of what people have been doing in the past few years in gene regulation is trying to understand not just this very simple model, but the structure of DNA and chromatin at higher levels, and the other biomolecules that interact with DNA and contribute to transcription and other parts of gene regulation. People have used a variety of techniques over the past few years that I will throw into the general category of functional genomics. They include things like DNase-seq or FAIRE-seq or, nowadays, ATAC-seq, which people can use to figure out where regions of open chromatin are. Often a lot of the processes that I showed you in the much simpler model will only work if you're in a region of open chromatin to start with. People have also developed ChIP-seq techniques that they can use to figure out both where individual transcription factors are bound along the genome and where various histone modifications are. The histones that make up the nucleosomes that the DNA is wrapped around will have various covalent modifications that can be signals for different sorts of gene regulatory activity happening.
So you can get RNA-seq and you can use that to get a map of where all of the genes are, but with all of this other data you can also get maps of where all the regulatory elements are, both the nearby proximal regulatory elements and also the long-range regulatory elements which are far away. So there have been projects like ENCODE. By now there are more than 10,000 experiments deposited and freely available from ENCODE that will tell you various things about chromatin structure and properties of chromatin in different human, mouse, and sometimes fly and worm cell lines, so there's a lot of data available. You can get data from the ENCODE project, you can get some data from the UCSC Genome Browser and Ensembl, and there's a lot of data deposited in GEO as well, although often it's a lot easier to use these data if you go first to a consortium website like the ENCODE website or the Roadmap Epigenomics project website. So there's a lot of data, and we'll talk a little bit later on about how you can use this to figure out the context of transcription within this much more complex model of transcriptional regulation. But first we'll go back to the very oversimplified version and figure out how we can predict, using a computer, exactly what is being transcribed. Any questions before I move on? Alright, so in this next part we're going to try to come up with models that will allow us to predict, just based on sequence, whether there is transcription factor binding. A single transcription factor, maybe interacting with other transcription factors, can drive transcription, but often if you're trying to figure out what is actually the locus of regulatory control for a particular set of conditions, it's going to be a single transcription factor and its binding. So here's one site of transcription factor binding, a transcription factor binding site: a single sequence of DNA.
Unfortunately this is the main complication, or maybe I shouldn't say it's the main complication. The first complication in understanding what transcription factors bind is that they do not usually bind to one exact DNA sequence. Usually they can bind to a whole set of related sequences. So here's a set of binding sites for some transcription factor, and if you're just eyeballing this you can see there's a lot of similarity between these different binding sites. There's usually a TT at the fourth position. Things are mostly the same, but there are some differences from one site to another. There are some positions that are more likely to change and other positions that are less likely to change. So we need to be able to represent that if we want to characterize what DNA a transcription factor binds to. One way to do that is to use the IUPAC ambiguous DNA alphabet. Even if you aren't familiar with the whole alphabet, which is sort of the province of geeks like myself, you've probably seen at least parts of it. Maybe you've seen R as an abbreviation for A or G, or Y as an abbreviation for C or T. There's a whole alphabet that will allow you to represent any combination of A, C, G, and T. This V here means A, C, or G. So you've got A, C, or G here, and then you've got this, which is A, C, or T, and this R, which means A or G, and so on. You can represent all of the individual binding sites here combined into one sequence. Can anyone tell me something that might be missing in this description of this set of sites? The frequencies, yes, exactly. So you can say, all right, this first position is never T, right? But if you look at these sites here, yes, it's never T, but most of the time it's actually A. So you've got more information than the fact that it's just never T, right?
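As a rough illustration of the IUPAC representation just described, here is a minimal Python sketch that collapses a set of aligned sites into one ambiguity-code string. The function name `iupac_consensus` and the example sites are invented for illustration and are not the sites on the slide. Note that the result records only which bases occur in each column, not how often, which is exactly the missing information pointed out above.

```python
# Map each set of observed bases to its IUPAC ambiguity code.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def iupac_consensus(sites):
    """Collapse equal-length, aligned sites into one IUPAC string."""
    # zip(*sites) walks the alignment column by column.
    return "".join(IUPAC[frozenset(column)] for column in zip(*sites))


# Hypothetical aligned binding sites (made up for this example).
sites = ["ACTTA", "CCTTG", "GATTA", "ACTTA"]
consensus = iupac_consensus(sites)
print(consensus)  # first column contains A, C, and G, so it becomes V
```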
So people came up with a slightly more complex model to incorporate that information, and it's called a position frequency matrix. Here you can take this set of sites and represent it as a position frequency matrix, or PFM. The matrix has just four rows, one for each symbol in the alphabet, and one column for each column within these aligned binding sites. You simply count up the number of times you see each one of these symbols and put it here in this matrix. So you see A 14 times, C 3 times, G 4 times, T 0 times, and so on, and you repeat that for the rest of the columns. Probably only a few of you have seen PFMs before, but I'm sure most of you have seen this sort of representation of a matrix that describes a motif, which is called a sequence logo. This is the same thing as this; it's essentially just a graphical version. In every column you have a set of symbols, and the height of each symbol indicates how frequent the symbol is in the position frequency matrix. Now, you can use a sequence logo to represent a couple of different underlying matrices. A PFM is just one of those, and we'll go on to some others in a second. Any questions here? Yes. So this model still doesn't cope with cases like dependency between different positions in the sequence? No, it doesn't. Is there anything that does account for that? There is. Why don't you ask me about that at the end? Because we have to get through the simpler models that don't account for the interdependency between positions before we can move on to things that might. Any other questions? Okay. So I said there are a couple of different kinds of matrix. Actually, the software that people use to find transcription factor binding sites rarely uses these raw position frequency matrices.
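The counting step behind a position frequency matrix can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `position_frequency_matrix` and the example sites are assumptions made for this example, not the data shown in the lecture.

```python
def position_frequency_matrix(sites, alphabet="ACGT"):
    """Count how often each base occurs in each column of aligned sites."""
    width = len(sites[0])
    # One row per base, one column per aligned position, all starting at 0.
    pfm = {base: [0] * width for base in alphabet}
    for site in sites:
        for i, base in enumerate(site):
            pfm[base][i] += 1
    return pfm


# Hypothetical aligned binding sites (made up for this example).
sites = ["ACTTA", "CCTTG", "GATTA", "ACTTA"]
pfm = position_frequency_matrix(sites)
for base in "ACGT":
    print(base, pfm[base])
```

Each column of the result sums to the number of input sites, which is a quick sanity check that the alignment was counted correctly.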
Instead, people use something called a position weight matrix or a position-specific scoring matrix, which I usually think of as synonymous. I'll explain how that works in a second. So let's start with a simple position frequency matrix, which we'll call F. F has four rows, one for each base, A, C, G, and T, and it has five columns, and we'll call a specific column i. This came from a set of sites. It may have come from ChIP-seq or SELEX or various other things, and we got five sites at the end. This is the position frequency matrix we got from counting up this five-wide motif in five different sites. What we do next is add various corrections. The first thing we will do is correct for nucleotide frequencies in the genome. Any time we have the frequency of a specific base at a position, we will divide by the frequency of that base in the genome overall. This division allows you to incorporate how surprised you are to see a particular base. Say you had some very AT-rich genome, maybe something with a GC content of 10%, so this genome is almost all A's and T's. You are not going to be very surprised to see a motif that has a lot of A's and T's in it. So you want to incorporate that surprise into this somehow. Another thing that we're going to do is weight for how many samples you have going into your position frequency matrix. Using the PFM formulation, if you have five sites where the first position is an A and there's zero of everything else, that is considered equivalent to having, say, 100 sites where the first position is A and nothing else. But you know intuitively that's not really true. Really, the more times you see something in your input data set, the more weight it should have.
And what we do here, in order to account for the fact that we have limited data coming in, is add what's called a pseudocount. In this case, we might take this PFM and add one to every position, so this first column would be 6, 1, 1, 1. That will keep us from having zeroes in the matrix, which means you won't get a zero score if you ever see something that is not A. Because when you have a limited set of input sequences, compared to, say, the thousands of real sequences you'll see in nature, the fact that you never saw anything other than an A there shouldn't absolutely disqualify it. On the other hand, by adding a fixed number, one, if you have a hundred or a thousand sites coming in, it's going to affect the position weight matrix a lot less than in the case where you had, say, five or ten sites coming in. And the final thing we're going to do is convert all of this to log-scale probability so that we get very easy arithmetic. Most of you probably think that computers are very fast at doing any sort of basic arithmetic, but actually computers are much better at adding than they are at multiplying. If you can convert things to log space, and you're going to do a lot of calculations on them, it's a lot more convenient to have things in a form where you can just add up the numbers later on. It also reduces various sorts of numerical stability problems you might have. So here's an example of a PWM that we got out of this PFM here, and here's how we can use the PWM to score this particular sequence instance against this motif. So for T, G, C, T, G, we just look at the corresponding row and column in the matrix. The first position is T, and we take the score there, -1.7, and we repeat that for G, C, T, and G. We get all of these individual scores, add them together, and get a final score of 0.9.
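The three corrections described above (pseudocount, division by background frequency, and log transform) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the exact formula behind the slide: it assumes a pseudocount of 1 added to every cell, a uniform background of 0.25 per base unless another is supplied, and base-2 logarithms, so its numbers will not necessarily match the -1.7 and 0.9 quoted above. The PFM values and the function names `pfm_to_pwm` and `score_sequence` are invented for illustration.

```python
import math


def pfm_to_pwm(pfm, background=None, pseudocount=1.0):
    """Convert counts to log2 odds scores against a background model."""
    bases = sorted(pfm)
    if background is None:
        # Assumption: uniform base frequencies in the genome.
        background = {b: 1.0 / len(bases) for b in bases}
    width = len(pfm[bases[0]])
    pwm = {b: [] for b in bases}
    for i in range(width):
        # Pseudocounts keep unseen bases from getting probability zero.
        total = sum(pfm[b][i] + pseudocount for b in bases)
        for b in bases:
            prob = (pfm[b][i] + pseudocount) / total
            # Dividing by the background captures how surprising the base is.
            pwm[b].append(math.log2(prob / background[b]))
    return pwm


def score_sequence(pwm, seq):
    """Sum the log scores for each base of seq at its position."""
    return sum(pwm[base][i] for i, base in enumerate(seq))


# Hypothetical five-site, five-column PFM (made up for this example).
pfm = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}
pwm = pfm_to_pwm(pfm)
print(round(score_sequence(pwm, "TGCTG"), 2))
```

Because the scores are logarithms, scoring a sequence is a sum of five table lookups rather than a product of five probabilities.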
And if you want to convert that back to the original probability space, you can exponentiate it, but usually people don't bother. People just work in these log probability scores. Any questions about position weight matrices? Or, as I mentioned, they're also called position-specific scoring matrices. So let's extend this. Let's take a case where we have a motif that we've already defined somehow. This is an SP1 motif. You can see the sequence logo here, and you can see a position weight matrix version of this logo here. And here is a sequence, and we are trying to score the whole sequence against this motif. We'll repeat this sort of process for every position within the sequence we're scanning. So let's imagine we've gotten to here, and we score this subsequence against the motif. We've got G, G, G, G, so you add up what's here for G, G, G, G, and so on, and you get what's called an absolute score or a raw score of 13.4. Again, we want to go back to this question of how surprised we are to see this. One way to do that is to look at what the maximum possible score would be for this particular motif. If we had a sequence that had essentially whatever is the biggest letter at every column here, and we sum those up, that would give us 15.2, and the minimum score is -10.3. That gives us an indication of where this raw score fits among all of the possible scores that you might get out of this particular motif. We can simply calculate a relative score by taking the absolute score we calculated here minus the minimum score and dividing that by the full range of potential scores, maximum minus minimum, and we get 0.93, or 93%. So this initially seems like a pretty good match to this motif. It's 93% of the way to the biggest possible match. We like doing things with more of a p-value approach.
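The relative score calculation above can be checked with a short Python sketch. The numbers 13.4, 15.2, and -10.3 are the ones quoted in the lecture; the function names and the tiny two-column matrix used to demonstrate scanning are assumptions made up for this example.

```python
def score_range(pwm):
    """Return (min, max) achievable scores by summing column extremes."""
    width = len(next(iter(pwm.values())))
    lowest = sum(min(pwm[b][i] for b in pwm) for i in range(width))
    highest = sum(max(pwm[b][i] for b in pwm) for i in range(width))
    return lowest, highest


def relative_score(raw, lowest, highest):
    """Place a raw score on a 0-to-1 scale between min and max."""
    return (raw - lowest) / (highest - lowest)


def scan(pwm, sequence):
    """Raw score of the motif at every start position on one strand."""
    width = len(next(iter(pwm.values())))
    return [
        sum(pwm[sequence[start + i]][i] for i in range(width))
        for start in range(len(sequence) - width + 1)
    ]


# Values quoted in the lecture for the SP1 example.
print(round(relative_score(13.4, -10.3, 15.2), 2))  # 0.93

# Hypothetical two-column matrix, only to show scanning.
toy_pwm = {
    "A": [1.0, -1.0],
    "C": [-1.0, 1.0],
    "G": [-1.0, -1.0],
    "T": [-1.0, -1.0],
}
print(scan(toy_pwm, "ACA"))
```

This sketch scans only the given strand; a fuller scanner would also score the reverse complement of each window.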
So instead of just looking at that, we might take the relative score and see where it falls amongst all of the different possible scores we'll see within our genome of interest. The definition of a p-value is the probability that you get something of this score or greater under a null model. The null model is that there is no enrichment for a particular motif here. You can simply divide the area to the right of the value by the area under the entire curve, and then you get a p-value. So the closer you get to the max score over here, the smaller your p-value is going to be. Any questions? If you do motif matching, this is often where the p-values are going to come from. The other question we might have about this previous example is: where does the motif come from? I mentioned that it might come from ChIP-seq or SELEX experiments. Now, there are some databases, and JASPAR is the one I use most often. It has hundreds of different sequence logos, or position weight matrices, for different transcription factors that have been curated carefully and come from all sorts of different experiments. So you can use that if you ever want to find individual motifs. Yes. How about other databases? Because I know, for example, TRANSFAC is commercial; you need a license in order to get access to the data, and I guess you have to pay. So are there other databases that are fairly complete? Okay, so first, the question mentioned TRANSFAC. There's another database called TRANSFAC, and to get access to the full version of TRANSFAC you need to pay. TRANSFAC has many more motifs in it than JASPAR. My impression from looking at TRANSFAC is that they have slightly lower stringency in deciding what sorts of results will go into TRANSFAC.
So usually I want more high-confidence stuff, and I don't really worry about not having TRANSFAC in my analyses. The question was really, are there any other databases that compare to JASPAR or TRANSFAC? I am not really aware of any. There's not really an incentive for anyone to create something like that, because we already have TRANSFAC on the commercial side and JASPAR on the free side. So anyone who wanted to do something like that would hopefully participate in JASPAR instead of trying to roll their own thing. Any other questions? Yeah. I've been doing this for a lot longer than it may seem, because I think I look younger than I am, or at least I hope I still do. You see these sorts of things come and go, and there have always been various commercial approaches to various problems in bioinformatics, whether methodological or data or so on. What I've seen time and time again is that at one point these things look better, and a few years later they are surpassed by something that is sometimes completely different and also usually free. So I have acquired my own personal bias from my experiences to think of some of these commercial things as kind of a dead end. Often you don't have a good idea of exactly what is going into the product, or what people are going to do to continue to build it up over the years and keep innovating. So it's something I avoid, and I haven't felt like I've missed out yet, but like I said, that's my own bias, and we'll see what happens over the next few years. Alright, so moving on to the next step here, which is those motifs that we have in databases like JASPAR and TRANSFAC, both the commercial and non-commercial versions. Where do they come from?
So I mentioned there are experiments. There are things like SELEX, and there are things like protein binding microarrays, and maybe in a future version of this talk I should have an overview of those sorts of things. But in the end what you get out of those is a list of sequences, and somehow you have to convert those lists of sequences to a position weight matrix. You also have to contend with the problem that the sequences do not come to you nice and aligned, with everything matching up at a particular column. That brings us to what is called the motif discovery problem. Say you have three sets of sequences, and they are long sequences, hundreds of base pairs long in this case. You know or suspect that there is some sort of short motif in common between the three of them. How do you find those motifs? It's not as easy as, say, doing a sequence alignment of these against each other, because as you remember the motif is degenerate, so you aren't going to find exact matches between the different sequences. So given these sets of sequences, we want to find a number of motifs. We don't necessarily know the width of the motif starting out, and we don't know where the motifs are. This is hard: not only are these things degenerate, the motifs are actually fairly short and the sequences can be fairly long. I'll give you a little example here of how someone might do this. Let's say we're given a set of promoters from genes that we suspect are co-regulated, but we don't actually know what sort of molecular process is driving the co-regulation. So here is a set of genes. Here are a few dozen base pairs upstream of the transcription start site, which we will call the promoter.
We suspect, or we hypothesize, that a transcription factor binds this set of sequences in common. But we have to remember one other problem that wasn't disclosed before, which is that the transcription factor doesn't have our notion of strand. It could bind to either strand, so sometimes we can see AAAGAGTCA here, and sometimes you'll find the reverse complement of that, TGACTC and so on. We'll assume that we can describe the binding motif with a position weight matrix, and we want to find these sites. The way people usually do this is a so-called alternating approach, which is something we use quite often in trying to find parameters for a model given a set of input data. We use this here in discovering motifs. I also use an alternating approach in an entirely different area of science, where I'm trying to model different chromatin states that occur throughout the genome. It's a way that people often use to find parameters. What we do here is somehow come up with an initial set of parameters, an initial weight matrix. You can come up with this completely randomly, or you can come up with it using some sort of initial guessing algorithm. Then we'll use that initial version to predict instances across the sequences that we have, then we'll use whatever we found to slightly refine what we started with, and we'll repeat this process. Various versions of this alternating process give us a guarantee that from some starting position we will find a local maximum. Now notice I didn't say global maximum. That's the main problem with these sorts of alternating approaches: the result often relies a lot on what your initial value is. There are various approaches to that. One approach is just to make sure you try a lot of different initial values, so you don't get stuck in some local maximum that is not near the global maximum.
So I'll make this a little bit more concrete with one example of an alternating method, which is called the Gibbs sampler. There are other methods, like expectation maximization, that are similar in concept but slightly more complex to explain. So I'll show you how a Gibbs sampler for finding motifs works. In this particular Gibbs sampler, instead of, say, randomly initializing the matrix altogether, we are going to pick a random subsequence from each one of these sequences. Just totally random. We don't actually know whether they match up very well or not. Then we'll pick one of the subsequences at random and exclude it. Here we took out number four, and I'll show you why in a second. We take these other four and we calculate a position frequency matrix, which we then turn into a position weight matrix using the way I showed you before, which thankfully is all in your little notebook if you can't remember the fine details. So the point is we take this and we get a position weight matrix. Then, using that matrix, we are going to score the entire excluded sequence, sequence four, so we are going to get scores at every position here. Here is a position that matches that position weight matrix pretty well. There are some positions here that match it a little less well, and some positions that don't match it at all. If we take the scores we have here, we get essentially a 52% chance that there is a match to the sequence right here in this biggest peak. But we aren't going to immediately go to whatever has the highest magnitude, because that's definitely a recipe for going to something that is a local optimum and, even in the nearby neighborhood, is kind of a pessimum.
Instead we are going to use this as a weight for picking something probabilistically. In this case, even though this one only had 20% of the weight, we got that lucky one in five where the algorithm picked this sequence right here. Then we take this sequence that we have picked and put it into our set of sequences, and we repeat the whole thing. So we don't repeat the initialization step, but we do repeat steps two and three. We then throw away a different sequence here, repeat that whole process of scoring and picking using a weighted probability, and repeat that down the line. Then we are going to repeat this whole process a bunch of times, and we will find whatever motif has the highest score for this set of sequences, and that is our motif. So that's how it works. I'm not sure if it's magic or not. Some days it certainly feels like magic that something like this would actually work, because it's not possible to exhaustively go through every possible motif that might match these sequences. But if you start with enough initial values, you will get a value at the end that is often pretty close to what might have been the optimal value, and people have determined that using various simulations. Any questions about this? Yes. Do you have a cutoff for the highest motif score? That is one way you could do this. There are various stopping rules. I would probably say repeat this process 10,000 times and take the best. You might instead take a process in which you say, I'm going to repeat doing this until I have a motif over a certain score. If there actually is a motif here that has that particular score, then you'll find it eventually. I would usually go for something where you instead fix the number of initial random values you're going to start from.
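The Gibbs sampling steps described above (random initial sites, hold one sequence out, build a matrix from the rest, score the held-out sequence, pick a new site in proportion to its weight, and repeat) can be sketched in Python roughly as follows. This is a simplified sketch under several assumptions that are not from the lecture: a fixed motif width, a single strand only, a uniform background, a pseudocount of 1, a fixed number of iterations as the stopping rule, and a single random start. The function names and the example sequences are invented for illustration.

```python
import math
import random

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0):
    """Log2-odds matrix (list of per-column dicts) from aligned sites."""
    pwm = []
    for i in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        # Assumption: uniform background frequency of 0.25 per base.
        pwm.append({b: math.log2(counts[b] / total / 0.25) for b in BASES})
    return pwm


def score(pwm, window):
    return sum(column[base] for column, base in zip(pwm, window))


def gibbs_motif_search(sequences, width, iterations=2000, seed=0):
    rng = random.Random(seed)
    # Step 1: pick a random start position in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in sequences]
    best_starts, best_total = list(starts), -math.inf
    for _ in range(iterations):
        # Step 2: hold one sequence out, build the matrix from the others.
        held_out = rng.randrange(len(sequences))
        others = [
            s[p:p + width]
            for k, (s, p) in enumerate(zip(sequences, starts))
            if k != held_out
        ]
        pwm = build_pwm(others)
        # Step 3: score every window of the held-out sequence and sample
        # a new start with probability proportional to its odds ratio,
        # rather than always taking the top-scoring window.
        seq = sequences[held_out]
        positions = range(len(seq) - width + 1)
        weights = [2.0 ** score(pwm, seq[p:p + width]) for p in positions]
        starts[held_out] = rng.choices(positions, weights=weights)[0]
        # Keep track of the best-scoring set of sites seen so far.
        sites = [s[p:p + width] for s, p in zip(sequences, starts)]
        full_pwm = build_pwm(sites)
        total = sum(score(full_pwm, site) for site in sites)
        if total > best_total:
            best_total, best_starts = total, list(starts)
    return best_starts, best_total


# Hypothetical promoter fragments (made up for this example).
sequences = [
    "ACGTTGACTCAGGCTA",
    "TTGACTCATCCGATGC",
    "GGCATCGTTGACTCAA",
    "CATGACTCAGTTACGG",
    "ATCCGGATGACTCATC",
]
best_starts, best_total = gibbs_motif_search(sequences, width=7)
for seq, start in zip(sequences, best_starts):
    print(start, seq[start:start + 7])
```

In practice one would call `gibbs_motif_search` with many different seeds and keep the best result, which corresponds to the multiple random starts recommended above for avoiding poor local maxima.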
To start with, in this particular example here, we started with a set of potentially co-regulated genes, so we may have started with 500 base pairs upstream of all of these genes, or you can do 2,000 base pairs or whatever. This process becomes very expensive and very slow very quickly, even in 2017, so people try to pick fairly short sequences. For the motif discovery problem I think it's still not really tractable to do it genome-wide, although after you've identified motifs you can certainly scan genome-wide for a motif you already know about fairly easily. Is the motif length fixed? In this particular instantiation of the problem, this solution essentially has a fixed width. So if you don't know the width ahead of time, you essentially have to try a number of different widths and see which motif scores better, and that gives you a variety of different problems. There definitely are approaches people have tried that will do that sort of extension. This simplest version here doesn't, but I've seen versions where people try that. Yes? So the question was, how quickly does this work in practice? When I've done this sort of thing, it's usually been with the MEME-ChIP pipeline, which is designed for doing this sort of thing on ChIP-seq peaks. For that you might have a few thousand peaks where you have sequences of a couple hundred base pairs, and that sort of thing will take an hour or two, I'm not sure. It doesn't use a Gibbs sampler. It uses MEME, which should give you the same answer every time. They use an alternating approach, but they formulate it slightly differently. One of the things about MEME-ChIP is that it does restrict you to a couple hundred base pairs for your search window, and the window size is often what increases the complexity of this the most. If you instead used a few thousand base pairs as your search window, it would take quite a bit longer.
Alright, so let's say you have this motif. Now what do you do with it? Well, the first thing I would do with any motif that I discovered de novo is use a tool called Tomtom, which is part of the MEME Suite. Tomtom can compare your motif against everything in JASPAR, and if you have TRANSFAC you can do that as well, and tell you how well your motif matched against those. This is a good thing to do if you think that you may have a new motif. You can also start by using something like FIMO to just search with what's in JASPAR already, although sometimes you'll find that in individual sequences there might be slight changes. There might be a difference between a motif that was discovered in vitro, with something like protein binding microarrays, and a motif that you discover using your particular set of ChIP-seq data in your particular system of interest. So let's talk about how good this model is. In 1997, Tronche and colleagues tested 50 predicted transcription factor binding sites using an in vitro binding test and found that, of sites predicted using methods very similar to what I've shown you here, 96% were bound. There was some work by Gary Stormo where he found that the best position weight matrices, even though they are usually defined using information theory, actually recapitulate what's going on biophysically, and that you can often have essentially additive effects by column in the biophysical model in a way that's very similar to what's going on in the information-theoretic version of this model. So that's good. What's bad is that if you take one of these transcription factor binding site motifs and you just try to predict where the binding sites are across the whole genome, you will find a lot, a lot of them. And here's a particularly bad example.
So if you use a whole set of profiles from something like an early version of JASPAR and you predict binding sites across the gene, you are going to find lots and lots of places where some transcription factor or another might bind. Really, the whole gene here. So even though people have shown that this model works quite well in vitro, in vivo these predictions that you make based on sequence alone are almost always wrong. Wyeth Wasserman has called this the futility conjecture. You can come up with these wonderful models to do this sort of thing, but in terms of actually helping you predict in the end which transcription factors are going to bind to a sequence of DNA, it doesn't really help you. So with this knowledge, what might you do to improve your predictions? Any suggestions? Yes? Integrate ChIP-seq data, so you know which of these motifs are actually bound. So then why are you doing prediction? That's really cheating. It's like saying, let's avoid predicting by using a gold standard measurement instead. I mean, I like the way you think. In some ways that's a really good answer, but in other ways, sometimes you can't do ChIP-seq. Other suggestions, yes? Integrate dependency between positions. Stop assuming the positions are independent. I mean, you could do that. I think that will maybe increase in vitro performance, taking something like this 96% to 98%. But in the end you're still going to have stuff like this. Any other suggestions? You want to reduce the number of false positives. How do you do that? That is definitely something you could do. You guys are obviously way too sophisticated, because if you have some sort of method for scoring things and you want to decrease the number of false positives, there's a really easy way to do it. What's that? You pick the top ones.
You have some sort of threshold, and you increase the threshold so that you only get the hits above it. So, after I've baited you guys into giving that answer: that's not going to work. But I think it's important to say why it doesn't work. If you make the threshold more stringent, that doesn't actually help things very much, because it's not a poor fit to this model in terms of the local sequence that's keeping our predictions from matching reality. It's other aspects. I think Brian's answer there is actually a very good one: if you can incorporate information about where the regions of open chromatin are, that's going to help you the most. We'll talk about that a little more in a second.
All right, so what have we learned in this part? First, these models are very, very good for describing what's happening in vitro. In certain cases in vivo I think they're going to be a very good model of what's going on as well. But there's other information that you need to determine whether a place in the genome where there is a good match is actually bound.
All right, a little later in this lecture we'll get into some of the ways that you can use additional information. But for now I want to talk about something a little more practical. Let's say you have that very problem I presented before: you have a set of co-expressed genes and you want to figure out what's driving their regulation. You're not really interested in discovering a new motif or writing your own Gibbs sampler; you just want a good hypothesis as to what might be driving this co-expression. All right, so let's say you have a set of co-expressed genes, and you got this from some RNA-seq data or microarray data or something.
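A back-of-the-envelope calculation helps show why a stricter threshold alone does not rescue genome-wide scans. The sketch below counts how often one fixed word is expected to occur purely by chance, under simplifying assumptions that are mine rather than the lecture's: equal base frequencies, independent positions, and a genome size of roughly three billion bases. The function name is hypothetical.

```python
def expected_chance_matches(genome_size, motif_width, both_strands=True):
    """Expected exact matches to one fixed word of length motif_width,
    assuming independent positions and equal base frequencies."""
    per_position = 0.25 ** motif_width
    positions = genome_size - motif_width + 1
    strands = 2 if both_strands else 1  # palindromic words are double counted
    return strands * positions * per_position

HUMAN_GENOME_BP = 3_000_000_000  # rough size, for illustration only

for width in (6, 8, 10, 12):
    n = expected_chance_matches(HUMAN_GENOME_BP, width)
    print(f"{width:>2} bp word: ~{n:,.0f} chance matches")
```

Even demanding a perfect match to a fairly long word leaves many chance occurrences, and most real motifs are less specific than a fixed word of the same length, which is consistent with the point that local sequence fit is not the limiting factor.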
All right, you also have a set of genes that you know are not co-expressed under these same conditions: either they're consistently not expressed under, say, the two conditions you're looking at, or they're consistently expressed, and so on. Can you figure out what motif is driving the co-expression? So here we have an example of one motif, and you find that motif in various places in the co-expressed genes, maybe once upstream of this gene and a few times upstream of these genes, and you never find it in the negative controls. If you see this happening, it's a pretty good indication that this motif might be responsible, and it's certainly worth looking into further.
All right, how do you do this? There are a variety of different tools. Later this morning we are going to show you a tool called iRegulon, which will solve some of this problem. Veronique will lead a lab that shows how you can look at a bunch of genes that you think might be co-regulated. iRegulon is a plugin for Cytoscape, so you can get your set of co-regulated genes from something else in a Cytoscape workflow and then try to find whether there are motifs upstream. It will also incorporate actual ChIP-seq information, so you don't have to rely just on the motifs if you have something more. There are various other tools that can do this sort of thing: there are tools within the MEME Suite, and there are tools like oPOSSUM, but we are going to show you iRegulon because it hooks into Cytoscape, and we have already spent some time working on that.
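The comparison between co-expressed genes and negative controls can be made quantitative with a one-sided hypergeometric test. The sketch below is only an illustration of that idea and is not how iRegulon or oPOSSUM score motifs; the gene names, the counts, and the function name are invented for the example.

```python
from math import comb

def enrichment_p_value(k, n, K, N):
    """One-sided hypergeometric P(X >= k).

    N: all genes considered (co-expressed plus controls)
    K: genes among them with at least one motif hit upstream
    n: co-expressed genes
    k: co-expressed genes with at least one motif hit upstream
    """
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / comb(N, n)

# Hypothetical gene sets (names and numbers are made up).
coexpressed = {f"coexp{i}" for i in range(1, 6)}
controls = {f"ctrl{i}" for i in range(1, 16)}
genes_with_motif = {"coexp1", "coexp2", "coexp3", "coexp4", "ctrl7"}

k = len(coexpressed & genes_with_motif)
p_value = enrichment_p_value(
    k,
    len(coexpressed),
    len(genes_with_motif),
    len(coexpressed | controls),
)
print(f"{k} of {len(coexpressed)} co-expressed genes carry the motif, p = {p_value:.4g}")
```

When many motifs are tested against the same gene set, the resulting p-values would need a multiple-testing correction before any of them is taken seriously.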
Another problem that we have, and this is a problem that goes beyond what you can address by finding regions of open chromatin, is that there are sometimes families of transcription factors that bind very similar sequences, for example the ETS family or the GATA family. With GATA1, 2, 3, 4, or 5, it's very challenging to figure out whether a particular motif is GATA1 or GATA3, because you've got the GATA in either case. So here's an example of the ETS family: lots of different genes. How do you find which one it is? There is software called TopGene that will help you do that sort of thing, so if you ever find yourself with that sort of problem, that's one way to do it. Another way, something we've been working on in my lab, is trying to find differential models between transcription factors like those in the GATA family by developing models using knockout data. You can come up with a better motif for something like GATA3 if you collect ChIP-seq data in cases where you have GATA3 knocked out or not, or maybe you have GATA2 and GATA4 knocked out, and so on and so forth. It helps you refine these sorts of differences that are otherwise hard to find in vitro. All right, any other questions?
All right, so let's move on to the last part of this lecture, which is incorporating information about the biochemistry of gene regulation. All this wonderful data from things like ENCODE that I showed you at the beginning: how can you incorporate this sort of information to limit your search space when you look for regions where a transcription factor might be bound? One thing you can do is use annotations from tools like Segway. Segway is a computational method from my lab that takes lots of different kinds of data from ENCODE. For some cell type, ENCODE has generated ChIP-seq data on lots of different histone modifications, open chromatin data, and things like Pol II and CTCF.
Segway does unsupervised pattern discovery: it figures out what sort of patterns you find across multiple data sets at various positions and then turns that into an annotation that is cell-type specific. It'll tell you where the starts of genes are, or where the enhancers are, or where the repressed regions of the genome are. So what was here maybe 30 different tracks, 30 different data sets that each have different patterns like this, you can instead turn into this much more digital pattern, where you say, okay, this looks like a gene start and this looks like a repressed region, and so on. And that makes it easier to interpret things. Here's an example of interpretation from a GWAS, where we're trying to figure out which of the single nucleotide variants at issue fall in the various categories outlined by Segway.
You can also use this in trying to understand where transcription factor binding motifs work. If you are scanning for places where your particular transcription factor might be binding, you probably don't want to scan the regions that you already know are quiescent in your cell type. Instead, maybe you want to focus on the enhancers. Maybe you want to focus on the regions where the epigenomic signals look like promoters. Or you want to focus on regions where you know there's open chromatin, and so on. Segway will get you that. Here's another representation of the Segway data, where it says this region right here is regulatory and these regions are TSS-flanking. And even though we have this information from GENCODE, where we can see there's a bidirectional promoter here, Segway doesn't actually incorporate that information into its analysis, so it's much less biased towards what people already know. You can find regulatory regions like this in plenty of places where there aren't actually transcription start sites annotated. And then there's the problem from before of what the promoter is: well, we'll just take the TSS and go 2,000 base pairs upstream.
Now you don't actually have to do that anymore. If you have epigenomic data for your cell type, you can see which regions look like they have promoter activity and focus there. You can go to segway.hoffmanlab.org if you want to look at the annotations. You can click here and it will load into the UCSC Genome Browser, and you can see yet another representation of Segway data, where we have more than a dozen different cell types and the colors represent different Segway categories. This is the alpha-globin locus, so here's HBA, and you can see that this region seems to have repressed transcriptional activity in most of these cell types. The exceptions are H1 embryonic stem cells, where there seems to be some transcriptional activity at the alpha-globin locus, and K562, which is a blood-lineage cancer cell line, where the whole region is lit up like a fireworks display. It's very active all through this, which is what we would expect: hemoglobin is very active in blood precursors, or cancerous versions of blood precursors, and so on.
Yes? So the question is: this is great (well, "this is great" wasn't part of the question, but I'm going to say it anyway) for those cell types where you already have a bunch of data and someone's already done this for you, but what if you don't have any data like that? And what should you collect? What sort of experiments are best?
I think the single best thing you can do is collect open chromatin data. If you can do ATAC-seq on your cell type or organism of interest, that is the single best thing you can do. Although I no longer do wet lab experiments myself, I am told the ATAC-seq protocol is really straightforward and people have a fairly easy time doing it. Sorry, what's that? It's hard to say; I'm not really sure. It's a relatively small proportion. I would estimate something like 10% of the genome is near an open chromatin region in some particular cell type, but it changes from cell type to cell type. So if you add up all of the different cell types in some organism, like, say, in ENCODE, you might find that, say, 50% of the genome is open chromatin in some cell type or another. These are all rough estimates; even though I know I'm being recorded and put on the internet, please don't quote me on this sort of thing.
So if you can get open chromatin data, that's great, and you won't need anything like Segway. If you want to do a little bit more, histone modification ChIP-seq is the next thing that I would do. For something like Segway, I think it's useful as soon as you have two different data sets you want to integrate; if you just have one data set, you're probably better off just peak-calling that one data set. But if you want to add things like H3K27 acetylation, H3K27me3, H3K4me1, H3K4me3, and H3K36me3, that would be the starting set I would start with. A lot of this depends on how much you can afford. The nice thing about histone modification ChIP-seq is that it's often a little easier to do than transcription factor ChIP-seq, in that the antibodies are more established. Yeah, what's that? They're better, yeah. So it's more within the realm of possibility to do that than certainly to do ChIP-seq for every transcription factor you might be interested in.
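Restricting motif hits to open chromatin, or to regions carrying selected annotation labels, amounts to an interval lookup. The sketch below shows one way to do it for a single chromosome with half-open coordinates; the function names, coordinates, and scores are made up, and in practice a tool such as bedtools or a genome-interval library would usually do this job.

```python
from bisect import bisect_right

def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]

def hits_in_regions(hits, regions, motif_width):
    """Keep (start, score) hits whose whole motif window lies inside a region."""
    merged = merge_intervals(regions)
    starts = [s for s, _ in merged]
    kept = []
    for hit_start, score in hits:
        i = bisect_right(starts, hit_start) - 1  # last region starting at or before the hit
        if i >= 0 and hit_start + motif_width <= merged[i][1]:
            kept.append((hit_start, score))
    return kept

# Hypothetical open chromatin peaks and motif hits on one chromosome.
open_regions = [(100, 200), (150, 250), (400, 450)]
motif_hits = [(90, 7.2), (120, 7.2), (244, 7.2), (245, 7.2), (300, 7.2), (410, 7.2)]
kept = hits_in_regions(motif_hits, open_regions, motif_width=6)
print(kept)
```

Requiring the whole motif window to sit inside a region is a design choice for the example; allowing partial overlap is equally defensible and only changes the comparison on the last line of the loop.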
In something like Segway, I actually have a grad student who's working on an RNA-seq version of Segway, which is called SegRNA. One thing to consider is that all of these chromatin data sets are not really stranded: if you have a region of open chromatin, it's open chromatin on the plus strand and the minus strand. But if you want to incorporate RNA-type data, you need a slightly more complex model. Are there questions on this chromatin stuff or ENCODE?
So, another tool that might be of interest to people, and this is not strictly speaking motif analysis: if you have some regions of interest and you want to figure out what sort of biological process or molecular function ties them together, you can use a tool like GREAT. It's very much like gene set enrichment analysis, except that it's genomic region enrichment analysis. GREAT has a slightly more sophisticated approach: it takes a bunch of regions that you got from something like an enhancer assay, or ChIP-seq data, or looking for ultraconserved regions across organisms, and it maps the annotations of individual genes onto entire nearby regions. It will tell you how close these regions are to TSSs, and it will give you a set of Gene Ontology terms, or it can also draw on other sorts of annotations, like Pathway Commons. So you can see that whatever data set was given here gives you things that are related to the immune system, and so on.
One other thing that I think should be of interest: even beyond finding regions of open chromatin, one big part of the puzzle in mechanistic models of transcription factors binding to sequence is that transcription factors often work together. It's often not just one transcription factor that binds a particular position. Often there are pairs of transcription factors that work together, and sometimes there are transcription factors that compete with each
other. But in the cooperative case, what you can do is look for a couple of different transcription factors whose motifs you find consistently near each other, maybe with some consistent spacing between them. You can imagine how this might work within the nucleus: if you have some transcription factor here, and it has some sort of rigid interaction with another protein here, you're going to find that these two motifs always have a very similar spacing between them, because it's defined by the physical interaction between these two transcription factors. There's a tool in the MEME Suite designed for this particular problem, which is called SpaMo. It will take some sequences and motifs from something like JASPAR or UniPROBE, and it will tell you how close you find one transcription factor motif to another. It's a good thing to look for if you want to find any of these telltale patterns, like, we always find this motif exactly 11 base pairs away from this other motif. Sometimes you find it, sometimes you don't.
So for example, we have a set of genes, and the hypothesis is that two or three transcription factors preferentially bind the promoters of these genes compared to anything else in the genome. Is this the approach, or would you do something else? So the question is, should you start with the hypothesis that your co-regulated genes are regulated by a couple of different transcription factors? I think it's a good thing to look at. If you have a set of genes and you're trying to make sense of it, why not use all of the different tools at your disposal? You certainly can do things like gene set enrichment analysis, which almost by definition is not going to tell you anything new about any of these individual genes, because it relies on annotations that someone had to curate for all of these individual genes. So the nice thing about taking, say, a sequence motif discovery or
identification approach is that you might find in this set of genes some motif that maybe no one has found before. And because of the futility conjecture, you can find any individual motif in the promoter region of any particular gene, so someone might not have noticed that it is enriched. It's only in the set of co-regulated genes that you have the statistical power to actually detect any sort of meaningful enrichment of this particular motif. So you should do it all.
Yes, a question about chromatin modifiers like HDACs and HATs. So the question was: if you have some chromatin regulator you're interested in and you want to figure out what transcription factors might be interacting with it, what's the best way to do it? You need some sort of data on the chromatin regulator to start with. If you were to do this in a genomic way, you would need something like ChIP-seq on your chromatin regulator, and then you could look for motifs within the regions that you ChIPped. You could use something like MEME-ChIP, and you might discover a motif. That motif is probably not going to be a motif for your chromatin regulator; it will be a motif for some transcription factor that your chromatin regulator interacts with. So that's what I would do here. Of course, you could also take proteomic approaches. What is that technique called? I have all these colleagues who do this biotin-associated technique, and I can't remember what it's called. Anyway, yes, you can use a proteomic technique to figure out the different proteins that interact with some bait you're interested in, like your chromatin regulator. But if you wanted to do it based on genomic data, I would think doing ChIP-seq of your chromatin regulator
plus MEME-ChIP would be the way to go. Then, after you've identified those motifs, it would be much easier to do targeted validation experiments for your hypothesis that the transcription factor that binds the motif you discovered is interacting with your chromatin regulator, which I would want to do before submitting something like this for publication. Other questions?
All right, so, yes: can you use these tools to look for repressors too? Yes, you certainly can. If your repressor has a well-defined motif, then it works just fine. If your repressor interacts with a transcription factor that has a well-defined motif, it will work just as well. To some extent these methods aren't going to distinguish in vivo between something that primarily interacts with the sequence and something that interacts by virtue of the same transcription factor every time.
All right, so I've shown you some of the state of the art. There are big challenges ahead. One is that the action of transcription factors is very cell-type specific. Things like ENCODE have gotten us more data on individual cell types, but if we want to understand how transcription factors control development, we are going to need a lot more data like that. We are going to need models of how individual transcription factors start to become expressed in individual cell types and how that leads to a cascade of other transcription factors engaging in transcription initiation.
Genetic variation in transcription factor binding sites is a topic of interest to people, especially now that people are starting to tap out more of the genetic diseases that are caused by changes in protein-coding sequence. There is a lot of disease that is probably caused by variation in non-coding sequence that acts by causing some transcription factor to bind at a position or not. People are developing methods that try to predict how different variants will affect transcription factor binding sites, both in genetic diseases and also in
different kinds of cancer, because it can also be something that drives different gene expression programs there as well.
Integration of data sources like ENCODE or Roadmap Epigenomics is something we still need to develop more. A lot of these methods were developed before we had those sorts of data, and while we can do simple things like take all the Segway enhancer labels and only do our motif search there, that is a relatively unsophisticated way of dealing with this data. We'd better have methods that incorporate it more directly.
There is also ongoing interest in transitioning away from these matrices that assume independence at every position (this was your question at the very beginning; we are finally back to it), toward things like energy models that can assume some interdependence between positions. I have seen recently some k-mer models where, instead of having a parameter for every position and base combination, you just enumerate every possible k-mer and have a parameter for each one of these k-mers. If you can collect enough data from something like protein binding microarrays, that works fine; if you have limited data because you are using something like ChIP-seq, it is less clear that it works. So I think that will help a little, but integration of things like open chromatin data is probably more important.
And finally, there is still a lot of complexity that we need to incorporate. I think one of the biggest areas of research in understanding regulation over the next few years is going to be the fact that the genome is three-dimensional. We keep seeing more and more Hi-C data. People may have seen, two weeks ago, that incredible single-cell Hi-C method where someone got a complete model of where every base of the genome was in three dimensions in one particular nucleus. People are starting to get a better understanding of how the 3D architecture of the genome changes from cell type to cell type within the development of some particular organism, and that is going to become very important, especially in reducing the
futility of trying to figure out exactly how transcription factor binding motifs drive differential transcription and gene expression.
So I'll just leave you with a few reflections. I'll mention the futility conjecture again: these models work very well in vitro, and they work very well in vivo under defined conditions, but you can't just do a genome-wide scan of a motif you've identified and expect to come up with useful results. If you want useful results, you need to add additional information. One thing people can do is make the search space smaller by, say, identifying co-regulated genes using some other method. Another thing you can do is focus on regions of open chromatin within some cell type. If you do have that sort of smaller search space to start with, this can be a very effective way of figuring out what transcription factors might be driving the activity of the regions you've identified, as long as you are very careful how you do it. In the lab, Veronique will show you how you can use iRegulon to accomplish some of these things, and for now I'll open up to any final questions.