 I'm Michael Hoffman. I'm a scientist at the Princess Margaret Cancer Center, and I'm an assistant professor of medical biophysics and computer science at the University of Toronto. And today I'm going to talk to you about gene regulation network analysis. So there are several objectives to this learning module. We're going to talk a lot today about transcription factor binding and how you can identify transcription factor binding sites with computational tools. And also, what are some of the things you can do using computational tools? And also, one of the some of the things you can't do. And we'll talk a little bit about using tools like Great, and then after this lecture, we'll have a lab where we will learn how to use iRegulon. So we want to actually learn how to use MumeChip even though that's what it says. We've replaced it with iRegulon because it is cytoscape based much like the other tools that you've been using during this workshop so far. So there are five different parts of this lecture. So I'm going to give you an overview of the biology of transcription. Then we're going to talk a little bit about how you can predict where transcription factor binding sites are, then how you can find novel examples of transcription factor motifs, how you can discover that, and then how you can use chip-seq data along with transcription factor models to identify transcription factors that might be important for some process that you are interested in. So this will be directly anticipatory of the lab that comes afterwards, which will be on combining that with gene regulatory networks of the type you've been using the rest of this workshop. So, and of course as always, if you have any questions at any point, please feel free to raise your hand. So first, introduction to eukaryotic transcription. So how many of you hear work on eukaryotes? Everyone? Are there bacterial or any bacterial or are kale people? Yes? Okay, well we're going to talk mainly about transcription and eukaryotes here. So you know, it's quite some of the same processes work in bacteria, but it's quite a bit more, quite a bit similar. Sorry, quite a bit more simple. So basically, transcription in the eukaryote, if you really oversimplify things, does this show up? Mouse? Yes. Okay, so it involves a series of transcription factors and each one of the transcription factors, or at least many of the transcription factors, are sequence specific. Right, so here we have a transcription factor, this dark blue block, that is specific to a C-A-G-G-T-A, you know, it seems to prefer binding to something that looks like that. And it's a little bit degenerate and that you can probably change some of those those bases and its binding motif and it'll still bind, but it might affect the binding energy, how well it binds, how often it's actually bound versus going off into solution and all of those things will affect these downstream effects. So this transcription factor, maybe along with other transcription factors, will recruit RNA polymerase 2 and then RNA polymerase 2 will go along the DNA and it will produce your RNA transcript. So this is kind of the simplest possible model for eukaryotic transcription. There are, in reality, it's a slightly more complicated, so let's add another layer of complexity here. So first the parts around the transcription start site can be divided into different things such as the core promoter, which usually has several different types of transcription factor binding sites. And there's usually or often a distal regulatory region that's not nearby the core promoter that has transcription factor binding sites as well. And sometimes those can also be downstream from the transcription start site. So they might be a distal regulatory region and an intron downstream within that same gene, or maybe even in a different gene altogether. So all of this means that you can't just look at whether one transcription factor is binding. There are estimated to be between 1,400 and 2,000 transcription factors and humans. We have not even characterized all of them yet. So we might be off by several hundred. There's a lot we don't understand. They all work together, but in reality it's a bit more complex even than what I just told you about before because you have to consider the fact that eukaryotic DNA is organized in three dimensions. So not only do you have all of these proximal gene right here, sorry proximal transcription factor binding sites here, and you have the distal transcription factor binding sites, but the reason the distal transcription factor binding sites are able to affect transcription right here is because they aren't truly distal. In three dimensions they're actually proximal. So they're right here, but there might be some big loop, some big chromatin looping that occurs that makes it look in one dimension like it's very far away from the transcription start site it's activating, but inside the actual nucleus it's very close within three dimensions. But of course whether that is close or far, that's something that can change. So the proximal promoter is always going to be right here, but if some more of this, to use a metaphor, string gets let out, we might find that this part is now proximal to the transcription start site instead. So speaking of strings, you might see that this is wrapped around all of these nucleosomes and that's another thing that will affect whether certain transcription factor binding sites are active or not. So the transcription factor binding site is wrapped around nucleosome, it might be inaccessible to some of these transcription factors. And some transcription factors are better than others at knocking away nucleosomes and being able to actually bind to things that were previously bound by nucleosomes and those are called pioneer transcription factors. So we need, in order to accurately, yes, go ahead. Can you just give an example of a what? Can you give an example of the pioneer transcription factor? No, I can't. And some people, there's often dispute over the pioneer transcription factors that exist, but that's one model that they do. CTCS probably works quite well. So there are all of these different parts. So having the sequences is not enough. If you want to understand how transcription is, what sort of ingredients are there to allow transcription, you need to understand all of these other things about the structure of chromatin. Yes? Can distal enhancers be on another chromosome? So I am not aware of any examples right now of that happening. There's one model that, say, within the human nucleus, there are these so-called chromosomal territories where you'll find specific parts of multiple chromosomes always tend to be together in three dimensions. So there's probably some sort of regulatory effect there, if that is in fact really happening, but it's probably something people wouldn't refer to as an enhancer. So it's another thing to remember about this is a lot of our understanding of 3D interactions of the genome are really in their infancy right now. So we're just starting to get really good data on what the genome actually looks like in three dimensions, what is consistent from cell to cell, what changes in what's noisy. So there are a lot of questions about how things work in three dimensions that we don't know. So yeah, I'll just say I certainly could see that as a possibility, but to my knowledge no one has directly shown that as a mechanism for gene regulation yet. So yeah, there are a lot of ingredients that we can look at. So one, simply the RNA. So we're thinking about this as one part of a fairly complicated machine, but in the end the way, so I have a computational biology lab and I'm focused on understanding transcription. And I think of it as you have a machine, DNA sequence comes in and there are various other sorts of inputs, maybe in the form of transcription factors, maybe in the forms of epigenetic modification. And what comes out is RNA. So sometimes there'll be RNA expressed transcribed here, sometimes they're wrong. That's kind of the output of the transcriptional machinery. But it interacts with all of these other things like epigenetic marks, so histone modifications. You can directly use chip-seq to look at where RNA polymerase 2 is occurring. You can look at where five prime ends of RNA are being produced using a technique called CAGE. And that can be very important in understanding how transcription actually occurs. Because RNA-seq won't actually give you a picture of where transcription start sites are. It won't give you a picture of where RNA is. So if you just try to infer where transcription start sites are from RNA-seq alone, you can be misled because there are things like exonucleases can come and need away five prime end of your RNA-seq, for example. So CAGE is a technique that looks for a five prime cap to an RNA and is supposed to tell you where the transcription start sites are. So these 1400 to 2000 different transcription factors I mentioned, you can figure out where many of those are using chip-seq. Chip-seq relies on having really high quality antibodies against your analyte. So often you cannot figure out where things are with chip-seq because we don't have the high quality antibodies for that. So there's a few hundred transcription factors out of the 1400 to 2000 that we have really good chip-seq data on. So RNA-seq data, people can use it to quantify which genes are expressed and to a certain extent which transcripts are expressed. So the data is going to cover all of the exons, depending on how you do it, it'll cover up the whole gene, but exons of the transcripts that are actually expressed. So you might be able to figure out... So I think you have a good idea of which genes will be expressed. You might have a good idea of which is the first exon if you're worried about alternate sites. But something that I think is not often taught is that transcription start sites in eukaryotes often are not sharp at one particular position. Often there's a range of transcription start sites where the first exon can begin. So it might begin here, it might begin 100 base pairs upstream, it might be in 50 base pairs downstream, and certain kinds of genes are more likely to have these sort of broad transcription start sites. And trying to figure out where exactly the TSS was and therefore which transcription factor binding sites and apogenetic factors are causing transcription is probably not going to be very easy just with RNA-seq data. So that's why people like HDA. So we got all this stuff around promoters, transcription factor binding sites, and if you want to figure out stuff about regulatory regions, we can, again, do chip-seq to look for apogenetic modifications. So there will be a different set of histone modifications that you'll find at enhancers. You'll also find things like enhancer RNA. So you can use RNA-seq to figure out if you do it the right way to figure out which regions are maybe driving enhancers and are not actually driving gene transcription themselves. You can also look at, so they're co-activators, they're a number of, I might call them, transcription factors, certainly proteins that are known to interact with distal enhancers. So the best known of these is P300, the EP300 gene, which is, and it's 300 kilodollars, it's this huge scaffold that will allow lots of different transcription factors to kind of interact together in the same region. So that is often referred to as a transcriptional co-activator. So, yes, I can do that. So, I don't have any slides on it, though. And can I erase some of this stuff, or is there a good place for me to do this on the things that are easier? Okay, so we'll just consider this line as a piece of double-stranded DNA. We'll make it simple. We know it's all tied up in nucleosomes and compacted, but we'll just keep it simple. So let's say there's some protein that we are interested in. Let's say, for example, here, it's a transcription factor. And we want to know where are the locations of that protein. So what we do in chip-seq is we take ourselves and we add something like form 1 to do some cross-linking. So we make all the proteins stick very well to the DNA. So they're now chemically modified. So they're stuck in there. And then we will do what's that? We'll do sonication, or we'll do something. You can do other things like M&As digestion. So we'll chop up the DNA into bits. So you might try to get bits of DNA. There are 200, 300 base pairs in length. And then we have some antibody that's specific against our protein adventurous or histone modification or something. And maybe there's some off-target effect right there. Maybe it doesn't work so well, so it doesn't bind to this one. And then you do an immunoprecipitation pull down. And that's the chip, means chromatin immunoprecipitation followed by sequencing. So after doing your immunoprecipitation, then you de-cross-link. You purify the DNA. So you're left with these pieces of DNA. And then you sequence. So we'll sequence. We'll get these ends. We'll get these ends. This is a very strange thing. And then that might get ignored during your processing because it doesn't really look like how we expect it to. And the other stuff, you'll have done this on probably with traditional chip-seq protocols, hundreds of thousands to millions to tens of millions of cells. So every place that there's actually a consistent transcription factor or mark, you'll get kind of a pile-up of reads on one side or the other. So then you'll get something that looks like this depending on which side you're on. And then you can use tools. There are a lot of different computational tools, like Macs that you can use to combine these two sides and then get back an idea of where your original analyte was. So that's how chip-seq works. Any questions about that? Yes? What's this strange thing? This thing? This? Oh, that was just somewhere where... So this is an imperfect protocol. So this is somewhere where you got a piece of DNA through the process that didn't actually have your analyte bound to it, so it's a little bit of noise. But since you did this on hundreds of thousands of cells, you only get one of those instead of thousands of it. Also, it's really short, right? So these other things, like you expect this peak to be, you know, that distance from this peak and you won't find it for that long. Yeah, it's an artifact. Yeah, so it's not... No, it's an artifact that will come out of the biochemistry, but it will be processed away during computation. So chip-seq, that gives us a good segue into the next part of this, which is mentioning places where you can get this sort of data. So there's a fair amount of chip-seq data, especially for humans, mice, drosophila, worm. You know, people are starting to make more for various model and non-model organisms. You can find a lot of the human and mouse data, human mouse, I guess some of the flying worm data, also at the encode project, encodeproject.org. So they have done thousands of experiments, many of them chip-seq on various transcription factors and lots of different cell types. So that's another thing to consider is that, you know, the part, you know, if you have a human, like which transcription factors and histone marks are found in liver cells is going to be very different from, say, brain cells, and that can be very important in figuring out how transcription factors interact. Roadmap epigenomics also has a fair amount of chip-seq data, mainly focused on histone modifications, you know, and there are other things like UCC Genome Browser has a lot of different interesting data sets to look at. A lot of these are in sort of the regulation category if you're looking at a human or mouse genome browsers. So that's sort of data that you can get on transcription. Yes? What are we looking at as the whole complex? Well, I don't know. What do you mean? Well, you take your cell in a particular condition and then you do the chip-seq analysis. Yeah, but you're only doing chip-seq for one at a time. For one transcription factor at a time? Yeah. Yeah. How? Well, you've got the antibody for one transcription factor. Yeah. So you can only look at one at a time. I mean, you can look at, like, if you use something on mass spec, you can look at all the proteins that are in the cell, but it won't tell you where they are. Right? Yeah. So you've got to choose between figuring out, at least for right now, you've got to choose between figuring out the totality of all the transcription factors and epigenetic modifications somewhere, and figuring out where within the 3 billion positions in a mammalian genome you might find it. Yes? Yeah. I think we'll get to some of that later on. I'm just kind of setting the stage for sort of the biology and the sort of data that we can get. So we'll move on to the models unless there are other questions at this point. So, yeah. So that's exactly the next question, is how do we figure out where transcription factor binding sites are? How do we do it with some of this data? How do we do it maybe without some of that data and use some of that data as sort of validation? So let's look at a transcription factor. As I mentioned earlier, transcription factor binding sites are degenerate. So you might find, say, a transcription factor, maybe you find somehow, whether it's by chip C or chip PCR or something, that it binds to this single site that has A, G, T, T, A, A, T, G, A. All right? You're also probably going to find it binding to a lot of other things that are similar. So you can see here, like we have a set of lots of different binding sites for this one transcription factors. And you can see a lot of similarities. This core, A, G, T, T, A, A, T is usually there. But sometimes this first symbol is A or C or G. Some of the other things change a little. So you can represent that as a consensus sequence. So many of you have probably seen R used as an abbreviation for A or G. Who has seen that? You guys, you've all seen that. Or Y means C or T. So there's a whole other alphabet for every possible combination of A, C, G, or T. So like V means A, C, or G. D means A, G, or T. R, as we discussed, means A or G, and so on. So you can represent all of the parts that are variable here. So for example, you know that this fourth position, if you look in all of these, it's always T. But this position before it might be G or A. And this position before it might be A, C, or G. But there's something that we're missing here. So this is a very compact representation, but it eliminates some of the information we have, which is that when you have these degenerate positions, the frequency is not distributed identically. So if you look at this first position and count up how often you see different symbols, you see A 14 times and C 3 and G 4 times. So it's not really a good model to just say, this is V, meaning A, C, or G. So you want something a little more complex. And so people go to this sort of sequence logo representation where you will have a bunch of columns, one for each position within the motif, and at each column you'll see all four symbols, A, C, G, and T, where the height indicates how often that is actually found in your known motifs. So this case where we have column 4, where it's T 100% of the time, that's the highest value here in the sequence logo. And here on the third position, you can see the G is relatively frequent, more common than the A here, and C and T are relatively uncommon. So I think this sequence logo, which you've probably seen many times in papers, I find almost everyone has seen them even if they don't really understand the mathematics behind them because it's a great visualization of this. I think it's really intuitively good to understand. But we're going to talk a little bit about what it means under the hood. So I'll mention this equation is a little messed up here, but I checked and it is correct in your printouts, so you can look there if you get confused at any point. So what we have from simply the set of binding sites, we can start counting up the number of times we see each symbol, A, C, G, or T, in each column here. And that gives us something that we call a position frequency matrix. So we might have, we're going to simplify slightly here, a set of binding sites, so five binding sites, and we always find A in the first position, and then maybe in the second position, we find C two times and G three times, and a third position is kind of a mix up of various things. The fourth position is C four times, T one time, and so on. And so you can represent the number of times you see each symbol in each position as this matrix. Starting from that observation, we can then convert the matrix into something that is a better model for actually searching things across the genome. So the first thing that we're going to do is we are going to take this frequency, so we'll call this FB comma I. So F means frequency, B means base, A, C, G, or T, and I means the column. So this right here is FC three, because it's third position for symbol C. So we have that value, and one thing that we can do is we can divide by the overall frequency of that particular base in the genome, because we want to convert this from something that describes how often we've found this particular base in a set of input sequences to something that tells us how surprised we are to see this base in a genome. So essentially we need a background model. And if you have, say, a CG rich motif and you're looking for it in a CG rich genome, that's not going to be very surprising. If you're looking for it in a genome that is very AT rich, that is going to be a little more surprising. So by dividing by the background frequency of that base, and the simplest way to do it is just to take the number of times you see that base in the genome divided by the number of bases in the genome overall, you can kind of get an estimate of how surprised you are as well. And then one other thing you can do is you can add this SN. So that is essentially a way of waiting how many observations you have. So this SN is something called a pseudo count, and often we'll just add one to everything. So basically we only have five observations. We don't want to set up a model that has a zero right here and says this can never be T based on only five observations. So we'll just add one to everything. So instead this will be 6, 1, 1, 1. And this change from the pseudo count is going to scale based on how many observations you have. So if your position frequency matrix had 100 observations in it, adding that one pseudo count isn't going to change things very much. So essentially kind of flattens out the distribution a little. And then the final thing that you do is you take the log of all of this. And that makes the arithmetic a lot easier. It's a lot faster, and you probably think your computers are pretty fast at arithmetic in general. Well, they are actually a lot faster at adding than they are at multiplying, just like you would be if you were to do things, and especially then dividing, if you were to do things out longhand with big long strings of numbers. And they're also more reliable that way. So sometimes people use these tools and they'll get p values of like 10 to the minus 2,000 or something, which is a number that you can't even usually represent in the sorts of numeric systems you usually use. And people say, how is this even possible? Well, if you do it all in log space, it is quite easily possible. So here is an example of a position-specific scoring matrix, or as people will often call it a position weight matrix. They're kind of two different names for the same thing that you get from this position frequency matrix. And so one thing you can see is that this 5 has a positive score here, 1.6, and it's minus 1.7 for seeing any of the other things that you saw, 0. And you can see here how we might use this position-specific scoring matrix to score a particular sequence for this motif as represented here. So if we want to test TGCTG, we can simply look at each column within the position-specific scoring matrix we've made and add up the numbers. So we have a T here that's minus 1.7, plus 1.0, plus 0.5, minus 0.2, plus 1.3. You get a total of 0.9. Does that make sense? Yes. So there's only one thing that matches. So this is after we've already created... So we have a discovery set, right? And we've created this position weight matrix based on it. And then that's fixed. And now we're applying a new sequence to it and saying, does this sequence match this motif that we've already discovered, yes or no? So there's only one way to do this because we don't look at the highest score. This only fits in one place because this is exactly the same length as the motif. This is 5, 5 across, and so is this. Well, this is T. So this is the sequence that we're testing against the motif, right? So in the fourth column of this, this is T. So that's what we have to do. So, yes? It's a new sequence, yes. So you had a bunch of sequences that you know are, say, bound by Mick, right? For example, one transcription factor. You create this motif. I give you a new sequence, ask you the question, is this sequence bound by Mick, yes or no? And so you use this model that you created from your previous observations to test the new sequence that is unknown binding. It doesn't become this. So this becomes this, right? But the TG CTG is not part of your model. So this is your initial model. You convert it to this model, right? And then you test this new sequence against your model. So this does not convert it to this. You test this T against this. And as you can see, in the first column, you get minus 1.7. So having that T there really hurts you. It says, you know, this T does not fit this model very well at all because the, you know, it should be an A here. Yeah, so that's basically, you know, you take the zero, we added a suit account, we divide by something, right? So basically, so we'll divide by the base frequency, right? So imagine, let's imagine a genome where the bases are pretty equally, there's 25% frequency for everything. You know, you get a fraction here and you take the log of that fraction and you get a negative number, right? So whether this is positive or negative tells you whether the, you know, whether you have something that's greater or less than 1 in log space. All right, so that's, so this is for the case, and this might make it, this next slide might make it a little clearer, where you have one sequence, you know, it's exactly the same length of the motif and you're going to test that. And I'll show you how to do it in a second, yes. Yeah. Well, that's it, you know, it might have made a simpler example to do it with 10 base pairs here as well. So, you know, sometimes you try to expand things on either side so that you make sure you get every informative base of the motif. But it's not a really important detail, but good eye. That is an interesting question that I think we will address a little more later in the lecture. And if it's not answered for you, we can come back to it. All right, other questions on this? Okay, so we had a, we had there a sequence that was the exact same length as the motif. Let's look at something more realistic where we are going to have a big long sequence and we want to scan it for this motif instead. All right, basically what you can do is you can do a sliding window approach. So if our motif, here's a SP1 motif from the Jasper database. All right, so this is 10 columns wide. All right, and what we do is we take a 10 base pair wide window and go along the sequence and we score each 10 base pair window just like I showed you for the 5 base pair motif in the previous column. So you can see here, for example, this third column has a really big value for G. And you can see here that it's a really high score, 2.1222 if you get a G in the third column, similar if you get it in the fourth column and so on, and you can sum up all of this for this little window and you get 13.4. So that's your absolute score for this. That's not, you don't want to necessarily consider that in a vacuum for this particular motif because, again, a lot of this is about how surprised you are to see certain things. So different motifs will have different possible maximum and minimum scores. And so, for example, here the maximum score would be from something like G, G, G, G, C, G, G, G, G, T. What you want to do is you want to see how does this absolute raw score fit in with the max score, which here would be 15.2 for the best possible sequence and the minimum score, which would be minus 10.3 for the worst possible sequence of motifs. So just subtract the absolute score from the minimum score, divide it by the range, and you get some fractions. So you get a relative score of .93 here and the relative score will vary between 0 and 1.0, or 0 and 100 percent. Same thing. And then, once you have this relative score, you can compare it against every other relative score that you calculate for every 10 base pair window in the genome. So you can plot what is essentially a histogram of all of those scores. And you'll find that the mode is about .2 or something. And then your score right here, this score of .93, it doesn't really correspond to the axis right here, it starts looking pretty good. And then you can even turn that into an empirical p-value just by looking at how likely any point in the genome is to be better than this relative score than the worst that. Yes? The lesson starts when you take every 10 base pairs. Yeah. You do a sliding window approach. So you start here and then you move over one and then do this 10 base pairs. And then you move over one and do this 10 base pairs. Well, I don't know whether, I mean, you know, essentially, you know, it's hard to, it's hard to define exactly what is, you know, by chance and what isn't. You know, essentially this is just looking at, you know, what your score is, right? And essentially, you know, how often do you find a score that is that higher or higher? We don't actually know for this, you know, whether these are real binding sites or not. We just know where it fits in the score distribution. And that's what the p-value is based entirely on that. Okay. So, you know, things like this SP1 matrix, I mentioned it's from the Jasper database. All right, Jasper is a database of transcription factor binding profiles, sequence logos, motif files that are equivalent to them and the references. So it's all curated by humans from the literature. So people, you know, include the reference of, you know, where this motif came from. There are a number of these motif databases. So look in the iRegulamon paper. So the paper describing the tool we're going to do in the lab, there's a big table of different databases for it. I would recommend in general using this one if you need to choose because it's very well curated. Some of the other ones are, you know, in my estimation, not as well curated or kind of defined in more dubious ways. So it's what I usually, people in my lab usually use in our own research and so on. So how well do these models, so, you know, the question was, you know, is this a score that's good by chance? How well do they actually perform in reality? All right, so in some sense, pretty good. So, you know, there's a paper in 1997 where they did some in vitro assays to see whether transcription factor binding sites identified with position matrix models or position specific scoring matrix models. Those things are synonymous again. You know, performed and found that 96% of the predicted sites showed up as true H&F1 binding sites in their biochemical assay. All right, and Gary Stormo and Fields found that the best position of main matrices produce scores that are highly correlated with binding energy between the transcription factor and a piece of DNA. All right, so that's good. So at some approximation, the position matrix is a good bio-physical model. To some approximation, it is not. So the most obvious problem is that you find lots and lots of things that look like good binding sites. All right, so maybe, you know, maybe you take such a p-value as your threshold for this of, say, point 1 over, you know, say 1 over 500, like they have here. So that would be a p-value of what? Point 002? Excuse me. Right, so that seems pretty low. But that means that you're going to have a binding site every 500 base pairs on average. And the genome is big because there are 3 billion base pairs in the human genome. Other genomes, e-credit genomes are often similar sizes. This is not really that useful to people necessarily. And also, one might question whether it actually describes what's happening in reality, right? If you look at that chip-seq data, it does not necessarily show binding every 500 base pairs. You can look at, say, this is something that was done a while ago, human alpha-actin gene. So here's the gene model. Here are the exons, red boxes. Here are all the transcription factor binding sites that are predicted from, say, several hundred transcription factor binding site models, several hundred motifs. Binding sites all over the place. To some extent, this might represent biological reality. So they're certainly going to be transcription factor binding sites all over the place. So they're going to be as much over the place as you see here in this particular diagram. I would say no, based on what I've seen from chip-seq data. And even if it were, it's not something, if it just binds all over the place like this, it does not give you any sort of useful information. So then we arrive at something that Wyeth Wasserman once termed the futility conjecture, which is that these predictions are almost always wrong. If you just go based on sequence, you will get a bunch of predictions and your true positive rate will be really good. So you'll get almost all of the true predictions and you'll also get a ridiculous number of false positive predictions. So I remember when I was in grad school, I read a paper that he wrote on this and I said, this is really cool. I think I want to move into this field, which shows you how good my judgment is. But I think I was kind of lucky and that I moved into it at that point, but we can talk about that during the break. So what's more, there's a bigger problem. You might say, okay, your model gives you a bunch of false positives, the stringency of the model. So you get too many things when you have a cutoff of 0.02, set the cutoff to 10 to the minus 6. Cut the cutoff to 10 to the minus 20. Do your p-value cutoff there. Unfortunately, that doesn't help because really it is possible for the transcription factor to bind and in vitro it will bind for anything above some certain threshold. So your 0.002 threshold might be good enough in vitro, which is what this model is based on. But as I discussed in the introduction, there's a lot of other stuff going on. You don't just have one protein and one piece of naked DNA. The DNA is in chromatin. The DNA is wrapped around nucleosomes. The nucleosomes have histone modifications that interfere with things. The DNA methylation interferes with things. 3D GM organization interferes with things. There's cooperation between different transcription factors. Sometimes if you get two of them together, they'll bind to a sequence and otherwise they won't have the energy to do it. There's competition between transcription factors. So there's all sorts of other things that are going on and this is not enough. So I will say that I don't want to disclaim this too much. It's still a useful model if you can kind of set the stage somehow and know that this is a region where transcription factor binding is likely to occur, this model will work. But you can't just use it by itself on a string of in vivo DNA. I'll sum up this second part of the lecture. So first, position-specific squaring matrices or position-weight matrices accurately reflect in vitro binding properties of DNA binding proteins. But in vivo, there are a lot of other things going on that we have to model. So back to our couple of bacterial friends here. This works a lot better in bacteria because a lot of those things that we talked about going on in eukaryotes don't happen in bacteria so it works a lot better. Any questions on the second part of the lecture? Yes? You talked about statistical disturbances. Well, so one thing about this model, there are a number of different ways to kind of represent the model and kind of represent an information theory, information theoretic kind of way. There's a biophysical model that is essentially equivalent to this. So there are a lot of different sort of frameworks, information theory, biophysics, statistics that you can use to understand the model, but essentially a basic model is equivalent in every case. So there's not really anything you can tweak there just with this sort of model. But you can add in other things. No. It's because there are a lot of entities within... This is a model of the interaction between one entity, protein and another entity, a DNA double helix. But in the cell, there are a lot more entities that you have to contend with. 1,400 other transcription factors that are interacting with your transcription factor may be competing with it, may be cooperating with it. The fact that the DNA double helix is not just something that looks like this. It will curve around. There are actually several different axes in which it can change. It can be tighter in some places. It can twist this direction or twist this direction or kind of twist this way or this way and those things can all be affected. There are other transcription factors upstream or downstream. So you could have a much more complex biophysical model and people are working on that and it's kind of work that's not done yet. I'm not sure it will be enough. I think the fact of the matter is there are thousands of entities within the cell. So if you really want a good model of this, you will need something that's a lot more complex than just a two-entity model. There are other questions on this part. So now that I've told you how you can construct a model of transcription factor binding sites and how in practice that doesn't work, I'm going to tell you how to use it further, this model that I just told you didn't work. But I think you might say, why are you doing this if this model doesn't work? So here I'll reiterate the point that this model, in the right conditions, it works really well. It's just you need to have some sort of prior knowledge about a region, some prior knowledge where you know that that region might be important for transcription in order for this to work. So when I showed you how we make position frequency matrices and then from those position weight matrices, we had a really nice set of input sequences. They were all 10 base pair sequences and they all aligned perfectly. In reality, no matter what method you're going to use to generate your motifs, it's never going to be that easy. So in general, you aren't trying to discover motifs across the full width of your input. You're going to have some longer input, whether it came from chip-seq data or protein binding microarrays or C-lapse. And you will have to discover the motifs within those sequences. So here is an example where I have three sequences. There are about 750 base pairs long. And I know for some how I know that they're all regulated by the same transcription factor. It all binds to all of these sequences, but I don't know exactly where within the 800. So our problem, the task here is given these sequences, you want to be able to find the sub-sequences, these short little red sequences that give you the best motif. So we can describe this in a little more detail. So given a sequence or a family of sequences, you want to find a number of motifs. You want to find the width of the motif. That could be preset, or maybe not. You want to find where the motifs occur. So why is this difficult? So the input sequences are long. If you do some nice in vitro assay, you can get it down to something fairly short, but if you're doing something like chip seek, you're going to have lots and lots of long sequences to look at. And also these motifs, as mentioned before, they're degenerate. So it's not as easy as just looking for 10-symbol sequence like you would if you were searching through a Word document or on your computer, because things are not exact. And then you have to figure out how much inexactness are you going to put up with in terms of what you'll actually describe as a motif. And some columns of the motif, it might be very similar, and some it will be very unsimilar. So here, in other words, is an example of how we might do this. So let's say we know that a set of genes are co-regulated. And we are given a set of promoters from those genes. So we've already identified these genes from, say, RNA-seq or microarray or something. We know they're regulated together. We don't have any chip seek data, though. And then we decide to look in their core promoters so we just use Galaxy and take 2,000 base pairs upstream of each of these genes. And we think that there is some transcription factor that is responsible for causing this co-regulation. And it binds to some position within these sequences, but we don't know where. And here's one more thing you have to worry about is we look at these things in these sequences, ACGT, ACGT, right? But to the machinery within the cell, they don't see it as a one-dimensional thing. They can see both strands of the DNA, and it looks the same to that. So we have to find things on either DNA strand. So here are the actual examples of the transcription factor binding site for this transcription factor. So you can see here are some examples, A, A, G, A on this strand. Here you find some examples, T, T, C, A, C, T, C, A, G, T. And over here, that's essentially the answer is those locations. But remember, we don't have that. We just have these promoters, right? So we want to be able to discover a position-specific scoring matrix, but we don't have that. We don't have the locations. We just have the big sequences, and we have to discover the sites. And once we've discovered the sites, it's relatively easy to add them up to create a position-frequency matrix like before. So any questions about this problem before I give you the answer? Oh, I see you like spoilers, and you've already read the answer and the stuff they've provided so neatly for you. So we're going to use something that we often do in machine learning, which is called an alternating approach. So the search space is so huge, we cannot possibly try every combination. So there are 1,000 starting positions in one promoter. We wanted to combine that with the next one, and the next one would be multiplying things times 1,000 each time, and pretty soon you'd run out of atoms in the universe to which you do this. So instead, we do what is called an alternating approach, and we'll use it here, and it's often used in a lot of other machine learning problems where you're trying to estimate parameters. So what we do is we start with some initial weight matrix, and we do that simply by guessing. We just randomly come up with an initial weight matrix. You might do it uniformly, or you might have some sort of prior knowledge that helps you find one that might kind of be good. And then we'll use that weight matrix to predict instances within the input sequence. So that scoring approach, I showed you before how you can scan a sequence and look for the score of a weight matrix versus a sequence. We're going to do that against all of these sequences using the weight matrix that we generated randomly. So we'll do that. We'll look for the best score, the best scoring position, and we'll stop there. And we'll do that for all the other sequences. So then we have these different sequences that have been picked by our method, and then we will calculate a new weight matrix from those, and then we'll repeat this process over and over again. I'm going to do a diagram to show how this works in detail. Do you want to see that first? Do you have a question now? No, I just have a question. Instead of use one, chop it up, and use those as the initial ones. So, I mean, you know, the question is how do you come up with this initial guess, right? So this is just kind of the concept in general. In this case, what we will do to come up with the initial guess is we will guess then we will pick random sub-sequences from within this set of sequences. So we don't come up with a totally random guess. Although, due to the way these things work, even if we did, it probably would be okay. Because, you know, then if we came with something totally random, then the first time we ran through this, we would probably get kind of a random selection of sequences, and then we would be back to this. So we kind of save some time by guessing from within the sequence. So, yes. How do you figure out the size of the sequence? The size? Yeah, so that's a good question, but essentially that needs to be fixed for this. So you might try this with a range of different sizes. You might try this whole process, you know, for 8, 10, 12, 14, and then see which sort of thing gets you the best score overall. Usually, the transaction under binding sites, do they have a specified band of lengths or it would be ending between? Well, you know, I know the shortest example I've seen is essentially 2, and the longest example is probably like, you know, 20 or 30. I mean, you know, one thing to remember is that, you know, DNA is really big. So if you look at a structure of DNA with a bound transcription factor, right, because I often, you know, we see all these cartoons, like the cartoons I showed you before that shows, like, kind of the transcription factor interacting with the DNA, right? If you actually show the double helix, like, you'll see that, like, it can often dwarf the transcription factors around it. So there's only so much of the DNA that one transcription factor can actually bind. So you're unlikely to get really long motifs simply because it just can't bind to that much. But the smallest error seen is 2, like, but that's something we don't really refer to a 2-base pair binder as a sequence-specific transcription factor. You know, it's more like dinucleotide-specific, but that's, you know, it doesn't really have a lot of information there because, you know, how often are you going to see any particular dinucleotide a lot? Yes? Does it mean if the sequence, if you get the sequence-specific antibody factor, then the sequence somehow will be expected to be having a transcription factor and you know, I would, I would say, go on anything that... Yeah. So this is an approach that you could use for, for, you know, any sort, like you could use this for chip-chip-seq data. The particular example I'm giving here, you know, we're just taking RNA-seq data, essentially, but you could use the same, and in fact, people do, use the same approach for chip-seq data. Now, this is something like Meme, okay, and usually for chip-seq data they'll use something called Meme Chip, which is kind of optimized for chip-seq data in particular ways, right? So there are a lot of things I don't have time to go into here, but you can say read the Meme Chip paper. Other information you can take advantage of if you're doing chip. For one thing, you'll actually know which transcription factor is regulating, which you don't, you don't know here. It's an unknown transcription factor. So here, so back to our example, we have randomly picked an instance from each one of these input sequences, right? It's sequence one through five, big s one through five, and we'll call the sequence that we randomly picked little s one, little s two, little s three, and here are the actual sequences here. They don't look very similar to each other. The particular alternating approach we are going to use here is called the Gibbs Sampler, okay? So I describe the alternating approach in sort of broad terms. Here, we will take all of those small s i's, so the subsequences. We will throw away one of them, and then we'll just calculate the weight matrix on the ones that are left, right? So just like we calculated PWMs, PFMs to PWMs in the first place, right? And then we'll have a new motif, and we can use that motif to score all of the sequences, right? And then we will repeat this process, including the throwing out one. So you might rotate through the ones you throw out. You might pick randomly the one you're going to throw out. You might throw away a couple instead of one, for example. The point of that is to keep you from kind of getting locked into something that, you know, one sequence might be kind of throwing off this approach a little. It allows you to have a little bit of kind of fuzziness in something that is essentially looking for, is essentially looking for a local maximum, right? So none of these alternating methods is never going to get you a global maximum. It's not going to get you the single best possible motif, but it might get you something that is within a small margin from the single best possible motif. And the more sort of noise you add during this process to allow it to kind of get off some sort of pathological local minimum, the better chance you'd have of that. So that's the point of that. So I'll show you how this works here. Like, these are similar to these sequences that we found one second. And in this example here, right? You find these subsequences. You calculate position frequency matrix here and convert to position-specific squaring. Just one second, squaring matrix. All right. Then you can go back. You can do the sort of relative scoring thing I talked about before much earlier on and score each one of these sequences, right? And now you find that the thing you've thrown out, there's actually a good score for it here and you add this into your set of sequences and then you throw out the next one and go through all of this. Yes? Well, it says, if it was like, when can we get an instance of SI from these two different sequences? Yeah. Do you mean the two different gates for each of you? Well, they're different sequences. So, yeah, I may choose a different position if that's what you mean. But these are all different sequences. So, yeah, they're different for each position. Any other questions about the Gibbs sampler? How you do this? So, there are other methods to do this. There's another method called expectation maximization or EM that you may have heard in other machine learning contexts of people. It's also an alternating approach. I'm not going to go into the details of it. The basic idea is the same. You guess, you see how well you're, you know, what the best possible observation that fits your model is and you sort of refine the model step by step. So, you do all this on your sequences and then you get a newly discovered DNA motif. How do you know whether that's something people have discovered before? There's a tool called TomTom, which is very useful for this. TomTom is kind of like a sequence aligner, like Blast or Bowtie or something. But instead of working on sequences, it works on motifs. All right, so you feed in a positional weight matrix and you tell TomTom you want to scan against JASP or against TransFac or something, right? And it will take your query motif and it will give you a report describing how well it matches against every motif in the database. I mean, it won't give you a report for every motif. It will find the best matches and it will give you make it the, you know, top 20 or 30 or whatever, right? And usually it will probably, if it does match well, it will probably be one of the first few and you can interpret all of this stuff there. All right, and we won't talk about TomTom more, but TomTom is part of the MemeSuite website. So if you have motifs and you want to solve this problem, then the MemeSuite website is a good place to go. All right, so that's the end of this part of the lecture about motif discovery. Any other questions on this? Okay, so let's move on. You know, we have a discovered motif. Let's get back to the question of how do we know, you know, that a particular motif we're interested in or one of many motifs from a database is interesting for our particular set of genes. So most often set that people use is a set of co-regulated genes. They'll have something like a RNA-seq experiment or a microarray and they're able to identify co-regulated, co-expressed genes. And the co-expression is really observation. Co-regulation is sort of a hypothesis that you're testing. All right, so you have some set of genes that you know are co-expressed, some negative controls that don't necessarily show a co-expression pattern. All right, and you want to test against some motif, right? Since we use computers, we can test against all of them motifs. And we have to do multiple testing correction. But what do we do? So, you know, we, you know, this might be an ideal model of co-regulation, right, where you have this motif and let's say you find examples for this motif here, twice in front of this gene, twice in front of this gene, and zero times near negative controls. Boom, great, great hypothesis there. You know, sort of validated electronically. Now you should go back to the lab and kind of, you know, validate experimentally what you've learned here. How do you actually find things like this in practice? So there are a series of methods that are in this transcription factor binding site enrichment finding methods, and very similar to the gene set enrichment, the go-term enrichment you guys did earlier with G-profiler, right, this is looking for over-representation, not of some go-term, but of motifs that score well for some transcription factor, right? So there are two ways you can do this. So one is just like the gene ontology analysis you guys have done. You can associate each gene with whether the transcription factor binding site, whether there's a good transcription factor binding site score for a motif that you're interested in on that gene or not. Okay, so in any case it will be, you know, the gene will be, will have a good TF binding site upstream or where it won't, and you can do very similar sorts of, just as tool test as you did with G-profiler, just with this information instead of go-term enrichment, and look at how often do you see this in foreground versus background. But it's not exactly like go-term enrichment because you can also have multiple transcription factor binding sites per gene, and that can often be a really good indication that a transcription factor is actually biologically relevant at a particular position, even if you don't have extra stuff like, you know, DNA-seq data or, you know, attack-seq data telling you that some region is actually open chromatin. You know, the fact that you can find a really big cluster of the same transcription factor binding site can be really indicative that it matters in a certain place. You know, so you might find in your foreground cases it occurs a lot more often than in the background cases. So these are essentially two different ways of looking for enrichment. All right, so there are two different statistical tests that are used. The binomial test will look at the number of occurrences, and a Fisher's exact test will look at the number of genes. So if you use a tool like opossum, and you can, I think there are links on the wiki, but you can search for it certainly. Opossum is a website in which case you can put in a set of co-expressed genes to just put in the symbols, like BRCA1 and GOTA2 and CHEK2, and then it will do sequence retrieval from ensemble. It will look for regions of phylogenetic footprints, so regions where the gene, the sequence is conserved between multiple species. That's the main thing opossum does, which you may or may not want to do. Often people are not these days. Detect transcription factor binding sites, tell you how significant they are, and then maybe you have a hypothesis for transcription factor that matters. Yes? Does it publish the paper? No, but we should offer a workshop on doing reproducible science with notebooks and stuff. No, so Wyeth Wasserman and his group developed opossum. So let's look at an example of how you use opossum. Opossum is just one tool. There are many other tools that work in a similar way. We'll talk about irregulant, which has some similarities here. So let's say we have a set of genes that we know are muscle specific and a set of genes that we know are liver specific. So we feed in a bunch of gene IDs. These are not the gene IDs. So we fed in a bunch of different gene IDs and it automatically goes and finds the sequence that's upstream of that gene and then looks for transcription factor binding sites and says, okay, so this SRF transcription factor occurs really often upstream. So does MF2 and MIB and MYTH and so on. Whereas these liver specific genes, you'll often find say HNF, which is actually a known hepatocellular transcription factor. So it's known to be important for liver. So this is something that previous opossum paper and the red arrows indicate that they've experimentally validated that these are important. And you can see the different kinds of scores, whether you do it at the binomial test of numbers of genes, sorry, numbers of occurrences or the Fisher's test, which is the number of genes. Did you have a question? Yes. Your input was... Yeah. And for those 23 genes, what they have in common are those four transcription factor binding sites? Well, these 10, like the arrow, so the arrow is something that's kind of added later in the PowerPoint. Because we have, you know, like if you search the literature for SRF and muscle, you're going to find a lot of stuff. That's what the arrow means. And no, this does not do that for you. So that part you'll have to do before you write your paper. So let's talk whatever the top five or six are significant. Well, I mean, you know, you can go down as far as you want, kind of. So we think of these things, it's hard to kind of set a strict line for significance here, right? These all have, you know, decent... These are the p-values, right? They're all less than 0.05. I don't know exactly what p-value each of these z-scores correspond to, but I assume they're all significant, right? But it's usually best to look at what's at the top of the rank. Because, you know, at some point, if you set any sort of decent significance threshold, the things at the bottom of the list are probably going to be less useful. So start at the top and go down. So it's not saying that every single one of them has it, but it's saying that there's a statistical enrichment. So if you actually used a possum, I think it would actually tell you how many genes have a good pining site. But, you know, that, again, is kind of a threshold matter. Because depending on how low you set your threshold for a pining site, you can say any gene has any pining site for things. So it all kind of combines it. Yeah, but I think, you know, with a lot of these genome-wide analysis tools, it's a much better approach to kind of look at the things that are ranked best and kind of go down the list rather than say, this is my significance cut-off. Because, you know, you'll often find a lot of stuff that maybe meets your significance cut-off, but is way less important than the 20 things above it on the list. Yes, you have a question. So I think that the 23... No, unfortunately, I can't explain that. This slide was made a long time ago. I'm not actually sure... It seems surprising that this would actually refer to number of genes. And I don't know whether those numbers at the top refer to the genes in your input set and the transcription factor matrices that are used. But if you want to use this, you can look at... You can go to the opossum website and they have tools that you can paste in your gene list. And I'll do a lot of this stuff for you. And they will also explain things like that in their documentation. So... One second. Okay. Well, we are running a little short on time, so that we preserve our break through this next little part. You guys ask a lot of questions. Every other time I've done this in the several years I've been doing this, we've finished about 30 minutes earlier. So I've been thinking I need to add a lot of material to the lecture, which apparently I do not. I just need better students like you guys. So... Yeah, so there are some transcription factors that are basically indistinguishable for each other. For example, ETS. There are several different ETS transcription factors. They all have these six or seven base pair motifs that are very similar to each other. How do you find... If you find yourself in this problem and you're looking at ETS transcription factors or, say, got a transcription factor or another good example of this, there is software that you can use to prioritize this, and it's called Topgene, which I won't go to any further. So let me go into the final step here, which is how do you bring all of this together? So one thing that you might want to do is, as I've said it from the beginning, you want to narrow down the regions of the genome that you want to look at. Instead of just saying, I'm going to take every transcription start site and look 5,000 base pairs upstream and pick your promoters in a really dumb way like that, you might instead want to say, where is there evidence that there is active gene regulatory activity? We have all of this encode data. Why don't we use it? My lab has developed a tool called Segway that classifies every region of the genome, and we have examples for human and mouse, and you can run your favorite species on it as well, into different sort of gene regulatory categories. So things like transcription start site, enhancer, you know, three prime end of gene, distal regulatory element, stuff like that. So you can use evidence from things like modification data and DNA hyper sensitivity data and feed it into Segway will produce an integrative analysis of all of these different data sets and give you one answer for every base pair in the genome, which often makes things a lot easier. So if you use the UCSC genome browser, you can easily load in Segway and some versions built in, or you can load in other versions, or at Segway.hoffmanlab.org. There's another interesting tool called Great. If you have a set of gene regions, let's just wait till the end because we're going to run out of time otherwise. So I'm just going to talk about these briefly and you can look into them more later if you're interested. You know, Great kind of helps you solve the problem of what if you have a set of regions that you know are interesting for some reason, but it's not. They aren't in the promoters of genes, right? So if you have a gene list, it's great. You can use G profile or Funk associate or even David if you must. And what if it's in the middle of energetic regions? So Great is a tool that will solve this for energetic regions and often it will do things by deciding that you should look at the upstream gene or the downstream gene and have some way of doing this to give you a go analysis. And there are a couple of runs of Great that are included in your notes and you can look at them there. So one thing that I keep mentioning as a problem that we don't consider enough perhaps is that transcription factors interact with each other, right? Often you'll find space transcription factor binding sites so you'll find, you know, transcription factor X here and then transcription factor Y just downstream and there's a tool called Spamo. So for spaced motif that you can use to analyze those sorts of things if you're in that particular space. There are a lot of different challenges in making better transcription factor binding site models. One is, as I mentioned, we have data on hundreds of transcription factors. There are, you know, more than a thousand, maybe 2,000, like we need to get data on all of the transcription factors if we're going to have a hope of understanding all of this. I'm hopeful that is coming very soon. People told me it was coming very soon like four years ago. So I'm hoping that means that all of these different groups that told me that are going to publish a paper like this year and then I will have tons of fun data to play with but who knows? They don't tell me what's going on. You know, we need to understand how genetic variation affects transcription factor binding sites, right? So if you get some, you know, you have a patient who has some sort of cancer and some just a regulatory region is affected, we want to be able to predict what sort of change in the network of transcription factor binding sites what will change in the gene regulatory network. We need a lot of work there. Another thing is these position mate matrices, they're kind of a crude model, right? And there's a lot of information like, you know, it assumes that each of the columns of the position matrix are independent from each other. When in reality they don't, it doesn't work like that, right? Like often you're much more likely to have, you know, CT then you are to have CA and it's not the same thing as multiplying the probability of C times the probability of A. And there are various models to consider that and none have really kind of attained widespread use yet but I think that's more likely to happen in the next two years or so. There's a lot of complexity to understand and that's what a lot of us are working on understanding. So if I'm to sum up here and just give you a few take-um lessons, you know, one, we have methods that do quite well at predicting transcription factor binding sites in vitro just based on the sequence and motif we've identified but you always have to remember the futility conjecture which is that sequence and motif is never going to be enough. It's never going to be enough alone and you need to combine it with other methods that will allow you to determine where real transcription factor binding might occur. And we'll talk a little bit more about specific methods in the lab. Let me pause here. Who has questions? Maybe we should take one or two questions and then if you're, you know, then we can kind of, I can chat with you guys up here. Yes, question? Yeah, you can do stuff like that. You know, since most, I don't know how much people have used safe positioning tricks models for microRNA, so that's what you're asking about. I mean, I think people usually try something a little simpler because they assume that the microRNA is, you know, mainly going to be base pairing with the other RNA. Plus or minus some degree of degeneracy. But, you know, so that's an interesting area. You know, among other areas of translational control, there are groups of people, a sort of smaller community of people that is working on what some have termed the epitranscriptome, right? So trying to analyze, you know, all the stuff I talked about here was DNA binding proteins. Well, there are people who study RNA binding proteins and some of those have specific motifs. And you can certainly use these same methods with RNA and use RepSeq instead of ChipSeq and so on. But, yeah, it's an interesting area for the future. I think I'll stick to transcription myself because it's hard enough. Other, yes, question? You were discussing like, co-express genes can have similar... Co-express genes can be because of having the similar sort of transcription factors. Then they might be entirely... can have entirely different functions. Yeah, that's true. So, I mean, you know, you have co-express genes. There can be a lot of different reasons for co-expression. And just one is the transcription, you know, having the same transcription factor binding them off, but they might not. You know, co-express... simple co-expression doesn't necessarily mean that there's a shared function. On the other hand, if you're, say, looking at a number of cell types, if you're, say, looking at more than two conditions, maybe you're looking at, you know, liver cells, muscle cells, neural cells, immune cells, right? And you find a set of genes that are only expressed in the liver. Then you have a much stronger case for some sort of liver function being involved there. So, I think, you know, a lot of these things, you know, it's not like there's some level of evidence beyond which you can say, this is, you know, co-functional and this is, you know, these share function and these don't, right? It's more of an accumulation of evidence and that, you know, once you get a certain amount, you say, yeah, this is a really solid conclusion and otherwise you don't... Is there one more? Did you still have a question or did I answer it? I just answered it. Okay. All right. Yeah. So, Segway is there. It's free. You can use the data sets around it yourself. Okay. So, do not cut into your coffee break and or networking session.