So, welcome to the final module in this series. Congratulations on making it all the way. My name is Michael Hoffman. I am a principal investigator at the Princess Margaret Cancer Centre in Toronto, and I'm an assistant professor in medical biophysics and computer science. I'm going to talk to you today about gene regulation, and specifically about understanding transcription factor binding.

I understand that as you've been going through this workshop, people have not been shy about asking questions. Please continue to ask questions if there's anything you would like clarified in the middle of this lecture. I also have TodaysMeet open over here in case anyone has a question they'd rather write down, and I can get to that then.

So, we're going to talk about a number of things today. We are going to learn how transcription factors bind to sequence in eukaryotic genomes, and understand the challenges in predicting this accurately using computational tools. We're going to learn how to identify sites for known transcription factors, and we're going to learn how to discover transcription factor binding motifs in regions like ChIP-seq peaks or putative promoters using Galaxy and MEME-ChIP. That's something we'll do during the lab.

There are several parts to this lecture. I'll talk about transcription, and then we will talk about how to predict transcription factor binding sites using existing profiles. We'll talk about detecting which novel motifs are present in regulatory regions. By novel, I mean ones that you don't already have in your database. We'll talk about that a little more later. And we'll talk about interrogating sets of co-expressed genes or ChIP-seq regions to identify mediating transcription factors.

So, let's go to part one here: transcription in eukaryotic cells. We can think of this with a very, very simple model.
So, there's DNA, and there is some sort of binding site in the DNA that a particular transcription factor recognizes. This transcription factor binds to that part of the DNA, and then the transcription factor recruits RNA polymerase II. RNA polymerase is what will produce the RNA for a gene that is slightly downstream of where the transcription factor is. This is a fairly simple model that informs much of what we do in computational prediction of transcription factor binding.

In reality, it is a little bit more complex, because you usually have a number of different transcription factors. So, here's your transcription start site for a gene. There will be a number of transcription factor binding sites for each transcription start site. The transcription start site itself may not really be one base pair. People think of transcription start sites as single positions, but often in reality transcription starts will vary over hundreds of base pairs, and transcription might start at any one of those positions.

And then, not only do we have this regulatory region that's proximal to the TSS, with a number of transcription factor binding sites, it's also likely that there are distal regions that affect the regulation of this particular gene. And then, of course, there's a whole downstream set of processes that can affect gene regulation, such as mRNA decay and all sorts of other post-transcriptional regulation, like translational regulation, that we aren't even going to talk about. Just the process of pure transcriptional regulation is quite complex.

And the fact that in reality none of this happens on a 1D string like this, that things really happen in the three-dimensional context of the nucleus, makes it even more complex. See here, these distal regions can affect transcription factors here, or co-activators. These terms are often used loosely. People are often not precise when they talk about the difference between what's a co-activator and what's a transcription factor.
You know, basically any sort of protein that is involved in the process of initiating transcription, people will often refer to as a transcription factor. So something all the way down here, which you can think of as being maybe tens or hundreds of thousands of base pairs away, can affect the binding of transcription factors to the core region, the core promoter here. So a lot of really distant things can affect whether RNA polymerase actually binds to this region and starts transcribing RNA.

The other thing that you have to remember is that DNA is compacted into chromatin. Here you can see a representation of a number of nucleosomes. The pattern of where the nucleosomes are, and which regions have open, loosely packed nucleosomes versus closely packed nucleosomes that prevent transcription factors from gaining access to the region, can also affect things. So in reality things can end up being very, very complex.

To understand this process, there are a lot of different sorts of data that we can look at that measure various parts of it and the outputs of various parts of it. You're probably all familiar with RNA-seq, which will give you an indication of whether a gene is transcribed and how much it's transcribed. But there are also other assays that will give you a more direct indication of how the process of transcription is going.

One of these is called CAGE, Cap Analysis of Gene Expression. CAGE will try to find 5′ mRNA caps and will tell you exactly where the transcription start site is for any particular transcript. If you want to find the actual start site, this is a little more exact than RNA-seq, where once you sequence the RNA, you might get RNA from the beginning of the gene, or from the middle, or from the end, and the stuff you get from the middle might have already been chewed up at its 5′ end.
So if you want to have more confidence that you're actually looking at the 5′ end of the gene, then you use something like CAGE.

Another thing that you can look at when you're looking at promoter regions is various epigenetic marks. You've probably heard of these histone modifications, which are abbreviated with somewhat confusing short initialisms like H3K27ac. These marks, these covalent modifications of histone tails, can often be indicative of different kinds of gene regulation activity, whether that's transcription initiation, transcription elongation, transcriptional repression, and so on.

You can also do ChIP directly on the RNA polymerase complex. So you can get a measurement not of where the transcript is, but of where RNA polymerase is and potentially working, which is a subtly different thing. So those are all things that you can use to understand promoters.

Transcription factor binding sites you can look at with ChIP-seq as well. For example, within the ENCODE project, you can find datasets on hundreds of transcription factors in various human cell types, and you can find locations where these transcription factors tend to occur within one cell type.

Finally, these regulatory regions, which are sometimes distal from the gene, you can find using ChIP-seq as well, or RNA-seq. There are epigenetic marks that tend to be present at distal enhancers, like H3K4me1. There are also proteins that tend to be found at enhancers, like p300, which are sometimes called co-activators. You can measure those directly with ChIP-seq. And people have often found that enhancers are actually transcribed into RNA as well, and you can find enhancer RNAs with RNA-seq. So these are all sorts of experiments that you can do to find out where these phenomena are occurring.
There have also been massive projects where people have collected these sorts of data. A lot of this, especially for some of the more common model organisms, like human, mouse, Drosophila melanogaster, and C. elegans, you can find through the ENCODE project, or through the UCSC Genome Browser, which is a place where a lot of the ENCODE project data has gone.

GEO at the NCBI started as a repository for microarray data, really, and now there's quite a lot of regulatory data there in general. It's not just RNA-seq data; there's also ChIP-seq data, and GRO-seq data, and any other sort of something-seq data. One problem with GEO is that it can be difficult to make the kinds of assumptions about individual datasets within GEO that you can make if you get data from something like ENCODE or Roadmap, because the datasets within GEO have all been created by different labs. You don't necessarily know what sort of quality standards people used. So, when possible, I usually like to go with something that was created by a bigger project that was specifically designed for data release to the community, rather than something that was kind of a byproduct of a smaller study.

Yeah. Yeah, I think that's a good point. So, Francis says that's true for anything in GenBank. I think it's even more important for something like ChIP-seq, because there is just so much more that can go wrong when you're doing something like ChIP-seq than when you're just sequencing a gene. If someone deposits something in GenBank and it ends up having phiX174 sequence in the middle of it, it's fairly easy to see that. You can see that things have gone wrong, and it's only one sequence. Whereas if you're looking at a ChIP-seq dataset, they could have used an antibody that completely failed, and the whole huge dataset could be totally useless. So, be very cautious.
But in general, yeah, community resources, I think, are better to use.

The Roadmap Epigenomics Project is something you probably saw in a release of several papers a few months ago. They did more than 100 different human cell types, and a lot of these are actually primary tissue rather than immortalized cell lines. Most of what they did is ChIP-seq on histone marks, some measurements of open chromatin, and some methylation data. They did relatively little of the transcription factor ChIP-seq that we've been talking about from ENCODE.

And finally, ORegAnno, the Open Regulatory Annotation, is a website that has lots of different sorts of data collected for a number of different organisms.

Can people in the front hear that? Okay. I can barely hear it. I think I need my hearing checked. Oh, dear, what happened here? Okay. Well, I already mentioned the overview, so don't worry about that.

So, we'll move on to part two here, which is learning how to predict transcription factor binding sites. Quite simply, it's teaching a computer how to find them. Let's look at a number of different ways to represent a transcription factor binding site.

All right. Here's one way. You might have this sequence, and you have some data. Maybe the data came from ChIP-seq or ChIP-exo, or from some in vitro technique like a protein binding microarray or a SELEX experiment. This is a transcription factor binding site. In most cases, knowing that this one sequence is bound by a transcription factor does not give you license to search the rest of the genome for identical sequences and look for transcription factor binding sites that way, because transcription factor binding is very degenerate. It's just like in the genetic code for translation, where there are a number of positions, usually third positions in a codon, that can change and still result in the same amino acid being added to a peptide at a certain place.
There are a number of positions within any given transcription factor binding site that are degenerate and can change without the transcription factor caring about it very much. So just having a single sequence does not really give you that information.

If you look at a number of binding sites, let's say you did some SELEX experiment and you got out a number of binding sites for a particular transcription factor, you can see intuitively that there seem to be some similarities between these different sites. For example, there always seems to be a T at the fourth and fifth positions. There always seems to be an A in the last position. But some of the other positions change a bit.

Another way of representing this is as a single consensus sequence. You might be familiar with this alphabet. This is called the IUPAC ambiguous DNA alphabet. In addition to having A, C, G and T, it has other letters that represent every possible combination of A, C, G and T. For example, you've probably seen N, which is the representation for any of A, C, G and T. And there are other ones like R, which means a purine, so R means A or G. Y means C or T. S means C or G. There are a variety of different things like that. They're kind of hard to remember initially, but they all have their mnemonics.

That can represent some things, but what it doesn't do is tell you that a particular position within this motif can be A or G but is almost always A. Or that it's A 75% of the time. Or that it's usually A, and if it's actually G, you'll have weaker binding of that particular transcription factor.

So what we do instead is develop a more complex mathematical model called a position frequency matrix. You take all of these sites here, and at every column you count the number of times you see each one of the symbols in the unambiguous DNA alphabet. So you can see here, 14 A's in the first column, three C's, four G's, et cetera.
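The column-counting step just described can be sketched in a few lines of Python. This is only an illustration: the four aligned sites below are made up (they are not the sites on the slide), and the function name `position_frequency_matrix` is a hypothetical helper, not part of any particular library.

```python
# Hypothetical aligned binding sites, all the same width.
# They are invented for illustration; like the lecture example they
# have T at the fourth and fifth positions and A at the last position.
sites = [
    "ACGTTGCA",
    "ATGTTACA",
    "GCATTGTA",
    "ACGTTCCA",
]

def position_frequency_matrix(sites):
    """Count how often each base occurs in each column of the aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    pfm = {base: [0] * width for base in "ACGT"}
    for site in sites:
        for column, base in enumerate(site):
            pfm[base][column] += 1
    return pfm

pfm = position_frequency_matrix(sites)
for base in "ACGT":
    print(base, pfm[base])
```

Each row of the result is one base and each list entry is one column of the motif, so every column sums to the number of sites.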
And you can go on like that. A position frequency matrix is just the simplest way of representing the data, without having any sort of model underlying it. There are a number of different ways you can transform it, and one of them is into this graphical sequence logo representation, which gives you an at-a-glance understanding of which bases are most important for this transcription factor to bind, given the data you have. I mentioned that the fourth and fifth positions are almost always T. You can see that here: there's a T in the fourth position and a T in the fifth position, and they are some of the tallest letters within the sequence logo. So that gives you an indication of how much information you get from a T being present at a particular position.

In practice, people who are generating models of transcription factor binding usually start with a position frequency matrix, but usually transform it in one of a number of ways. Here's one particular transformation, into something called a position-specific scoring matrix, or PSSM. You'll also see similar sorts of transformations into something called a position weight matrix, or PWM. The difference between these is not always well defined, so if you're writing a paper and you're talking about a position weight matrix, it's important to say what you mean by that and what sort of assumptions were built into your model, because people won't necessarily know which of these transformations were done.

Oh, I see, this is kind of a Mac/PC issue, isn't it? This equation got a little messed up.

So there are a variety of different ways of getting the data. SELEX is an in vitro selection procedure that you can use. It used to be the main way by which people determined transcription factor binding preferences in vitro.
The way this works is you have a pool of random oligonucleotides and you have your purified transcription factor, and you use that to essentially pull down a subset of this pool. Then you simply replicate the DNA within this subset and do the selection procedure again, and you do this whole thing six to ten times or so. So it's an in vitro, not natural selection, but artificial selection procedure, and the sequences that are left at the end are the ones that tend to bind the transcription factor pretty well. It introduces some biases and is kind of time consuming to do.

So most of the large-scale experiments people do these days that end up with transcription factor motif models come from a protein binding microarray instead, where you have a chip and you put down oligonucleotides. You put down every possible 8-mer, basically, and then you are able to measure the intensity of the protein-oligo interaction all at once for the whole chip, and you can generate a PFM straight from that.

Okay. So I see these are here. Can I write on this? Okay. So let me just rewrite this equation. I'm not sure if you guys can see that, but you get the idea. Basically, the conversion here is a way of modeling various factors that you should keep track of before you do a genome-wide scan.

The f here is the original frequency. So f_b(i) is for a particular base b at a particular column i. Then you add a factor, this s(N) factor, that weights for the confidence you have in that particular pattern. If you only have one sequence, you aren't going to have a lot of confidence that your pattern is correct. But if you had, say, hundreds of short sequences that you know represent this motif, you're going to have a lot more confidence. So you can add that into this here.
And then you divide by p(b), the background frequency of the particular base you're talking about here, because you want to be able to score things relative to their actual frequency in the genome. If you see C and G a lot more often in some genome than A and T, you need to keep that in mind in your model. In that case, seeing C or G at a particular position will be less surprising, which means that from an information-theoretic point of view it carries less information.

The final thing that you do is take the log of this, and that just makes all of the arithmetic a lot easier, because if you put things on a log scale, then you can essentially multiply them in the original scale just by adding them together. So in logarithmic space, multiplication becomes addition, division becomes subtraction, and when you're working with a lot of probabilities, as these are, you're going to need to do a lot of multiplication and division. You may not realize it, because computers seem like they can multiply any number, but they're actually a lot better at adding and subtracting than they are at multiplication and division. If you convert things to log space, you can make things go much faster, and you can stop worrying about certain sorts of numerical problems, like underflow, that you get otherwise.

So here we can show how you can use a particular PSSM to scan the genome. Here is the PSSM for the SP1 factor. We have the sequence logo representation. When I see those, I think of it as equivalent to a PSSM. Here's the actual PSSM for that sequence logo. At positions three and four, you can see a much bigger number in the G row. In this last position, you can see a much bigger number in the T row, and so on. So this matrix can be used to score every particular set of, in this case, what is this? 10-mers in the genome. So here's one 10-mer.
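The conversion and scoring can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the exact equation from the slide: it uses one common formulation, in which a pseudocount (playing the role of the s(N) confidence term) is added to each count, the result is divided by the background frequency p(b), and the log base 2 is taken. The toy four-column matrix, the uniform background, the pseudocount of 1, and the function names are all illustrative choices.

```python
import math

# Hypothetical 4-column position frequency matrix built from 4 sites.
toy_pfm = {
    "A": [4, 0, 0, 1],
    "C": [0, 0, 0, 1],
    "G": [0, 4, 0, 1],
    "T": [0, 0, 4, 1],
}

def pfm_to_pssm(pfm, background=None, pseudocount=1.0):
    """Turn counts into log2 odds scores relative to background frequencies.

    The pseudocount stands in for the confidence correction: with few
    sites it pulls each column towards the background frequencies.
    """
    if background is None:
        background = {base: 0.25 for base in "ACGT"}  # assumed uniform genome
    width = len(pfm["A"])
    pssm = {base: [0.0] * width for base in "ACGT"}
    for i in range(width):
        n_sites = sum(pfm[base][i] for base in "ACGT")
        for base in "ACGT":
            corrected = (pfm[base][i] + pseudocount * background[base]) / (
                n_sites + pseudocount
            )
            pssm[base][i] = math.log2(corrected / background[base])
    return pssm

def score_kmer(pssm, kmer):
    """Add up the score of the observed base at each position of the k-mer."""
    return sum(pssm[base][i] for i, base in enumerate(kmer))

pssm = pfm_to_pssm(toy_pfm)
print(score_kmer(pssm, "AGTA"))  # matches the strongly preferred bases
print(score_kmer(pssm, "CGTA"))  # mismatch in the first column scores lower
```

Because the scores are logarithms, adding them across positions corresponds to multiplying the underlying probability ratios. A column with no preference, like the last one here, contributes a score of zero for every base.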
You can just add up all of the scores for each position: for each base in the genome sequence, you take the matching row and column combination within the PSSM. The darker gray ones are what you're actually adding up here. In the end, if you add up all these numbers, you get 13.4.

There are a number of different ways of doing this, and I probably don't need to go into them. But in the end, you can convert something like this to a p-value. So you can get a p-value for any potential binding event between a transcription factor, as modeled by a PSSM, and any particular sequence. You do that by comparing against the sorts of scores you would expect if you did this across the whole genome.

The question was asked earlier about where you get these initial data. Do you need to do a SELEX experiment yourself? Thankfully, there are a number of these in the literature. And there's also this great website called JASPAR, jaspar.genereg.net. You can look at JASPAR, and it has a number of different transcription factor motif profiles that you can just download. They've got lots for vertebrates, for Drosophila, they have a plant section, anything that they can find in the literature, and they keep adding to JASPAR. So this is very useful for doing these sorts of experiments. Any questions so far?

So now we have a model for transcription factor binding. There's software, and we'll talk a little bit later about the software that you can use to do this sort of analysis yourself.

Okay. How well does this work in practice? Well, there have been a number of experiments people have done, and these are just a couple of examples. In one experiment, a group did an in vitro binding test, and they found that 96% of the sites they predicted computationally were actually bound according to their experiment in this in vitro test, right?
Which, as someone who develops predictive models, I think 96% accuracy is really, really good. And Gary Stormo found in some biochemical studies that position weight matrices matched really well to a biophysical interpretation of how transcription factors and sequence interact. You can see that as the PSSM score gets higher, the binding energy improves as well. So that was really reassuring. It essentially forms a biophysical basis for working with these models. Okay. So that's the good side.

The bad side is this, and again, this is just one of a number of examples that find something like it. If you develop a MyoD profile with in vitro data, you will find that it predicts a binding site once every 500 base pairs. That means you'll find about 20 sites per gene.

And the ugly: if you look at the human alpha-actin gene and you use a number of different transcription factor binding site profiles, you'll find good predictions all over the gene. Okay. And you can repeat these sorts of analyses genome-wide. Literally the entire genome is littered with things that look like good transcription factor binding sites according to this model.

This gives rise to what Wyeth Wasserman called the futility conjecture, which is that transcription factor binding site predictions are almost always wrong. This model works very well in vitro, but it can work very poorly in an actual chromatin context.

This goes back to the introduction to transcription that I talked about earlier. You can have a very simple model where you just have transcription factors interacting with sequence, which actually describes very well what happens in vitro, and so the model works very well there and can make very good predictions.
But once you throw in all of that other stuff, lots and lots of sites, the possibility of long-range three-dimensional distal interactions, chromatin structure, it makes this a lot harder.

What's worse is that adding stringency doesn't necessarily help. You can't just say, okay, I know that most of these predictions are bad, so I'm only going to take the predictions with the highest possible score. As it turns out, in vivo, there seems to be very little relationship between the score, beyond a certain level, and whether a binding site is actually real. Again, this is because there are a lot of different factors that aren't incorporated into the binding site models at all. So this is kind of a sticky problem that we'll talk about a little more in a second.

But first I'd like to summarize this part, where we essentially bring you the football of transcription factor binding prediction and then I take it away, because it's too hard to do at the moment. Position-specific scoring matrices can accurately reflect in vitro binding properties of DNA binding proteins. But the predicted binding sites are way too frequent to actually use on their own. We can talk a little more later about some strategies to get around this. Are there any other questions on the first couple of parts before I move on? Questions? Okay.

Before we get back to that, let's talk a little bit about ways that we can find transcription factor binding sites de novo. This is a slightly different problem. Say you have a number of regions, say you have a gene list, for example, and you know for some reason that everything within that gene list is coordinately regulated. You want to find whether there is some motif that occurs more often than you would expect by chance within those particular sequences. So here we have three sequences. Here are three very similar short sequences that you find within these larger sequences.
And we want to be able to find those. Specifically, we want to be able to find not just one motif, but perhaps a number of motifs. We want to be able to find the widths of those motifs, since we don't necessarily know that they're, say, 8-mers to start with, and the locations of the motif occurrences.

This is actually still a really hard problem computationally. It's nice that I can do a version of some of this within a few hours today, whereas when some of these methods were first developed, it would have taken days on much smaller datasets. It's still hard because the input sequences are really long, thousands or millions of base pairs, and the motif may be highly degenerate. So it's not a matter of simply finding sequences that match each other exactly. You also have to deal with the fact that there are all of these positions that might change slightly from one occurrence to another. It's like finding a wisp of a needle within a huge haystack.

Let me make this example a little more concrete. Let's say we have a number of co-regulated genes and we're given a set of promoters. In this case, maybe we'll just take the annotations of these genes, look 500 base pairs upstream, and say, okay, these are our promoters. We want to find a transcription factor that binds to positions that are unknown to us. We don't have anything like ChIP-seq data, for example.

We also need to remember that the sites can be on either DNA strand. For the most part, transcription factors are not really strand-specific. There are definitely exceptions to this, like there are for everything in biology, but you certainly need to deal with the fact that in most cases the transcription factor does not care which strand it's binding to, and the RNA polymerase machinery does not care which strand the transcription factor is bound to.
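One way to handle both strands when scanning with a scoring matrix is to score each window and also its reverse complement. The sketch below assumes uppercase sequence containing only A, C, G and T. The three-column scoring matrix is a made-up toy that prefers the sequence GGT, and `scan_both_strands` is a hypothetical helper name rather than a function from any specific tool.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def scan_both_strands(pssm, sequence, threshold):
    """Score every window on both strands and keep those at or above threshold.

    Returns (start, strand, score) tuples, where start is the 0-based
    position of the window on the given (plus-strand) sequence.
    """
    width = len(pssm["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        for strand, kmer in (("+", window), ("-", reverse_complement(window))):
            score = sum(pssm[base][i] for i, base in enumerate(kmer))
            if score >= threshold:
                hits.append((start, strand, score))
    return hits

# Hypothetical toy scoring matrix that prefers G, G, T in that order.
toy_pssm = {
    "A": [-1, -1, -1],
    "C": [-1, -1, -1],
    "G": [1, 1, -1],
    "T": [-1, -1, 1],
}

hits = scan_both_strands(toy_pssm, "AAGGTAAACCAA", threshold=3)
print(hits)  # one hit for GGT on the plus strand, one for ACC on the minus strand
```

The second hit is the point of the exercise: ACC on the given strand is GGT on the opposite strand, so a scan of only one strand would have missed it.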
The transcription factor simply needs to be there somewhere to drive transcription. So you have to deal with the fact that we might have a motif here that's on this strand, but we might also find the reverse complement of that same motif. The transcription factor does not see a plus strand or a minus strand of DNA like we do. It just sees DNA. So this looks the same as this, and we need to find this as well.

We are going to assume that the motif of this transcription factor can be defined by a PSSM or a PWM. So our problem is to discover the motif given just these sequences. We don't have any of the data that I showed you before, where the individual sites are here, there, or wherever. We just have these big sequences and have to find the needle in the haystack.

There are a couple of different techniques that people use for this sort of thing. One is called expectation maximization. I'm going to show you a simpler approach called Gibbs sampling, which is the other main technique used for this.

Gibbs sampling involves essentially guessing an initial weight matrix. You need some sort of starting point, and the way you get it is you pick parts of the sequences at random. I'll show you that in more detail in a second. So you get an initial weight matrix, and then you use that weight matrix to predict instances of the motif it describes. Then you use those instances to refine your weight matrix, and you repeat this process over and over again.

This sort of technique is fairly commonly used in de novo discovery of all sorts of things within computational biology, where you have some sort of guess and you refine it, and then you try maybe a different guess and refine that, and you see what gives you the best score overall. It works a lot better than just doing totally random guessing.
A lot of times the search space for these problems is so big that it's not possible to do any sort of exhaustive search. But you can find that there's some threshold where, if you make enough guesses, you can reliably get a result that is fairly close to a good result most of the time.

So I'll go through this in a little more detail. For the guessing step, you don't just make up a position weight matrix entirely. You start by taking random subsequences within the set of input sequences you're given, you sum those up to make a position frequency matrix, and you convert that to a position-specific scoring matrix, or PSSM, as we've shown you before.

So you do that, and then you throw away one particular instance, and the remaining instances define your position weight matrix or position-specific scoring matrix. Then you can use that to define a probability, by scoring each position in the held-out input sequence with your newly guessed and then refined PSSM, and you can pick a new subsequence according to that probability distribution. In the end, you return the highest-scoring motif that you've seen.
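The guess, hold out, rescore, and resample loop described above can be sketched roughly as follows. This is a simplified illustration, not the algorithm as implemented in any particular tool: it assumes exactly one motif occurrence per sequence, a fixed motif width, only the given strand, and plain column probabilities with a pseudocount instead of a background-corrected PSSM. The input sequences are invented, and all function names are hypothetical.

```python
import math
import random

def build_profile(kmers, pseudocount=1.0):
    """Column-wise base probabilities (each column sums to 1)."""
    profile = []
    for i in range(len(kmers[0])):
        counts = {base: pseudocount for base in "ACGT"}
        for kmer in kmers:
            counts[kmer[i]] += 1
        total = sum(counts.values())
        profile.append({base: counts[base] / total for base in "ACGT"})
    return profile

def kmer_probability(profile, kmer):
    """Probability of the k-mer under the profile (product over columns)."""
    return math.prod(profile[i][base] for i, base in enumerate(kmer))

def motif_score(kmers):
    """Log-probability of the chosen k-mers under their own profile."""
    profile = build_profile(kmers)
    return sum(math.log(kmer_probability(profile, kmer)) for kmer in kmers)

def gibbs_sample(sequences, width, iterations=2000, seed=0):
    rng = random.Random(seed)
    # Initial guess: one random start position in every sequence.
    starts = [rng.randrange(len(seq) - width + 1) for seq in sequences]
    best_starts, best_score = list(starts), -math.inf
    for _ in range(iterations):
        # Throw away the instance from one randomly chosen sequence.
        held_out = rng.randrange(len(sequences))
        others = [
            seq[start:start + width]
            for j, (seq, start) in enumerate(zip(sequences, starts))
            if j != held_out
        ]
        profile = build_profile(others)
        # Score every position of the held-out sequence, then sample a
        # new start position in proportion to those probabilities.
        seq = sequences[held_out]
        weights = [
            kmer_probability(profile, seq[pos:pos + width])
            for pos in range(len(seq) - width + 1)
        ]
        starts[held_out] = rng.choices(range(len(weights)), weights=weights)[0]
        # Remember the best-scoring set of instances seen so far.
        kmers = [s[st:st + width] for s, st in zip(sequences, starts)]
        score = motif_score(kmers)
        if score > best_score:
            best_starts, best_score = list(starts), score
    return best_starts, best_score

# Hypothetical input sequences, each containing the planted site TGACTCA.
sequences = ["ACGTTGACTCAGGA", "TTGACTCATTTACG", "CCATGTGACTCAAC", "GGGCATTTGACTCA"]
starts, score = gibbs_sample(sequences, width=7)
for seq, start in zip(sequences, starts):
    print(start, seq[start:start + 7])
```

Because the procedure is stochastic, the result depends on the random seed and the number of iterations, and in practice it is common to restart from several different initial guesses and keep the best-scoring result.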
So, another example of this. You end up throwing out one of these instances, and you redefine the PWM or PSSM based just on the ones that are left. Here this is a slightly different formulation of a position weight matrix, where the individual columns all add up to one. I'll point out again that these things are all kind of equivalent to each other; it's just a mathematical transformation. Then, using this matrix, you can score the sequence that you've taken out, and you take whatever is the best-scoring position within this sequence, sequence four, and you add it back in. And you keep repeating this process.

So you can do this, and in the end you'll get a de novo motif. You'll get a position weight matrix, and you'll have a sequence logo of this thing that you think is responsible for coordinate regulation of the genes you're looking at. Which is great, right? So now you have a figure for your paper.

In the old days, and when I say in the old days, I've certainly done this myself, people would look at this sequence logo and then compare it against their own knowledge, or maybe look through JASPAR at individual sequence logos, scroll through, and see whether they found one that looks similar to what they found. Nowadays we have a nice tool that does this instead. It's part of the MEME Suite, and it's called Tomtom. After you've discovered a motif de novo, you can compare it against a motif database such as JASPAR. It's kind of like an alignment for transcription factor motifs, and it will give you a score for each potential motif in the database that this matches. Then you can say, okay, these regions all seem to have the MYC motif, and you can report that in your paper.

Yes? So the question was, how long are these motifs?
Usually people are looking at motifs between, say, 6 and 20 base pairs. The ones that are really well defined, I think, are usually between 6 and 12 base pairs. If you see a much longer motif, like the 20 base pair ones, there's usually quite a bit more degeneracy in those, and I think you'll probably see some 20 base pair long motifs in the lab. Often what's happening with these longer motifs is that you really have what could be better characterized as two shorter motifs. Maybe there's a dimer of a couple of zinc fingers, and it's recognizing one motif, then there's a spacer, and then 10 base pairs away it's recognizing another 6 base pair motif.

Yes? Yeah, so the question is, when you're doing de novo discovery, do you end up with one length at the end? Right, is that the question? So in general, no. This is something you can set when you're running most de novo motif discovery tools: what sort of limitation you're going to place on the length of the motif. What you usually get out of something like MEME or DREME is what it thinks the best motif is, and there will be some statistical correction for the length of the motif. But if you have a transcription factor that truly recognizes an 8 base pair motif, then finding a 12 base pair motif instead usually will not give you a lot of extra statistical confidence, so that won't be reported.

So in the end these tools will report the best motif. Really, they'll report a list of the top n best motifs, either the motifs that pass a certain threshold, maybe a p-value of less than 10 to the minus 4, for example, or maybe the top 20 motifs or something like that, and they're going to vary in length.

Let us move on. Are there any other questions on this part?
We will move on then to the next part, which is getting more specifically at the question I just introduced, which is how to find a regulating transcription factor for sets of co-expressed genes. Okay, so let's take some gene expression data. What we have here is a really old-school microarray, but we might also have the same sort of data from RNA-seq. Essentially we just have a set of expression values, whether it's a transformed microarray value or an RPKM value from RNA-seq, for each gene. From that we can infer a number of genes that are co-expressed and those that are negative controls, and we want to be able to figure out whether the co-expressed genes have some particular motif in common. So say, for example, you have this motif. You can discover whether it's here, here, and here, and not in any of your negative controls that don't exhibit the same sort of co-expression phenomenon. So what we can do is look for enrichment of a particular set of transcription factor binding sites. This is very similar to looking for enrichment of GO terms for a set of genes, except it's a lot more focused on the sequence, so for a sequence and transcription factor guy like me, it's a lot more satisfying in some ways. So we're just looking for enrichment of a particular motif, and you can follow things up from there. Here are two different examples of ways you can do this. We have this foreground set of co-regulated genes, and we have this background set that aren't regulated in the same way. One way is to look for the motifs that show up in the largest number of genes, so the largest number of genes in your foreground and not in your background. So here is a theoretical transcription factor binding site, this blue motif. You find an example of it in each one of the foreground genes and in none of the background genes. Another way you can do this is to look instead for the largest number of instances of a particular motif. So you might find a motif that
occurs with relative frequency in many promoters or even throughout the genome. So, for example, you'll find it in all of your background negative control sets, but you find it more often in your experimental, co-regulated genes. As I mentioned at the beginning of this talk, often clusters of transcription factor binding sites can be more important than finding something one time, and depending on the way a particular transcription factor works, you might find that one or the other of these approaches is better. In general I would be far more impressed to see something where you find a transcription factor binding site 20 times before a set of genes and one time before the rest of the genes. Finding it once here and not at all there is something that is much more likely to occur by chance, usually. So there are a couple of different statistical tests that you can use. You can use a binomial test to do this sort of analysis based on the number of occurrences of the transcription factor binding site, the number of individual instances, or you can use Fisher's exact test to do this based on the number of individual genes. And thankfully you don't have to code this up yourself. There are a variety of tools that will do this for you. One of these tools is called oPOSSUM. oPOSSUM is available on the web, so you can test this out there with your own set of genes. You can take a set of co-expressed genes. oPOSSUM will take your gene IDs and automatically retrieve the appropriate sequence from Ensembl. It will optionally do phylogenetic footprinting, which means you can limit to only those transcription factor binding site positions which are conserved. It will detect transcription factor binding sites using the sort of PSSM methods I showed you earlier, and then it will do statistical testing of the significance of these binding sites using either Fisher's exact test or a binomial test. And then you'll get a list of mediating
transcription factors. So here's one particular example of using oPOSSUM, where there are a number of genes that were identified as muscle-specific and a number of genes that were identified as liver-specific, and they used oPOSSUM to see whether there are any sequence motifs for previously identified transcription factors that could be used to discriminate between these two sets. What they found here were the top hits by the binomial method, so this is the method that looks at the total number of instances within any gene rather than just the total number of genes. These were SRF, MEF2, c-MYB, Myf, TEF-1, and so on. These red arrows indicate that there's experimental evidence that these transcription factors are active within muscle anyway. And over here you can see a similar example for liver. You can see the top hit is the HNF-1 motif, which I assume is something liver-related. When I see H at the beginning of a transcription factor, it usually means hepatic or hepatocellular or something. Yes? Sorry, I can't read the program. So the question is what region, what sequence is this actually looking at when you upload the gene list. It will look at the promoters of those particular genes, and it will actually download them for you. So if you're using oPOSSUM you don't actually need to supply the sequence. You just supply the gene list, so that makes it a lot simpler. Can you repeat it? It is so noisy back there. Can you designate the regions? I think that you can specify how much of a promoter you want to use. You can also do this: so this is a screenshot of the oPOSSUM web server, and you can also do a sequence-based analysis where you just supply sequences instead. So if you want to construct your own pipeline with Galaxy or on the command line, or upload a set of any arbitrary sequences, you can definitely do that. Yes? That's a good question: what if a gene has alternative promoters? What will it do in that case? So I must confess I do not actually know
the answer to that question. I can tell you that a lot of bioinformatics software will try to limit to one sequence per gene, so it will pick one transcript. Because alternative promoters are a relatively common thing, if you are doing any sort of bioinformatics analysis on promoters, where you define promoters as 2,000 base pairs upstream of a gene or 500 base pairs or whatever, you almost always have this problem of defining where the actual promoter region is. Usually people will do things like take the longest transcript, or they will pick the TSS that is most upstream, and then they will use that as the region. I don't know what they do in oPOSSUM specifically, but that's what I would guess. If you start having to deal with multiple promoters for a particular gene, the statistics certainly become a lot more complex. Any other questions? Oh, this question. Oh, okay. So oPOSSUM has a variety of different options. I usually start with the top one here. The third one here, this transcription factor binding site cluster analysis, is interesting as well, because it looks for close-together clusters of transcription factor binding sites instead of just looking at whether they occur anywhere within the sequence. So that one can be interesting as well. The anchored versions are more for looking at particular combinations, so if you think it might be not just one transcription factor but a heterodimer, two acting together. Often, though, if a couple of transcription factors work as heterodimers, there will already be an entry within the JASPAR database for that particular heterodimer, a sort of heterodimer motif. Which is not to say that everything is in there. Most transcription factors are not in JASPAR or TRANSFAC yet. But I'm not sure how many things there are that you would miss otherwise, that you would only get with one of these anchored versions. So if you're going to try this, it's probably worth trying several different versions of this. There's very little cost to
it, and see what you get from each of them. Yes? Now, all of the others are sequence-based as well. The other ones are just easier, because you only have to supply a gene list, and the software will do things like figure out what the appropriate sequence is for a human or a mouse or a fly, select the appropriate subset of JASPAR, and maybe find conserved regions to compare against as well. So the sequence-based version just means that you can do any sequence or any species. It just means that a little bit more of the pipeline will have to be constructed by you, rather than relying on a lot of the stuff being done for you, like in the human or fly case. And I know from looking at the attendee list there are people here who work on other species, and I'm afraid this is a common thing. Pipelines are already set up and annotations are made and things are done for you in the human world, and often they aren't if you're working on, say, a common organism like zebrafish, let alone something that is studied by a smaller community. Other questions? So here's an interesting thing, which is that there are transcription factors that are structurally similar. There are families of transcription factors, like the ETS transcription factors, where the proteins bind to highly similar motifs. There are exceptions, like the zinc finger family. The zinc finger family is renowned for its diversity, and there are a couple of different residues that can change and greatly change the binding motif. But in other cases, so this is the ETS binding motif as in JASPAR, and there are a variety of different matrices or sequence logos for other ETS proteins here, and they're all very similar even though they come from slightly different proteins. So the question is, when you have this huge family, what can you do? On the oPOSSUM website, or nearby on the oPOSSUM website, there is software called Chopgene that will help
you pick out which one of these may be responsible. But in reality I'm not sure how much it matters in most cases. If you found that something is related to ETS, that's what you're going to know about it, and it's going to be hard to figure out which particular member of the ETS family is binding here without actually doing an experiment yourself. So that ends the fourth part of this lecture. We've been going through this pretty quickly. Any questions before I go on to the last part of the lecture? Okay, we'll move on to part 5, and this gets back to the problem I identified earlier, which is that you can take all of these transcription factor binding predictions in bulk genomic sequence, and it doesn't necessarily mean something. There are a variety of different ways to try to understand the genomic context of a particular binding site. We want to do an integrated analysis that includes not just the sequence that might indicate some transcription factor binding, but also utilizes other data we have, such as epigenomic marks or chromatin accessibility data as determined by DNase-seq or ATAC-seq. There are a variety of methods that you can use to do this. I'll talk about one, Segway, which is near and dear to my heart because it came out of my lab. Segway will partition the whole genome into a number of different classes. It will use data from ChIP-seq experiments from ENCODE or Roadmap or something similar, and DNase-seq or ATAC-seq or FAIRE-seq experiments, and then it will make a decision for every region of the genome about what kind of regulatory function it has. So it will call a particular region as maybe having regulatory activity, or it will call a region as being transcriptionally quiescent, or it will call a region as being a distal insulator or something like that. With this sort of information you can focus on particular regions when you're doing transcription factor binding site analysis, and then you'll be limiting yourself to only
those regions that might possibly have a transcription factor binding to them and that might actually have a biological effect. That's unlikely to happen in regions that you can call as heterochromatin from a variety of ChIP-seq type data sets. So that's one tool that you can use for this sort of thing. Yeah? The question is which order do you do this in. In general, Segway, and many of these tools, will limit the part of the genome that you need to do your analysis on, so you would do that first. A nice thing about something like Segway is that usually it will already be done for you, and you can just download these annotations. So you can go to the Segway website and download a Segway annotation for, you know, liver cells, or you can download one for myeloid leukemia cells, and so on, and then only focus on those smaller regions of the genome when you're doing a particular motif scanning analysis. Yes? Are the annotations in something like ClinVar or ANNOVAR? No, so these are a little different from something like ANNOVAR. First, they cover much bigger proportions of the genome. As I understand it, ANNOVAR will have a call for various positions in the genome and say this has been described as causing this particular phenotype. Segway will divide the entire genome up into regions, and every region in the genome will be called as either repressed or transcribed or dead or something like that. And the other thing about it is that it's based on these sorts of ChIP-seq experiments, and it's an inference that's made from all of those different experiments and previous knowledge about the way chromatin works, so it's much less tied to individual stories at individual positions. So ClinVar and ANNOVAR are, as I understand them, much more point-related, and this is something that attempts to describe everything. Another interesting tool is called GREAT, which is from the Bejerano lab at Stanford. GREAT is something that you can use to tie a set
of regulatory annotations you have to genes. So you know that a particular set of regions is important to understanding some biological problem, but you don't necessarily have them all tied to particular genes. One thing you could do is some sort of motif search in these regions. Another thing you could do is put them into GREAT and do a GREAT analysis, and GREAT will essentially assign each region to a nearby gene, which by no means is a perfect solution to this problem, but for now it's often the best sort of thing we have. So in GREAT you can take a BED file indicating a particular set of non-coding regions. They don't have to be non-coding, but it is designed to work with non-coding sequence. So if you have some coding sequence, that's fine, but you wouldn't want to use this if you could mainly identify genes from your sequences to start with. It will look at various sorts of attributes that you can associate with particular positions here, and there's a website that will give you this sort of output. You'll get a list of various sorts of properties that are enriched in the sequences that you have specified. Those can be things like Gene Ontology terms, and there are also a variety of other properties that are in GREAT. So the question is for a more specific example of what you can use GREAT with. So what's that?
No, you're starting to put a view of that, some data. Yeah, I'm trying to think of a good example. Sort of, for example, you are chasing after some particular set of intergenic RNAs that seem to pop up under a certain set of conditions. You're able to map these RNAs back to the genome, they're all long non-coding and intergenic, and you want to see whether the genes that they are nearby have some particular thing in common. So are they genes that have to do with a certain biological process, genes that tend to be associated with a certain sort of tissue, that sort of thing. That's what you can use GREAT for. So the question is how does it do this, because you're providing sequence, you aren't providing a list of genes. Oh, no, okay. Should I just wait? I can shout, but this won't be picked up on the... So let me answer your first question first. The first question is how does this work when you upload a set of sequences instead of a gene list. So you aren't actually uploading a set of sequences. You're using a BED file, which has a set of coordinates related to some genome. In some ways this is indistinguishable from a set of sequences, because those coordinates uniquely define a set of sequences, but importantly they're related to that particular genome. Basically what GREAT does is take a set of intergenic regions and create a gene list, and then it will do some sort of analysis on the gene list. So this means that your BED file of coordinates has to be in one of the genomes that GREAT already understands and has annotation for. There's no sequence-based mode that I know of, at least for GREAT, because all of the data has to be assembled for a particular species. One of the nice things about transcription factor binding analysis outside of this is that often, if you learn a transcription factor motif on
mouse, it's probably applicable to human as well. It's probably applicable to most other vertebrates. These transcription factor motifs are usually very well conserved, whereas with this sort of thing you wouldn't be able to do that as much. And your question was whether you could use this for any particular RNA-seq analysis. I think it would depend greatly on what exactly you found from the RNA-seq analysis. If it's one where you're tending to find protein-coding genes, this probably wouldn't be the best tool for it. It's probably better to use the sorts of tools that were described earlier in this workshop. But if it is something that gives you predominantly non-coding alignments, and you're kind of struggling to figure out how to get a gene list from this, basically, when you're in a position where you have a set of regions and you say, I want to analyze these as if I had a gene list, but I can't get a gene list from this, that's essentially what GREAT is for. It will kind of make the gene list for you and then do the analysis. Was there another question? Yes? So if you supply a gene list or a BED file, does it always go back to the same analysis from the same database, with GREAT as well? Okay, so the question is, if you supply a gene list or a set of sequences or coordinates, will GREAT go back and get things from the database. With GREAT, there's always a database tied to a GREAT analysis, so you need to supply some sort of set of sequences or coordinates that are related to one of the existing GREAT databases. Segway is both software and also a set of pre-computed annotations that have been produced by that software. Most of the Segway annotations that have been pre-computed for you are from human data. There's also a worm and fly cross-species Segway, so that's available as well. Segway is not a web server in the same way as, say, GREAT or
g:Profiler or one of these other tools. What you get is an annotation file, which essentially is a BED file that you can then import into Galaxy and do all sorts of other analyses with. So it's a little different, but in general it's going to be something that you'll use as input to another one of these methods, or even just to display on the genome browser, rather than something that you can upload particular sequences to and it will work with. Yeah? It's always going to the same data set if you provide the same BED file or the same gene list, but if you have something like a mutation in your disease group, then you will miss it, because you always do the same analysis. You're asking what the effect is of mutations in your initial data set? For example, if you get some BED file regions in your cases and controls, and then you go back to look at the motifs, maybe there are some mutations that cause different binding sites, but since you don't use that information, you won't find them. So the question is what happens when there are, say, point mutations within the sample that you're using for input. What will happen with GREAT and some of these other tools? Did I get that right?
So GREAT will not consider that, because GREAT mainly uses annotations that are tied to particular genes, much like g:Profiler or any of these other gene-list tools. If you use something like oPOSSUM or MEME-ChIP, it will consider that, because there you're going to be supplying a list of sequences rather than a list of coordinates or genes, and oPOSSUM or MEME-ChIP will work directly with those sequences. So if you actually see differences in your sequence because of a mutation, that will be picked up by MEME-ChIP. Other questions? Okay. So let me just mention another couple of things and we'll finish up. One is interactions between transcription factors. You asked the question about what length these motifs usually are, and I mentioned that often you'll find heterodimers. In addition to the heterodimers that you'll find pre-existing in motif databases, it's also possible to discover these de novo within your data set. The MEME Suite, which we'll talk about a little in the lab, has a tool called SpaMo. SpaMo, which stands for spacing of motifs, is a tool that you can use to find whether there are two motifs that tend to be spaced by a similar amount each time. So if you can imagine two transcription factors, each with their own motif, and in some particular set of genes they are always configured so that one binds, say, eight base pairs downstream of the other, this is the sort of thing that you can pick up with a tool like SpaMo. We're running short on time, so I won't go into any more detail here. So I want to wrap up here now. There are a few big challenges ahead in developing transcription factor binding site prediction further. One is the question of understanding all transcription factors. In, say, humans there are 1,400 to 2,000 transcription factors, and we have motifs for several hundred of them. It's not enough to get a complete picture of everything, so we need a lot more biophysical data to
understand all of these things. Two, we want to understand how genetic variation in regions of transcription factor binding sites affects transcription factor binding. Third, a really important question is how to best integrate all of the epigenomic data that we've been gathering over the last decade or so. And fourth is whether we should be using more sophisticated models for transcription factor binding in the first place. Those models work very well for naked DNA, but there are a lot of things that they still don't represent. They treat each column as if it's probabilistically independent, when we know they really aren't. So in the end there's a lot of complexity within the eukaryotic nucleus, and there are a lot of things we have to understand. We're getting large-scale data on a lot of these different aspects of chromatin biology, so it's becoming more and more possible to understand this in a way it wasn't even, say, three or five years ago. But as scientists who want to understand how transcription factor binding works, we still have a long road ahead of us. So I just want to sum up and mention a few take-home points from the lecture. One is that even though we have good models for transcription factor binding in vitro, we have to remember the futility conjecture, which is that those predictions in and of themselves are not all that useful, and that we need to use other techniques, such as conservation, such as epigenomic data that can be summarized with something like Segway, and such as looking for clusters of transcription factor binding sites, to find those that might be biologically meaningful.
Second is that looking for transcription factor binding site enrichment can be used to help us figure out what is responsible for patterns of co-expression. And third is that in the future we are going to need to use a lot more epigenomic data in order to solve all of these problems. There are methods being developed that will do this, but it's still a long road ahead of us. So that concludes this lecture. I'm happy to answer any more questions. If you want to ask questions during the break, that's fine too.
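To make the second take-home point concrete, here is a minimal sketch of the two enrichment tests mentioned in the part on co-expressed genes: Fisher's exact test on the number of genes with a motif, and a binomial test on the number of motif instances. It is an illustration under stated assumptions, not oPOSSUM's implementation. The function names and arguments are hypothetical, both tests are one-sided, and the binomial null here simply assumes that each instance lands in the foreground with probability equal to the foreground's share of the total sequence length.

```python
from math import comb

def fisher_exact_greater(fg_hits, fg_total, bg_hits, bg_total):
    """One-sided Fisher's exact test on gene counts: the probability of
    seeing at least fg_hits motif-containing genes in the foreground if
    motif-containing genes were spread at random across both sets."""
    total = fg_total + bg_total
    hits = fg_hits + bg_hits
    upper = min(fg_total, hits)
    tail = sum(comb(hits, k) * comb(total - hits, fg_total - k)
               for k in range(fg_hits, upper + 1))
    return tail / comb(total, fg_total)

def binomial_greater(fg_sites, bg_sites, fg_fraction):
    """One-sided binomial test on motif instances: the probability that
    at least fg_sites of all instances fall in the foreground, when each
    instance lands there with probability fg_fraction (assumed null)."""
    n = fg_sites + bg_sites
    return sum(comb(n, k) * fg_fraction ** k * (1 - fg_fraction) ** (n - k)
               for k in range(fg_sites, n + 1))

# Motif present in 4 of 4 foreground genes and 0 of 4 background genes.
print(fisher_exact_greater(4, 4, 0, 4))

# 20 instances before the foreground genes, 1 before the background genes,
# with foreground and background promoters of equal total length.
print(binomial_greater(20, 1, 0.5))
```

With these toy numbers the instance-based test gives a much smaller p-value than the gene-based one, which matches the intuition from the lecture that 20 sites against 1 is more convincing than a motif seen once per gene in a small foreground set.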