Okay, so I'm going to allow Michael to introduce himself and then we'll turn things over to Michael. All right. Hi, everyone. I'm Michael Hoffman. I'm a principal investigator at the Princess Margaret Cancer Center here in Toronto, and I'm also an assistant professor in the Department of Medical Biophysics at the University of Toronto. So today I'm going to talk to you guys about analysis of transcription factor binding sites, predicting transcription factor binding sites, and a lot of things around transcription factor binding motifs. So my background in this area: I did a lot of work on transcription factor binding prediction during my PhD, and during my postdoc I spent most of my time looking at epigenomic data sets like ChIP-seq, RNA-seq, things like that, and transcription factor binding is again becoming a focus of my lab. Bear with me for one second. We're going to talk about several things here today. First, I'm going to give you an overview of eukaryotic transcription. Then we're going to talk about how to predict transcription factor binding sites using binding profiles. Then we'll talk about detecting novel transcription factor binding sites and how to identify transcription factor binding sites in sets of co-expressed genes or ChIP regions that you've talked about before. And finally, we'll talk a little bit about gene regulatory networks. If any of you have any questions at any point during this lecture, please feel free to raise your hand and we can get right to it right there. Starting with transcription in eukaryotes. This is a very simplified... nothing is showing up for you guys. Okay. There we go. This is the overview. It's also in your binder. There you go. So, starting with transcription, this is a very oversimplified view of what transcription is. It's a very simple model. This is a three-step process where a transcription factor binds to a binding site.
The transcription factor catalyzes recruitment of the RNA polymerase II complex, and that causes the production of RNA from the transcription start site. All right. So, here is a slightly more complex version of this, describing all of the parts that go into transcription. So, you can see here the transcription start site, which you probably have all heard of and loved before. As it turns out, this idea of a transcription start site can sometimes be a misnomer. Sometimes it can really be a transcription start region. For certain sorts of genes, the transcription start site will often vary over hundreds of base pairs upstream or downstream of where you have a TSS marked in an annotation. The TSS or the transcription start region begins the first exon of a protein-coding gene, and you have other exons that will be downstream of it. And the other things that are important to consider are all these regulatory regions. So, there's this core promoter slash initiator region, which is very important, and in many genes this contains the so-called TATA box, which is bound by TATA-binding protein, or TBP. Now for a lot of genes, this is a very convenient mechanism to describe eukaryotic transcription initiation, but it turns out a good proportion of genes do not actually have a TATA box and use a different mechanism to initiate transcription. There are also various regulatory regions. So, here we have a proximal regulatory region, which consists of a number of transcription factor binding sites. And you can also have distal regulatory regions. They can be intronic: sometimes between the first and second exon, you can actually have regulatory transcription factor binding sites that affect the transcription levels of this particular gene. You can also have them at a much more distal range as well. So you can get enhancers, and enhancers can either be in intergenic regions far away.
They can be in intergenic regions a little closer. They can be in an intron, either of this gene or of some other gene. So the important thing to consider is that we usually analyze all of this stuff with sort of a one-dimensional map. But really, what's going on inside the cell is something that's three-dimensional. And you will find that the chromatin loops around itself in three dimensions, and that means that something that seems really distal, like this enhancer right here, can actually be pretty close in three-dimensional space to where the promoter is. And these things can be linked together by a number of proteins, by co-activators, by other transcription factors, and so on and so forth. So it can get really complex. There's a lot of data that we can use to try to decipher this complexity. So the first sort of data that's interesting is data on where the 5-prime ends of the genes are. There's CAGE data, which stands for cap analysis of gene expression. Did any of you guys see the FANTOM5 results within the last couple of months? So this is an effort by RIKEN, which is a Japanese research center, to do lots and lots of CAGE analysis of literally hundreds of human cell types. If you start finding yourself interested in CAGE data, that's a good resource for this sort of thing. There's also a lot of ChIP-seq data. ChIP-seq data can generally fall into one of two categories. It can tell you where epigenetic marks are, so things like histone modifications. And it can also tell you where, in this case, RNA polymerase II itself is located, as well as where transcription factors are located at transcription factor binding sites and where various proteins, like transcription factors, co-activators, and so on, are found in regulatory regions. All right, so in order to get at this laboratory data, there are a variety of resources that are very useful. Probably the most helpful one is the UCSC Genome Browser.
You guys, how many here have used the UCSC Genome Browser before? Okay, and lots of people haven't used it yet. So during the lab, you'll get some experience with downloading files from the Genome Browser. There's also GEO, which is a project of NCBI. And GEO forms an official repository for a lot of regulation and expression data of this type. It's not always super easy to use, unfortunately. So I think it's really important that when people publish papers, their data be deposited in GEO. But often, if there's another way of getting at that data, you will be better off using that means. Like if the PIs have a website, or if they're part of some consortium that has some other website. So I'll give two examples of that. One is the ENCODE Project, which was a huge effort to try to figure out as much as we could about which transcription factors occurred at different places in the human genome in different cell types. Also in the mouse genome, and there's also modENCODE, which did the same thing for Drosophila and for C. elegans. For the ENCODE Project, all of the data is deposited in GEO. But if you actually want to use any of it, you're much better off going to encodeproject.org, which will in fact redirect you to the UCSC Genome Browser for certain things. Roadmap Epigenomics is a similar project in some ways to ENCODE, except that it focuses on primary tissues rather than tractable cancer cell lines. And also, ORegAnno is an interesting resource that has lots of information about what sort of regulatory elements you can find in various places. So to sum up this part of the... There was a summary here. No, there's not. So we'll move on here to the second part in a second. Does anyone have any questions about the first part of this lecture? Fairly straightforward stuff. So part two... the Camtasia thing drives me nuts. So part two is prediction of transcription factor binding sites.
So, you know, we know where transcription factor binding sites are in some cases, but how can we teach a computer how to find more of them? So there are a variety of ways of representing a transcription factor binding site. One is we might have a single site: we've identified with ChIP, or with some sort of other more specific assay, that a single site is somewhere where a transcription factor binds. So let's say we found the single site A-A-G-T-T-A-T-G-A. The thing about transcription factor binding is that it is extremely degenerate. So you're very rarely going to have a transcription factor that binds to a single fixed string of nucleotides and nothing else. So there will be a little variation allowed at various places in the transcription factor binding sites. So here's an example right here, where you have all of these different binding sites. And you can see some amount of similarity. For example, there's often an A in the second column. There's almost always a T in the fourth column and the fifth column, and so on. You can represent a set of sites like this using a special alphabet, what's called the IUPAC ambiguous DNA alphabet. Have you guys seen these sorts of characters used to describe DNA before? Yes, everyone? Okay. So even if you don't think you have, you probably have, in that N, as part of this alphabet, means A, C, G, or T. And you can have things like R, which means A or G, or Y, which means C or T, and so on. There's a whole set of codes that are possible there. That is also a bit of a blunt instrument, you know, because really at some point, you're forced to say, for a particular column in this sort of analysis, does this happen all the time, or is it split 50-50? Well, what if you have a case where almost all of the time it's A, a few times it's G, and a few times it's C? You really want to represent that as V, which is the code for not T.
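The IUPAC ambiguity codes described above are easy to play with in code. This is a minimal sketch, not part of any standard tool; the code table is the standard IUPAC nucleotide alphabet, but the function name is made up for this example.

```python
# Sketch: matching a sequence against an IUPAC ambiguity pattern.
# The table below is the standard IUPAC nucleotide alphabet.

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

def matches_iupac(pattern, sequence):
    """Return True if every base in `sequence` is allowed by `pattern`."""
    if len(pattern) != len(sequence):
        return False
    return all(base in IUPAC[code] for code, base in zip(pattern, sequence))

# V means "not T": it matches A, C, or G but rejects T.
print(matches_iupac("V", "A"))
print(matches_iupac("V", "T"))
```

As the lecture notes, this is a blunt instrument: each code either allows a base or forbids it, with no way to say that A is much more likely than G.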
There's more information in there, and we want to be able to represent that as well. So the representation people go to instead is called a position frequency matrix. In this case, here's the PFM that people have made from this set of binding sites. And you can quite simply, for each column, add up the number of times you see each one of these characters. You find A 14 times, 3 times, 4 times here, et cetera, and so on. And you can convert a PFM into something like this sequence logo, which I imagine some of you have seen as well. So this sequence logo is a representation of the transcription factor binding site that includes information about which sites are more constrained or less constrained, which sites permit some amount of variation, and so on. The way we actually represent these PFMs for computational tools that will identify transcription factor binding sites is with a slight change. So we convert them to what's known as a position-specific scoring matrix, the PSSM, or what people call a position weight matrix, or PWM, instead. To get one of those, you start with the PFM and you have to correct for the nucleotide frequencies in the genome. To do that in this case, we'll start with the PFM, which is this f_{b,i} quantity right here. So for every cell in this matrix, we've got an f; the b is the base, the i is the column. And we can take that base, and each base has a background frequency in the genome. So if you're working in a species that has a really high GC content, your background frequencies are going to be different. You're going to find many more C's and G's and fewer A's and T's. If you're working with a very AT-rich genome sequence, then you'll find the opposite. So having an effective background model can be very important if you're trying to discriminate between the sorts of things that have been established by evolution and the sorts of things that are just likely to occur by chance.
Finally, we weight for the confidence, or the depth, in the pattern. And the way we do this here is by adding a pseudocount. You guys see all of these zeros in the matrix. Who knows what happens when you try to take log zero? If you try to take log zero, you get negative infinity, which won't work very well here. So the most common way of dealing with this is by adding something called a pseudocount. We actually add one to everything we have here, or you can add different numbers of pseudocounts. What the pseudocount will do is it will form essentially a prior on your model of how transcription factors bind to particular regions of DNA. So if you have a prior that says that everything is one, then you add one pseudocount. And that's a very weak prior, in that if you have more evidence, so if I have that one added everywhere and then I have 50 counts, they're totally going to overwhelm the one. But if I have a very small number of counts, if I'm only making inferences about the transcription factor binding site motif from, say, five examples like we are here, it'll make a much bigger difference, because with five real counts plus four pseudocounts, four out of nine counts in each column are going to come from the pseudocount. So it's really how you set the pseudocount versus how you weight the actual data that can affect in a big way how stringent your model is and how willing it will be to accept any sort of variation or degeneracy in the model. And the final thing that's done is that we take the log of this probability. So this is kind of an optional step. You will find a lot of descriptions of position weight matrices in the literature where people don't actually do this. They just take the frequency and divide it by the background frequency. And as long as things are probabilities, that works fine as a PWM.
When people say PSSM, usually they mean something that's been log-transformed. And the nice thing about log transformation is that when you're scoring things, you just have to add them up, and it's easier for the computer; you don't have to do multiplication. Yes. Okay. So, that's a good question. The important thing to consider is that this s represents the pseudocount. And in most cases, that's going to be a constant across the whole matrix. So it's less like a weighting of this, and it's more like a de-weighting of that. So the bigger the amount of s that you add in, the less you care about the actual data that is represented in f. Does that answer your question? Yes. Okay. All right. So in the end, we have this very nice PSSM, and you can see how you can score a particular DNA sequence. So here we're going to score the sequence TGCTG. All you have to do is add up minus 1.7, 1.0, 0.5, minus 0.2, and 1.3, and you get 0.9. If we did all of this in the linear space instead of the log space, you would have had to multiply a number of fractions by each other. I don't think it's something that anyone... I certainly would not be able to do it in my head. And the thing to remember is even though computers are fast at multiplying, they're even better at adding than they are at multiplying, just like us. So it's helpful for computational reasons. So I can show here an example of a particular transcription factor binding site. This is the motif for the SP1 transcription factor. The sequence logo looks like this. And here you can see, for a real transcription factor binding site, what the PSSM will look like. So look here in the third column: we have this really big G with almost two bits of information. What you see is that if there's a G in a particular sequence, you add two to the score, and if there's anything else, you subtract 1.5. You see the same sort of thing over here.
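The whole PFM-to-PSSM pipeline described above (pseudocount, background correction, log transform, then scoring by addition) can be sketched in a few lines. The toy counts, the pseudocount of 1, and the uniform 0.25 background are assumptions for illustration; this is not the matrix from the slides.

```python
import math

# Sketch of the PFM -> PSSM conversion: add a pseudocount, normalize,
# divide by the background frequency, and take log2. Toy numbers only.

BASES = "ACGT"
BACKGROUND = {b: 0.25 for b in BASES}  # assumed uniform background
PSEUDOCOUNT = 1.0

# Position frequency matrix: one dict of base counts per column.
pfm = [
    {"A": 14, "C": 3, "G": 4, "T": 0},
    {"A": 12, "C": 2, "G": 3, "T": 4},
    {"A": 1, "C": 1, "G": 18, "T": 1},
]

def pfm_to_pssm(pfm, background=BACKGROUND, pseudocount=PSEUDOCOUNT):
    pssm = []
    for column in pfm:
        total = sum(column.values()) + pseudocount * len(BASES)
        scores = {}
        for base in BASES:
            # Corrected frequency relative to background, then log2.
            freq = (column[base] + pseudocount) / total
            scores[base] = math.log2(freq / background[base])
        pssm.append(scores)
    return pssm

def score(pssm, sequence):
    # Because the matrix is in log space, scoring is just addition.
    return sum(col[base] for col, base in zip(pssm, sequence))

pssm = pfm_to_pssm(pfm)
print(round(score(pssm, "AAG"), 2))
```

Note how the zero count for T in the first column, which would be log of zero without the pseudocount, becomes a finite negative score.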
Whereas if you go to column 8, G is still the best score, you get 1.5 for that, but other things do not improve the score as much. They certainly won't decrease the score as much as you have here. So what's in gray in this particular panel shows the score that will be produced by the sequence in the blue box using the SP1 PSSM. You get an absolute score of 13.4. There are other ways of doing the score. So one is you can do a relative score, where you take the maximum possible score and the minimum possible score, and you use those to set up essentially a model of what the best and worst possible case for a sequence binding to that transcription factor is. And then you take the absolute raw score, minus the min score, divided by the max score minus the min score, and you get a percentage in the end. So you can see that this particular sequence has a 93% relative score. You can also use methods to convert this sort of thing into a p-value, which can also be very useful; people often like looking at this sort of thing. To do that, you'll look at all of the possible scores and just see what percentage have a higher score than what you've actually determined through a raw score. So a lot of the work in making transcription factor binding site motifs has already been done for you. There's a great database called JASPAR, which you can find at this URL, jaspar.genereg.net. The people behind JASPAR have collected data from publications, from direct submission, from looking at ChIP-seq data sets, and have generated this huge database of PFMs and PSSMs. And you can download those and use them for your own analyses. Or if you're using things like MEME, it will usually be a drop-down. If you're using MEME on the web, you can use the drop-down option to just pick the JASPAR motifs. And so it's really easy to scan your sequences against these known motifs, and you can identify potential binding sites that way.
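The relative-score calculation just described, raw score minus minimum possible score, divided by maximum minus minimum, might look like this in code. The toy matrix is made up for illustration; it is not the actual SP1 PSSM.

```python
# Sketch of the relative score: rescale the raw score between the
# worst and best scores the matrix can possibly produce.

pssm = [
    {"A": 1.3, "C": -1.7, "G": 0.5, "T": -0.2},
    {"A": -0.8, "C": 0.9, "G": 2.0, "T": -1.5},
    {"A": 0.2, "C": -0.4, "G": 1.5, "T": -1.1},
]

def raw_score(pssm, sequence):
    return sum(col[base] for col, base in zip(pssm, sequence))

def relative_score(pssm, sequence):
    # Best case: take the max of each column; worst case: the min.
    best = sum(max(col.values()) for col in pssm)
    worst = sum(min(col.values()) for col in pssm)
    return (raw_score(pssm, sequence) - worst) / (best - worst)

print(round(relative_score(pssm, "AGA"), 2))
```

A sequence that hits the best base in every column gets a relative score of 1.0 (100%), and the worst possible sequence gets 0.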
All right, let's talk about a few of the conclusions we should make when looking at the history of transcription factor binding prediction and some experimental validation. People have been developing models like this since the 80s; I think one of the first papers I know of is from 1986. In the 90s, people said, let's test some of these models. They were getting more high-throughput techniques than they had before. Certainly not the high-throughput techniques of 2014, but they were able to do some tests in the laboratory. And for example, in this paper, Tronche found that of the binding sites predicted with these sorts of methods, 96% were actually bound. And Gary Stormo and colleagues found that transcription factor binding matrices, these position weight matrices, fit very well with a biophysical model of how proteins and DNA interact. So it seems like these models are very good at finding these sites. If you have a real site, there's a very good chance that one of these models could predict it. The bad news is that there are also lots of other sites that these models will predict. So you can find that some profiles will predict transcription factor binding sites every 500 base pairs of sequence or so. This will depend a lot on the information content of the motif. So how tall, in the sequence logo, all the letters are, how many columns there are, et cetera. But for some simple motifs, you can find stuff like this. So you might find 20 sites per gene. Is this really realistic? Perhaps not. And here is a case where someone looked at a set of profiles on the human alpha-actin gene, and you find that there are transcription factor binding sites all over the place.
If this were representative of reality, that would mean that basically the entire length of the gene and its promoter region is totally covered by transcription factors all the time. No room for any of the RNA polymerase machinery to get in there at all. And really what this means is that this stuff alone does not have the predictive value that we need. So this is something that Wyeth Wasserman proposed, the so-called futility conjecture. This is from Albin Sandelin and Wyeth Wasserman, 2004. They called it the futility theorem back then; futility conjecture is probably a better name for it. So where you have a real binding site, a transcription factor binding site prediction will probably be right, but most transcription factor binding site predictions do not represent sites of biological relevance. And increasing the stringency at which you apply these models does not help. So it's not the case that we just aren't being strict enough about what we accept as a binding site. If you start increasing the stringency, you're going to start throwing out a lot of the real binding sites as well. You're going to start throwing out the baby with the bathwater, so to speak. So the problem is that there's some more data that is essentially missing. There are other things that are going to affect whether a transcription factor is really binding someplace or not than just a simple model of which sequences exist there. So let's sum up this second part of the module. Position weight matrices, or PSSMs, accurately reflect the in vitro binding properties of DNA-binding proteins. But suitable binding sites occur far too often for us to make this a useful prediction of in vivo function. So we have to go to other methods to help us out. Are there any questions about this part of the module? All right, you guys are good. So one thing that we can do, moving on to part three:
We can look at particular regions that we've already identified as regulatory. So I'm going to start by looking at discovery of transcription factor binding sites, discovery of novel binding sites, and then later we'll go into identifying binding sites that we already know about. So this is what we'll call the motif discovery problem. Let's say you have a number of genes that you know are co-regulated, or a number of sequences that you know are in the same regulatory pathway. You know this either from looking at gene expression data, so RNA-seq or microarray, where you find that some things are well correlated together, or you know it by looking at ChIP-seq, where you know that a bunch of regions are actually bound by the same transcription factor. The motif discovery problem is the problem of determining which motif, in this case which PWM or PSSM, will represent a binding site that is common to many of these sequences. In the previous example I gave you, right here, it's really easy to generate the PWM, right? Everything's all nicely aligned in each column. The problem in this case is that even though there's a common motif, the instances aren't going to be in the same columns. So here we have sequence one, and we have these three sites that look somewhat similar. These look like they might represent the same motif, but they all start at different positions within their region. So we need to come up with another method of dealing with this. And there are another couple of problems to deal with. The first problem is that the input sequences are really long. So even this example, where I limited things to about a thousand base pairs, is maybe not necessarily realistic, especially considering that things can be affected by distal regulatory elements that are far away. And you'll have many thousands of candidate cases for each one of these motif discovery tasks. And also the motif can be very subtle. Instances can be short.
And again, the transcription factor binding sites can be degenerate. Let's do a brief example here. So we'll start with a set of promoters from co-regulated genes, right? So here are the genes right here. We've already flipped everything so that... you know, in reality, if you pick six genes, some of them will be on the plus strand, some of them on the minus strand. Here we're already looking at things in the context of each gene's template strand. So here is our set of promoters. Arbitrarily, we've decided we're going to look at 100 base pairs upstream of the TSS in these genes. If you were doing this as a real analysis, you would probably want to do something at least 500 base pairs. Usually people will do 2,000. Sometimes people will do 5,000. That's a sort of good rule of thumb when people are trying to do a very quick attempt at defining what a promoter is. So there's a transcription factor that binds to some positions that we don't know. And even though the genes are all arranged here on one strand, the transcription factor can actually bind on either strand. So again, we think of all of this as, or at least I think of all of this as, information with an orientation, but within the cell, that's not really how things are. The nucleus is a three-dimensional machine, and if a transcription factor protein is binding one way or the other to one side of the major groove or the other, it doesn't really care, and it's probably still going to be able to connect to other proteins in the transcription initiation machinery and make things happen no matter which way it's facing. So we have to take the strand into account as well. So here's the case. You've identified here, and I'll show you how we actually did this in a minute, but this is just showing you what we have to find. So here's an instance of the binding site. Here's another instance of the binding site. Here's an instance of the binding site in the reverse direction.
So, you know, here we had one start with A, A, A, G. This one starts with T, T, C, A, A, C, T. So you use the Watson-Crick base-pairing rules and reverse things. All right. How are we going to come up with a PSSM that describes this? Here is one example. The simplest way I know of doing this sort of thing is to use Gibbs sampling. Gibbs sampling uses an approach that's quite common in machine learning, which is an alternating approach. So you're trying to come up with some sort of set of parameters, right? You're trying to come up with parameters for a model. In this case, your model describes which binding sites, which sequences, your transcription factor has an affinity for. So we have a model, which is the PSSM. We want to come up with parameters, which are the values that go into the PSSM. So a simple way to do this is just to guess the parameters and to use those parameters to score the sequences that you have and see what gets you the highest score. So this is even simpler than Gibbs sampling: you can just start with a random matrix. Does that give you a good score? Then start with another random matrix. Does that give you a good score? You can do that thousands or millions of times, and maybe you will luck out and you will find something that actually gives you a good score for this particular region. Gibbs sampling does something that's a little bit more sophisticated, where instead of just guessing again at the beginning, after you predict the instances in the input sequences, you then have a set of aligned binding sites, and then you can use those to predict a new weight matrix using the original PWM formula I showed you before. So it's a simple case of starting by guessing at random, then looking at the results, assuming you got the right results, refining your model, and then repeating the process again.
So you identify results with the new refined model and then use that to refine the model once more, and then you repeat the process over and over again. And this works. This is a much more effective use of computation than just guessing completely from scratch each time. So I'll show you what I just described in some detail here. We will guess an instance from each one of the input sequences. So just totally, totally random. We pick this, this, this, and this. And then, for the search part of this, we can throw away one of the instances. So we picked five sequences here. We'll throw away sequence number four, for example. And then we use the weight matrix to define an instance probability at each position of that input sequence. And we'll pick a new region of the sequence for our next round. And we'll go through all of the individual sequences here and we'll repeat the whole process over and over again. So here's this explained in yet another way. Here we've picked at random from sequences one, two, three, and five; we've picked from four too, but we've thrown it out here. All right, we feed that into the general PSSM formula to produce this PSSM. And then we look at sequence four. We score all of the possible binding sites in sequence four, and then whatever is best, we feed that back into this starting matrix here. All right, so this is how the Gibbs sampler works. A lot of what people use today is a software package called MEME, which uses something that is similar to the Gibbs sampler except that it uses something called expectation maximization. It still uses a sort of alternating approach. The difference is that instead of picking things at random, it uses a method that allows it to try every possible starting matrix that you could get out of a particular set of sequences. So MEME will give you the same answer every time you run it, and it can give you results that are as good as Gibbs sampling.
Gibbs sampling could give you better results, but you never know when you're done with Gibbs sampling; you can keep running Gibbs sampling forever. You can always start with a new initial matrix and try something else. So MEME can be much more satisfying in the respect of having something where you know what the answer is going to be and you get the same answer each time. How are people doing with this? That was probably some of the hairiest stuff we'll talk about in this particular module, so people may have questions. The question is whether you have a transcription factor and it will have some cooperative partner that might allosterically bind the transcription factor and change how it binds to DNA. Is that the question? That's definitely a possibility. So the way you deal with that in this sort of PWM model is by having different PWMs for the cases where the transcription factor is affected by some other co-activator. So let me, I think, use this as a jumping-off point here. So here is the JASPAR database that I told you about before, and we can look at all of the different matrices that have been curated into JASPAR. So you find a number of them in this one; pretty much any well-characterized transcription factor can be included here. But you also find things like this. So this one, DDIT3::CEBPA: those are two transcription factors, and the two colons are a way of saying that this is the position weight matrix that is associated with these two working together. Now, do we know from this whether it's CEBPA or DDIT3 that is actually contacting the DNA? No, we don't know. For the purpose of just trying to model transcription factor binding sites, it doesn't really matter. It's convenient to think of them as a single entity. Now, this is a really crude way of doing this.
So you can think of models that might be more sophisticated, that would include something like a PWM where you have an extra variable and you use some sort of graphical model to control the different probabilities being emitted, but someone would have to develop all of that stuff, and it would be kind of a fun project for someone who's interested in the computer science side of computational biology, but in the end you don't know whether you would get results that are actually more biologically useful or not. Did you have a question also? Oh, that's right. Like I said, this sort of alternating approach is very common in machine learning, and that problem of when you stop sampling is equally common, and you basically have to define some sort of arbitrary stopping point. So in this case what you might do is keep repeating this process until your score stops increasing. So you start with a new random set and you go through all of this, and at a certain point the Gibbs sampler will stop increasing the score by very much, and at a certain point, if you throw out the results and you start over again and keep sampling with a number of new initial matrices, they won't outperform your previous score. So you do something like this: once you've found a matrix with a score above a certain threshold, you try a thousand new starting matrices, and if you haven't found any that perform better after a thousand, then you're good. Or you can do a million. There's always more to do, though, which is kind of a problem with these sorts of methods. They don't guarantee a closed-form solution. Any other questions? Those are really good questions. These are things that people who write these methods perhaps think about at the beginning of the method rather than at the end. All right, so once you've discovered a matrix, so you get out of running something like MEME or a Gibbs sampler or DREME or something similar, you will get a motif.
And often — we have so many matrices in JASPAR, so many motifs that have been identified already — it can be really helpful to ask the question: does my discovered motif match up with something that's already in JASPAR or in TRANSFAC? You can use a tool called TomTom, which is part of the MEME Suite, to do exactly that. It's very simple: you paste in a PSSM, it compares against JASPAR or TRANSFAC, and it gives you an alignment — not of sequences, like BLAST, but an alignment of one motif to another. It is rather similar to BLAST in that you get a P-value and an E-value, and things are corrected for multiple comparisons. This can be useful for two reasons. One, you might have found another instance of a transcription factor that's already known. Or you might have found a novel transcription factor that's similar to one of the transcription factors people already know about. That's quite common, because transcription factors arise in evolution via a process of duplication and substitution, so you'll end up with transcription factors that, even if their amino acid sequences are very different, have very similar recognition motifs. All right, so that is the end of part three. Part four is where we talk about inferring which transcription factors regulate sets of genes. For this, we can start with information we have from some sort of gene expression experiment. So here we have a microarray — it can be RNA-seq as well. As a computational biologist, when I look at something like this, what it represents to me is knowledge of which genes are co-expressed: which genes do you find that have high expression, or correlated expression, within the same cellular environment? And you also get a set of negative controls that do not have this same sort of expression. 
All right, and we have a particular motif that we want to use to interrogate these regions — the co-expressed genes and the negative controls — and we need a method that will say this particular binding site is present at all of these co-expressed genes and you don't find it in any of the negative controls. There are a couple of ways you can do this. You can do a de novo discovery of what's occurring in the co-expressed set versus the negative controls, which is what's done by DREME. But a simpler way is to take the set of binding motifs that we already have in JASPAR and scan both of these sets with them, using software like FIMO. So this is in some ways very similar to looking for GO term enrichment — did you guys talk about GO term enrichment earlier today? Okay, so similar to that, but instead of looking for a particular GO term, you are looking for a particular transcription factor binding motif that occurs over and over again. There are a couple of different ways to measure enrichment. Here's one way: looking at a particular binding site, here are genes that are co-regulated, and here are negative controls, or the background, and we find that this binding site occurs in 100% of the genes in the foreground — the co-regulated genes — and zero times in the background. So that's a clear case of enrichment. Another thing you can look at is the number of binding sites. Here we might find a motif that occurs in the negative controls, but occurs many more times in the positive controls. Those are both good ways of looking at this problem, and there are a couple of different statistical techniques that you can use. One way is a binomial test. That will give you a Z-score based on the number of occurrences of the transcription factor binding site relative to the background, and you can use that to get a P-value. 
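Both statistical flavors — the occurrence-count test just mentioned, and the gene-count test that comes up next — can be sketched with the standard library. All counts here are invented, and the formulas are illustrative rather than the exact implementations used by any particular tool.

```python
import math
from math import comb

# Occurrence-based test: normal approximation to the binomial.
def binomial_enrichment_z(hits_fg, bp_fg, hits_bg, bp_bg):
    """Z-score for site occurrences in the foreground, relative to the
    per-base-pair rate estimated from the background."""
    p = hits_bg / bp_bg               # background rate per base pair
    expected = p * bp_fg
    sd = math.sqrt(bp_fg * p * (1 - p))
    return (hits_fg - expected) / sd

# Gene-count version: one-sided Fisher's exact test on a 2x2 table.
def fisher_exact_greater(a, b, c, d):
    """a = foreground genes with the site, b = foreground without,
    c = background genes with the site, d = background without.
    Returns P(overlap >= a) under the hypergeometric null."""
    n = a + b + c + d
    p = 0.0
    for k in range(a, min(a + b, a + c) + 1):
        p += comb(a + c, k) * comb(b + d, a + b - k) / comb(n, a + b)
    return p

# Invented counts: 40 occurrences in 10 kb of co-expressed promoters
# vs 100 occurrences in 100 kb of background sequence...
z = binomial_enrichment_z(hits_fg=40, bp_fg=10_000,
                          hits_bg=100, bp_bg=100_000)
# ...and 8 of 10 co-expressed genes carry the motif vs 10 of 90
# background genes.
p_enriched = fisher_exact_greater(8, 2, 10, 80)
```

With these toy numbers the Z-score comes out large and the Fisher P-value tiny, matching the "clear case of enrichment" picture above.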
You can also use a method based on Fisher's exact test, and that will be based on the number of genes containing the site instead of the number of occurrences within the genes. So again, this is the case where we count genes, and this is the case where we count occurrences, or instances. There is software called oPOSSUM, created by Wyeth Wasserman's lab at UBC, that will do all of this for you. You feed in a set of co-expressed genes from your experiment — you can just feed in gene IDs — and it will get the sequence from the Ensembl database. You can optionally do phylogenetic footprinting, and it will detect transcription factor binding sites and give you the statistical significance of each binding site by either one of these two methods. Now, phylogenetic footprinting is something I haven't really gone into here. It was in a previous version of this lecture, but I've taken it out. Phylogenetic footprinting is a way of looking at sequences across many different related species. One inference you might make is that if something is a real, important transcription factor binding site of biological relevance, you might expect to find it in human, but you would also expect to find a similar binding site aligned to the same region in mouse and in rat and in dog and in all sorts of other related species. So this can be a way of limiting your search to areas where you find more evidence for a biologically relevant binding site than just the PWM score. I don't actually like doing this anymore, because a lot of what people have found is that there are real binding sites where you won't find a phylogenetic footprint. Evolution often acts differently on regulatory sequence than it does on protein-coding sequence, and even though sometimes you will find that a site is conserved — let's say you have a MYC site in some particular promoter, and then an SP1 site. So you look in human: you've got MYC and you've got SP1. 
If you look in mouse, you might find the order of these is reversed. This will fail any sort of analysis that is designed to find an alignment between these two species, because the order has gotten mixed up. Yet they retain their function in terms of keeping the gene within a particular regulatory program. So phylogenetic footprinting can be useful, but I think of it as something very stringent, and when you use it you can often expect a lot of false negatives. Oh, another question. Yes, please. oPOSSUM will actually do all of this. I believe — Bernique, is the integrated lab going to cover oPOSSUM? Are you using it? Yeah. So you will have, after dinner, a chance to use oPOSSUM yourself, and it should do all of that, so it makes it a lot easier for you. Other questions? Okay. So the people who created oPOSSUM did some amount of validation. They took some gene sets that they knew were either muscle-specific or liver-specific — they did some gene expression experiments in muscle tissue and some in liver tissue — and then they tried to identify transcription factor motifs that were overrepresented, or enriched, in the genes that are co-regulated in muscle or in liver. And here you can see the results. oPOSSUM said that the transcription factor motifs that came up most often for the muscle data were SRF, MEF2, Myf, and TEF1. Many of these are known to be muscle-related, and there have also been other experiments where people have verified that the transcription factors actually bind to these regions within the reference sets. It's something very similar with liver. So the oPOSSUM server — there's not a direct link from that wiki page, but this is what it will look like, and you can do all sorts of different analyses, either by looking in a set of genes or by looking for combinations of transcription factors, which I think goes to your question earlier. 
You can find whether a particular pair or triplet of transcription factors occurs together. So here's the structure of a transcription factor. There have been solved structures in the PDB for a lot of different transcription factors, and most of the time transcription factors with a similar three-dimensional fold will have highly similar DNA recognition sequences. So let's look at this ETS transcription factor here. I mentioned a little earlier that you can find similar transcription factors with similar binding motifs. This motif is for a core ETS transcription factor binding site, but there are a variety of different proteins that can bind to that binding site. Here's a list of transcription factors in the ETS family — as you can see, there's a couple of dozen of them. Almost all of them will return a good match using the sort of PSSM scoring scheme we talked about earlier. So when you want to discriminate between different examples of things like this, it's important to use some other software. There's software called Hopkin that approaches that very problem: it will look at similar transcription factor binding sites and tell you which is the best one of all of these for a particular set of sequences. Are there any questions on this part of the talk? Yes. Well, I believe you need to have a set of co-regulated genes and also a set of background genes. Oh, that's not good. If I'm on the right page. That's really helpful. This is the page I was just on. I hope they fix the problem in time for your lab. It's different than the picture. Well, here's another one, the single site analysis. Okay. Thank you for noticing that. So you can do single site analysis here. And this is the set of genes — you can enter gene IDs, though I think actually they'll want different kinds of gene IDs than that. 
For your co-expressed genes, and then for the background, you can either use all genes, or all other genes, or you can choose to put in a set that you know is not in the experimental set — the positive set. So you have your choice; you can specify it or not. I believe it only processes a gene list — at least this version of it, and I don't think the other versions do either. So, yeah, the input is pretty crude. Finally, we will go to the last section of this, which is gene regulatory networks. The goal is to be able to predict regulatory regions in a given — oh, this is the wrong URL — oh well, in a given cell or tissue, based on integrative analysis of diverse genome-scale data. I'll give you an example here, which is Segway, a piece of software produced by my lab. Segway takes data from things like ChIP-seq and DNase-seq and uses them to segment the genome into subclasses. So you can take observations about the epigenetic state of a particular part of the genome, and you can use those to say this region is something that appears to be regulatory: this looks like a distal enhancer region, this looks like a transcription start site, and so on and so forth. This can be very useful as an additional way of determining where transcriptional regulatory activity is real. I think this is the approach that is favored these days over using, say, phylogenetic footprinting: looking at data from ENCODE or Roadmap Epigenomics and identifying a priori regions as regulatory, and then looking for transcription factor binding sites. That will tell you about the way chromatin is shaped in three dimensions and will allow you to say this is a region where the transcription factors are likely to be able to get access to the DNA. Often the chromatin is coiled around itself tightly enough — in the case of heterochromatin — that transcription factors or the transcriptional machinery simply can't get access. 
Sometimes that won't happen, but you'll have nucleosomes in the way. So these sorts of other data sets can be very helpful in figuring this out. Another tool that you guys should be aware of is called GREAT. GREAT is something from the Bejerano lab at Stanford, and it is kind of a combination of the tools that we use to analyze non-coding and regulatory regions and the sort of gene set enrichment analysis that you talked about earlier today. What GREAT does is take regulatory regions that you've identified with some other method — you can give it a BED file, which is a simple representation of regions in the genome — and it will tell you what sorts of genes are nearby. So here's an output where we took some regulatory regions and asked it what the function of nearby genes is, and this can be associated with various sorts of annotations. That can be very useful if you have a very particular kind of data set. Yeah — I think it should work. Most of what it does is look for the nearest gene. So if you have something in coding sequence, I think it should identify it as being within the gene. Again, the important thing to remember is that from a transcription factor's point of view, it doesn't necessarily see the genome as divided into coding sequence and non-coding sequence. So something that attracts a transcription factor could very much be not only in the first intron — there might also be stuff occurring within the first exon. Sometimes the transcriptional machinery itself will get in the way, and then that won't happen. But I think it's important not to disregard coding regions as regions that aren't going to affect regulation. They definitely do in some cases. Is there another question? 
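The nearest-gene association just mentioned can be sketched in a few lines. This is a toy illustration only — GREAT's actual rules use basal regulatory domains plus distal extensions, and the gene names and TSS coordinates here are made up.

```python
# Hypothetical nearest-gene assignment: map a genomic region to the
# gene whose TSS is closest to the region's midpoint.  GREAT's real
# association rules are more elaborate than this.
def nearest_gene(region, tss_by_gene):
    """Assign a (start, end) region to the gene with the closest TSS."""
    midpoint = (region[0] + region[1]) // 2
    return min(tss_by_gene,
               key=lambda gene: abs(tss_by_gene[gene] - midpoint))

# Made-up TSS coordinates on one chromosome.
tss = {"GATA1": 1_000, "MYC": 50_000, "SPI1": 120_000}
hit = nearest_gene((48_000, 49_000), tss)
```

A region inside a gene's coding sequence falls out of this scheme naturally: its midpoint is simply closest to that gene's own TSS, which is why coding-sequence regions still get assigned sensibly.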
So, as I showed you, there was that top option in oPOSSUM to look at single site analysis, and then there was the next option, combination site analysis. That is a way of looking at which transcription factor binding sites co-occur. Another way is through the fact that people have identified some of these co-binding complexes that I mentioned before. But sometimes one factor won't actually affect the binding of the other — they just happen to bind near each other. And oPOSSUM can find that. There's also something in the MEME Suite called SpaMo. The "Spa" part means spacing — it's a tool for motif spacing analysis, and it's something you can use to find pairs of motifs that occur near each other. So if you look in a set of sites and you find these two motifs together all the time, 19 base pairs apart from each other, SpaMo can tell you whether that's actually statistically significant. That's something you might want to consider: are these two motifs in some sort of larger complex? There are a lot of big challenges ahead in understanding transcriptional regulation better. If you look at something like JASPAR, the number of transcription factors is actually still quite small. The number of transcription factors humans are believed to have is something between 1,400 and 2,000, and we're still looking at hundreds of transcription factors. In ChIP-seq data — in the human K562 cell type — we have hundreds of transcription factors. Various groups in Canada and the U.S. are trying to attack this problem of getting all of the transcription factors. I think this will be very worthwhile once it happens, but until then we have to deal with the fact that we have a limited set to look at. 
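The motif-spacing test SpaMo performs can be caricatured like this: count the gaps between a primary and a secondary motif, and ask whether the most common gap occurs more often than a uniform null would predict. SpaMo's real null model accounts for sequence composition and strand; the gap counts below are invented.

```python
from collections import Counter
from math import comb

def spacing_enrichment(spacings, max_gap=100):
    """Given observed gaps (bp) between a primary and secondary motif,
    test whether the most common gap is enriched relative to gaps
    falling uniformly over 0..max_gap (illustrative null only)."""
    counts = Counter(spacings)
    gap, observed = counts.most_common(1)[0]
    n, p = len(spacings), 1 / (max_gap + 1)
    # One-sided binomial tail: P(X >= observed) under the uniform null.
    tail = sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(observed, n + 1))
    return gap, observed, tail

# 200 gaps: 30 extra occurrences at 19 bp on top of quasi-uniform noise.
gaps = [19] * 30 + list(range(85)) * 2
gap, obs, p_val = spacing_enrichment(gaps, max_gap=100)
```

A vanishingly small tail probability for the 19-bp gap is what would prompt the "are these two motifs in some larger complex?" question above.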
People are trying to understand the impact of genetic variation on these transcription factor binding sites as well. If you have a position in the binding site with very high information content — a really tall letter in the sequence logo — that can be very important for the regulation of the gene. If that changes via mutation, whether in the course of evolution or in the course of a cancer, people are trying to figure out how that will affect binding. Will it cause one transcription factor to disappear and another one to appear? People are still trying to develop good models to answer that question. Integration of data sources: a thing I mentioned earlier is that you can look at the results of something like Segway and use that to pre-define a region of the genome as being regulatory. People have used other approaches, like phylogenetic footprinting, where they pre-define a region as regulatory because it is conserved across evolution. Those are both fairly crude ways of doing this, and people are trying to come up with ways to incorporate all of the information we have about the chromatin state of a particular region — and maybe the evolutionary state and the population state of a particular region — to try to make better models of transcription factor binding. And finally, I showed you earlier how you can go from a crude model of a transcription factor motif using IUPAC ambiguity characters to a more sophisticated one using a PSSM. But that, in its way, can still be crude as well. One of the things about the PSSM is that all of the columns are independent, and in reality this may not fit what's really happening. 
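To make the "tall letter" point concrete, here is a sketch that computes per-column information content for a toy PWM (not any real factor's matrix) and compares the log-likelihood-ratio score cost of mutating a high-information position versus an uninformative one.

```python
from math import log2

def column_information(col):
    """Information content (bits) of one PWM column of A/C/G/T
    frequencies, relative to a uniform background."""
    return 2 + sum(f * log2(f) for f in col if f > 0)

def llr_score(pwm, seq, background=0.25):
    """Log-likelihood-ratio score of a sequence under the PWM,
    treating columns as independent (the standard PSSM assumption)."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    return sum(log2(col[index[base]] / background)
               for col, base in zip(pwm, seq))

# Toy motif with one nearly invariant column (the tall logo letter),
# one uninformative column, and one weakly informative column.
pwm = [[0.97, 0.01, 0.01, 0.01],
       [0.25, 0.25, 0.25, 0.25],
       [0.40, 0.10, 0.10, 0.40]]
info = [column_information(c) for c in pwm]

# The same A->C change costs far more score at the high-information
# position 0 than at the uninformative position 1.
drop_pos0 = llr_score(pwm, "AAA") - llr_score(pwm, "CAA")
drop_pos1 = llr_score(pwm, "AAA") - llr_score(pwm, "ACA")
```

This is the intuition behind asking whether a mutation at a tall-letter position destroys a binding site, while the independence of the columns is exactly the simplification the next part questions.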
It's quite likely — and I think people have actually shown — that if you have one nucleotide bound in one column of a transcription factor motif, that will affect what nucleotides are going to bind at the next column. That's something that's totally disregarded in this model, and it's something that I think people are thinking about how to fix. So we'll go back to this three-dimensional picture of what's actually happening in the nucleus. There's a lot we've learned over the past couple of decades about how individual transcription factors bind to particular DNA sequences, but that context — which can be all-important in distinguishing the relevant, biologically important transcription factor binding sites from the ones that don't actually end up occurring in vivo or even in vitro, the in silico-only binding sites — is something that is still somewhat elusive. But people are starting to get better and better models of chromatin. People are starting to understand, with techniques like 3C, 5C, and Hi-C, how distal regions of the genome are tied to each other. People are starting to understand how distal enhancers are linked to proximal regions of the genome, both where histone modifications occur that might mark regulatory sequence, and how to determine where those histone modifications occur based on the sequence alone. So this is a complicated hairball of a problem, but there are many groups working on different aspects of it, and I'm hoping that things are going to become as much better in the next five years as they have in the previous five. So, I'm going to sum up the lecture part here. I told you a lot of different ways of looking at transcription factor binding sites, but I think the most important thing to remember is the futility theorem. 
Just because you find a binding site, it might not really be something you're interested in. So it's always important to use these results as hypotheses — as things that you further confirm with other methods — rather than doing a binding site prediction, stopping, and saying that's your conclusion. It's really something that is used for discovery rather than confirmation of a hypothesis. The final thing — I'll skip most of this, but the final thing I should reiterate is that one of the best data sources for this sort of additional confirmation is the data you can get from ChIP-seq, from the results of projects like the ENCODE Project and Roadmap Epigenomics. These are things that will tell you more than just that the sequence happens to fit a particular PWM; they tell you about the crucial chromatin context that's important in understanding where transcription factor binding sites occur. So, I think I'm going to stop there. Do people have any questions at this point? I don't feel like we're at a loss. I feel like these are just parts of the model that people are actively working on. So basically, for a long time people have had what I think of as the big problem in the bioinformatics of gene regulation, which is: you have sequence, you have some information about a particular cell type or tissue type, and how do you figure out which genes are going to be expressed? The problem of which transcription factor binding sites are going to attract real, honest-to-God transcription factors is, I think, a subset of that problem. And for a long time, much of this has been kind of a black box. Sometimes the binding sites are there, sometimes they're not, and — who knows — let's look and see whether the binding sites are there across evolution. 
It's only been within the past couple of years that we've had other data sources we can rely on. So — this is the wrong section. One of the focuses of my lab is understanding this problem, and I like to look at it this way. For a long time we've had the sequence and we wanted to guess where the RNA is. Sometimes we want to guess where transcription factors are binding. There are all of these other variables that are important in understanding that: where the epigenetic modifications are, where the chromatin remodelers are, what the structure of the chromatin is. Is the chromatin packed tightly? Are there well-positioned nucleosomes here or there? We're finally starting to get lots and lots of data on these questions, and that makes it much more likely that we'll be able to get a complete model. Before, we just had no idea, so even if we wanted to create something that had a model of all that stuff, it would all have been a total guess and would have been totally intractable for a computer to figure out. And I think as high-throughput experimental techniques improve, we'll get more and more data, and people like those in my group will be able to incorporate it into these models and improve them. On computational complexity — I should clarify, it's not a problem of finding an algorithm that's fast enough; people still don't have a model that works. And I think that's the main problem: imagining a model that will incorporate all of these different sorts of data well. People are making stabs at this, and doing a better and better job of it each year. But I feel like there's still a long way to go, and it's not like people are just trying to eke out a small bit of performance, either in computer time or in predictive ability. There's a lot they can do. 
And the other thing is that people keep inventing new experimental techniques that we can use as raw data for this sort of work — things previously unimaginable, and now we have large-scale data sets that include them. Other questions? Okay. Well, so we start our coffee break now then, in that case, and we come back at 3:30. So enjoy your break. If you have any other questions, I will be here, and I will still be here at 3:30, and we will talk about the lab for this section. Thanks, guys.