Well, what I'm going to do is give you a sense of what's going on under the hood, because I want you to be skeptical. I want you to think a lot about which transcription factor binding profiles are there. I want you to recognize that there's a conservation analysis step going on in most of these tools to focus on regions of functional importance. I also want you to recognize that you can do exactly the same thing, not with conserved regions, but with regions that show up in your ChIP analysis, in open, accessible chromatin, or in co-activator-bound regions. So you can apply different filters to focus in on regions of interest, not just phylogenetic footprinting. That said, the rest of the examples are still going to use phylogenetic footprinting.

So the big question that we all have when we come in the door here is: how can I look at my set of genes? How can I do something that lets me interrogate, interpret, and study which transcription factors might be acting on my set of genes? The nice thing with sets of genes is that you can look for general patterns. You get some benefit from having groups of genes: the noise levels out as you look across the set, and the signals may become a little more obvious. So I'm going to take you through the process, and then we're going to work on a set of genes, and that's going to be the most important piece to take away from this morning.

You'll remember that our goal is to find, in sets of co-expressed genes, patterns that are abundant in one set and not so abundant in the other. This is an over-representation analysis problem, and over-representation analysis is the same thing you were doing yesterday with GO. We seek to determine whether a set of co-expressed genes contains an over-abundance of predicted binding sites for a given transcription factor, and we're going to use phylogenetic footprinting to reduce false positive predictions. There's one difference from the GO analysis, which is that we have two styles of over-representation analysis we can do. One type is the same: are there more genes with predicted transcription factor binding sites in the foreground set than in the background set? That's exactly the same as asking whether there are more genes with GO term X in this set than in that set — it's the number of genes with predictions here versus the number of genes with predictions there. But for transcription factor binding sites we also have the added possibility of asking whether the frequency of binding sites is higher than in the background: are there more binding sites per nucleotide, on average, in this set of genes than in the background? So in one case you're looking for a difference in rate, and in the other you're looking for a difference in gene annotations. Both are relevant for transcription factor binding site studies, and most tools will give you both statistics. So there are two scores: the first is the Fisher calculation you've heard about, which asks which set has more genes with binding sites, and the second is a binomial test, reported as a Z-score, which looks at the number of occurrences and asks whether the rate of predictions is higher in the first set of sequences than in the second. Is that clear? You have two scores: Z-scores are for the number of occurrences, and Fisher scores are for the number of genes with binding sites.
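Just to make the two statistics concrete, here is a minimal sketch in Python of a gene-level Fisher exact test and a rate-level binomial Z-score. The counts are made up for illustration, and this is not oPOSSUM's actual code.

```python
# A minimal sketch (not oPOSSUM's code) of the two over-representation
# statistics described above, using made-up counts for one TF profile.
from math import sqrt
from scipy.stats import fisher_exact

# Gene-level test (Fisher): how many genes carry at least one predicted site?
fg_with, fg_without = 12, 4        # hypothetical foreground genes with / without the site
bg_with, bg_without = 3000, 15000  # hypothetical background genes with / without the site
_, fisher_p = fisher_exact([[fg_with, fg_without],
                            [bg_with, bg_without]],
                           alternative="greater")

# Rate-level test (binomial Z-score): are there more sites per nucleotide?
fg_hits, fg_nt = 40, 80_000            # predicted sites and nucleotides scanned, foreground
bg_hits, bg_nt = 120_000, 900_000_000  # the same for the background
p_bg = bg_hits / bg_nt                 # background rate of predictions per nucleotide
expected = p_bg * fg_nt
sd = sqrt(fg_nt * p_bg * (1 - p_bg))
z = (fg_hits - expected) / sd          # standard deviations above the background expectation

print(f"Fisher p = {fisher_p:.3g}, Z = {z:.1f}")
```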
Okay, so I'm going to talk to you about a system called oPOSSUM that my lab built. There's another system called CONFAC that does the same sort of analysis. And there's a new one whose name I forget, but I'll try to put its link up so that you have that one to point to as well. You start by pasting in a list of genes — you have to give it a set of genes that you think are co-expressed or constitute some sort of functional grouping. The system calls out to the genome database — in this case it uses Ensembl, because that's computationally friendlier than UCSC — and gets back the orthologous genome sequences, with orthology as defined by Ensembl. It makes an alignment of those sequences using the ORCA aligner, which was in the tool that you just tried. It uses the JASPAR database to make predictions of transcription factor binding sites within those sequences. The oPOSSUM system is human and mouse, and in fact CONFAC is human and mouse too; I'm not sure about the others, but for oPOSSUM, as I'll show you, there's a yeast tool, there's a worm tool, and there's the human and mouse tool. If you're in flies, you're out of luck — sorry. But you can do some of these things directly by performing the analysis on the sequences yourself and doing the over-representation analysis. Because it's so computationally heavy, you need pre-computed data to do this efficiently in a web browser, so the web tools tend to cover a relatively restricted set of organisms. The significance calculations are performed in R, and then you get back a list of predicted mediating transcription factors.

These were just a couple of examples of very clean data sets — sets where we know co-regulation is occurring. Each of these genes is regulated by one or more of a group of transcription factors. The first slide here is looking at a group of skeletal muscle genes: genes that are turned on when C2C12 myoblasts differentiate into myotubes. In this gene list, 23 genes were submitted and it successfully analyzed only 16 of them. That means that for a portion of them, maybe there wasn't a defined ortholog, or maybe it wasn't able to make an alignment to the orthologous sequence. But it was able to perform the analysis with 16, and it comes back saying that the transcription factor binding profiles for the SRF factor, the MEF2 factor, the MyoD family, and the TEF-1 family are all up there at the top, which are the ones you would expect. And c-Myb, actually, has since been shown to be a repressor acting on these genes prior to differentiation. So the top five hits were all reasonable hits in the over-representation analysis, and then things get a little squishy. Similarly, this was a hepatocyte set: 20 genes were input, it successfully analyzed 12, and it comes up with the HNF-1 transcription factor at the top, then HNF-3 beta. Now, one thing that's particularly problematic for over-representation analysis — and in fact for all sorts of transcription factor binding site analysis — is interpreting which transcription factor is actually acting. This FREAC hit is a forkhead transcription factor profile. It's not one that's expressed or active in the liver, but the forkhead transcription factors all more or less bind to the same DNA sequence. Helix-loop-helix factors often bind to more or less the same DNA sequence. The MADS-box factors, SRF and MEF2, both have very similar A/T-rich binding sequences.
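The site-prediction step in a pipeline like this is just scanning a conserved sequence with a position weight matrix and keeping hits above a match threshold. Here's a minimal sketch of that idea with a made-up count matrix and sequence. It is not oPOSSUM's code, and real tools differ in how they handle pseudocounts, background frequencies, and thresholds.

```python
# Minimal PWM scan (illustrative only): convert a toy count matrix to
# log-odds scores and report positions scoring above a fraction of the maximum.
import math

counts = {  # hypothetical 4-column count matrix, JASPAR-style rows A/C/G/T
    "A": [20,  1,  1, 18],
    "C": [ 1,  1, 17,  1],
    "G": [ 1, 17,  1,  1],
    "T": [ 0,  3,  3,  2],
}
width = len(counts["A"])
total = sum(counts[b][0] for b in "ACGT")
bg = 0.25  # uniform background base frequency

# Log-odds score for each base at each position, with a small pseudocount.
pwm = {b: [math.log2((counts[b][i] + 0.25) / (total + 1) / bg) for i in range(width)]
       for b in "ACGT"}
max_score = sum(max(pwm[b][i] for b in "ACGT") for i in range(width))
min_score = sum(min(pwm[b][i] for b in "ACGT") for i in range(width))

def scan(seq, rel_threshold=0.85):
    """Return (position, score) for windows above rel_threshold of the score range."""
    cutoff = min_score + rel_threshold * (max_score - min_score)
    hits = []
    for i in range(len(seq) - width + 1):
        s = sum(pwm[base][j] for j, base in enumerate(seq[i:i + width]))
        if s >= cutoff:
            hits.append((i, round(s, 2)))
    return hits

print(scan("TTAGCATTTTAGCA"))
```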
So for some families of transcription factors, one profile may be representative of the binding sites of a vast number of members of that family. Can I ask a question? Yep. On what basis were those particular ones highlighted — the ones marked in red with the arrows? Those were known reference transcription factors that were known to regulate this gene set. This particular gene set? This particular gene set. So this — and Ron, for your benefit — this is just about interpreting scores. This was a few different reference collections, used to see what the Z-score threshold and the Fisher p-value thresholds might be for typical gene sets. As Quaid mentioned yesterday, these scores depend on the number of genes analyzed: if you analyze more genes, you might get consistently higher or more significant scores. These sets tended to be on the order of 30 to 40 genes per group, so these empirically determined suggestions for thresholds are based on that size — larger gene sets will behave differently. But this was just to show you that you do get some separation: in the NF-κB reference collection you find the NF-κB profiles, in the liver collection the liver profiles, and in the muscle collection the muscle profiles, each distinguished from most of the rest of the group. This is a more real-world data set — some SAGE data linked to c-Myc overexpression. You see that in this particular experiment, the genes that came up as differentially expressed had signatures for a Myc-Max profile and related Max and Myc profiles in the list, so you again see this idea of correlated profiles coming out as sets in the group. The oPOSSUM system that you're going to use reports Z-scores — essentially, how many standard deviations away from the norm you are — and the Fisher score, which you learned about yesterday. It doesn't report an FDR; we'll probably do that in the next iteration. FDR would be a good thing here, but it's not given.

OK, so coming back to this point: in many structurally related families of transcription factors, the members bind to similar sequences. These are binding profiles for different members of the ETS transcription factor family. You'll notice that there's basically a GGAA pattern that's predominant in the family — that one is GGAA on the reverse strand, so it's flipped. So the fact that you hit an ETS profile as being over-represented doesn't really tell you which member of the ETS family is acting on the gene, and the interpretation is harder because you then need to figure out which member of the family is there. This doesn't uniformly apply to all families. The zinc finger transcription factor family is the biggest and most notable exception: each member of the zinc finger family pretty much has its own binding specificity. The zinc finger family also constitutes about half of all DNA-binding transcription factors, and most of them have not been profiled. We're going to come back a little bit later and talk about what you might do in the cases where you have a transcription factor that's not profiled. That said, if you hit on an ETS profile, or a helix-loop-helix profile, or certain SOX profiles, you have an indication that a member of the family might be there, but you don't know which member to emphasize. So what can you do in those cases? Well, one thing you can do is download the list of all the members of that family. There's a tool called TF-Cat.
There's another one called DBD (the DNA Binding Domain database), and both have lists of all the members of these families. You can then take those lists and go to a tool like ToppGene, which was originally designed for genome-wide association or linkage studies, where you're trying to pick, from among a set of genes, the one that might be more closely involved in a process. So you'd say, OK, the system that I'm studying has something to do with muscle, or with bladder cancer, or whatever you like. ToppGene then prioritizes these genes, largely based on the literature, to see which members of this family have a connection to the term you're interested in. I'm not going to go further on this point today, but it is an issue, and it's not a perfectly clean solution.

So what have we learned here? That there are tools to interrogate the meaning of clusters of co-expressed genes. The slides got removed for this one, so let me just comment on this point. Lots of us do large genomics projects, and we do them for a variety of reasons, and we do our expression profiling in a variety of situations. Most of the time we don't design our experiments to interrogate the regulation of a set of genes. So when you are doing an experiment — if you are looking at two transgenic animals, one of which is missing a gene, and you're taking some cell type out of them and comparing the two — you have a lifetime of differentiation that's been affected. You're looking not just at the primary effects of your gene, but at secondary, tertiary, and later effects all acting together. So any gene expression signature you see is unlikely to represent one transcription factor or one set of transcription factors; it's going to represent lots of processes that have occurred over time. Likewise, if you're looking at 24-hour steps in cell culture, many events have happened over those 24 hours, and you're going to have a very complex set of genes responding. What you want for promoter analysis is experiments that precisely pin down sets of genes that are likely part of the same response. The ones that work well tend to be experiments where you're looking at a differentiation process and you've sampled the cells at several closely spaced time points — half-hour or hour steps along the way. Or you have a genetic system where you can do a transient knockdown or an inducible change, so that you can see a direct effect of one thing. Or a chemical system where you can put an inhibitor in and look at the immediate effects of that inhibitor on the process. These types of experiments tend to be more favorable to regulatory sequence analysis than long time intervals or big developmental transitions. Question? How do you choose when to look, or do you always need to do a time course analysis? In most cases you'll have something in the literature, and you're going to need some sort of time course analysis to see when things are happening along the way. Look at trends and look at groups: largely what you're looking for is some pattern of expression and a set of genes that follow that pattern pretty closely. You'd rather focus on a small group that's doing something very close together than on a large group of genes that are probably representing lots of different events.
So what about the situation where you have multiple effects going on at once, as we were just discussing? Yeah, so the nice thing with this over-representation analysis is that, compared to some of the other methods, it's much more tolerant of noise. You can even have multiple responses in the same system, and oftentimes you'll still pick out a signature within it. My experience is that if you add 50% noise to a gene set — so you randomly toss in genes to constitute 50% of the set — you can oftentimes still pull out the pattern, but not much more than that. So that's roughly the upper limit, and of course it depends on how clean your first set was; 50% is when you've got a pretty clean starting set. But when you're looking at tertiary and later effects, any individual regulatory program may only constitute 15% of the set, so it can be quite difficult. And what would be the minimum number of genes? My experience is that if you have fewer than 10 genes, you're never going to get anywhere with this — below 10 there's really no point in bothering. Generally the sweet spot seems to be somewhere between 20 and 50. If you have 20 to 50 genes that seem to be part of the same regulatory program — there's been a well-done expression study or a chemical inhibition, and you get 20 to 50 genes that constitute a real cluster in your data — then this method will usually work. I'd have to look to make sure that I'm on the same page. There have been lots of experiments done recently to look at sequences that are conserved over long evolutionary periods — for instance, taking human–Fugu conserved regions, testing them, and putting them into databases. But those are gene-specific resources rather than tools for interpreting sets of genes. So I'll have a look at that during the lab and see which one it is.

Okay, and then the final point was that the identity of the mediating transcription factor may not match the name on the profile that was used, but the class may still be relevant to you. Another approach that a lot of people find works reasonably well on this last point is to look at which transcription factors of a class are actually expressed in your cells of interest. Oftentimes, in your expression data, if you see, say, an ETS signature, an obvious question is: is there an ETS factor that is expressed in an interesting way in my system? That might be a good one to start with. Okay, so I think we've had lots of questions. So this is the big lab — do I hear a question? Okay, this is the big lab. What I'm going to do is run one set through oPOSSUM for you, and then, if you've got a data set that's suitable, you can go ahead and try your own gene list in the system and see how it goes. So I'm going to take you to that point and walk you through one set — just a minute. Okay, so I'm back at the workshop wiki page, and you'll see there's a link to the oPOSSUM system there, shown as item 11 with the double arrow. So go ahead and open up oPOSSUM. Come over to the oPOSSUM page. You'll see that there are four different variants of oPOSSUM. There's a human single-site analysis, which looks for single over-represented binding site patterns. There's the combinatorial site analysis, which is computationally extremely slow.
So we're not going to do that one today. You're welcome to try it out later if you're interested — what it looks for are over-represented combinations of sites. There's a worm one; I didn't double-check the worm one for functionality today, so hopefully it's up and running. And the yeast one should be working — I did check that. So we're going to focus on the human single-site analysis. Go ahead and click enter on human single-site analysis. Let me find my mouse. Okay, so you're going to give it a list of genes, human or mouse. You give it an indication of what the gene ID type is — Ensembl IDs, HUGO or MGI gene symbols, RefSeq IDs, or Entrez Gene IDs — and you paste in your list of genes. Let's do this a couple of ways. First we'll just take the sample gene set, just to show you how it works: click on use sample genes. Step two is to choose your profiles; it's using JASPAR, the same idea as in the ORCA TK exercise, where you can restrict the set. We'll just take the defaults for now. Then you can change your parameters — the same idea as before: the level of conservation (roughly the top 10% of conserved regions), the matrix match threshold that we've talked about, the number of results to display — let's give it a little more than 10, maybe 20 — and how to sort the results. It reports both the Z-score and the Fisher score, and you can sort the list either way; I'll just leave it at the default. I'll hit submit, and we'll see if we can kill the server. So what it's done is it has a pre-computed set of alignments and a pre-computed set of transcription factor binding site predictions. It takes the criteria you've given it, goes through and counts how many binding sites are in your genes, and compares that to the background — and the background is assumed to be the whole genome in this run. It's going to take a minute. Now, there is an advanced version where you can go in and plug in your own background, so that's an alternative if you have an array list — there is a version that will run with the array list as its background set. You still have to put in two different sets, right? One for the background and one for your foreground; if you want to define the background, you have to give it a background set. By default it runs with all genes for which there is a mapped ortholog between human and mouse. And here it's critical to pick the right ID type — it's not a very smart system, so you pretty much have to tell it the right ID. It's based on Ensembl underneath, so if you have Ensembl IDs, those are your best bet, because you know there won't be mapping problems along the way; otherwise it works with whatever Ensembl has mapped for the gene symbols. While we're waiting for this to come through: the custom analysis tab, which you can see at the top of the screen, is where you can go to set up your own background set, and you can also adjust some of the parameters more freely. For those of you who are computer programmers, you can do this directly through an API — you can call into the database and do this programmatically rather than going through the web interface, if you want to run it a lot. And if you want, you can take the whole pre-computed database of all the binding sites and all the conserved regions away with you and work with it. And this poor server is not happy with all of us hitting it at the same time.
So while that's churning, I'm going to go ahead and set up one more task. I'll open a separate oPOSSUM window and do another single-site analysis run, but this time I'm going to use my own gene list. In the data files, I've got a gene list — the Mod 3 gene list, down here — and I'm just going to take that set and run it through. In this case it's a gene symbol list rather than an Ensembl ID list, so I'll select HUGO gene symbols and paste them in. Yep, that's the one I'm running. If you have your own gene list and it's human or mouse, you can try it out here; or if you have a yeast or worm set, you can go to the corresponding oPOSSUM system for that species. I'm just going to take all the default parameters, and as long as we're killing the server, let's finish it off and have it run that too. Say that again? So the question is: can we just examine the gene list itself for transcription factors? It depends on your experiment. Sometimes the transcription factors that are regulating your set are already there — they may not be highly differentially expressed in your set. Sometimes you're interested in potential silencing of your gene set, so you might see signatures of factors that are turned on later in the process showing up in your set. So it's not guaranteed, but in general it is interesting to note which transcription factors are in your list. There are archives listing all the DNA-binding transcription factors — two major ones. There's DBD, the DNA Binding Domain database I guess it's called, which is maintained, I think, at the EBI; if you search for DBD and transcription factor, you'll find it, and they have lists of all the transcription factor genes for different species. And then there's one called TF-Cat, which has human and mouse transcription factors. So here are the analysis results for the second run — the first one hasn't come through yet, which is interesting. This was the Mod 3 gene list. What you see is the list of genes that it successfully analyzed from that list; in this case they all worked, but there will usually be some in your list that don't. Then it gives you the list of profiles that come up as being over-represented: the TF class, the TF class supergroup (which you don't really need to know about), and the information content, which is a measure of how strong the pattern is in that profile. Then you get the number of genes in the background that had that binding site and the number of genes in the background that didn't; the number of genes in your list that had that binding site and the number that did not. Then you see the total number of background hits and the background rate — that is, how many hits per nucleotide there were — and the rate in your target list, your foreground list. Then you see your Z-score and your Fisher score. What you'll see here is that this list is enriched for these homeodomain transcription factors, so there seems to be a signature suggesting a potential role for homeodomain transcription factors. You can pop up the pattern of the binding site by clicking on the name of the transcription factor. There's the first one from the homeodomain list, and then you can pop up the next one, and you'll notice that they all have this TAAT pattern at the core. Another one — so they do look very similar.
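The information content column mentioned above is just a summary of how sharp the matrix is. Here's a minimal sketch of how it is typically computed from a position frequency matrix (2 bits minus the Shannon entropy of each column, summed over columns). The matrix values are made up, and details such as pseudocounts or background correction vary between tools.

```python
# Illustrative information-content calculation for a toy position frequency matrix.
import math

pfm = {  # hypothetical counts, rows A/C/G/T, columns are motif positions
    "A": [ 2, 18,  0, 20],
    "C": [ 1,  0,  0,  0],
    "G": [16,  1,  0,  0],
    "T": [ 1,  1, 20,  0],
}
n_cols = len(pfm["A"])

total_ic = 0.0
for i in range(n_cols):
    col_total = sum(pfm[b][i] for b in "ACGT")
    entropy = 0.0
    for b in "ACGT":
        p = pfm[b][i] / col_total
        if p > 0:
            entropy -= p * math.log2(p)
    total_ic += 2.0 - entropy   # 2 bits is the maximum for a 4-letter alphabet

print(f"information content ≈ {total_ic:.2f} bits")
```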
So you begin to think that there might be some signature of a homeodomain transcription factor. Compare that with this HMG transcription factor — it looks a little bit different. So this sort of analysis would lead you to think: well, maybe there's something to do with homeodomain transcription factors in regulating my list; maybe I should take a closer look and see if there's a homeodomain transcription factor that might be consistent with the biology I'm working on. I'm going to kill this analysis, I think. Now, these scores are very high — these are extremely high Z-scores — probably because this list is a little bit tailored for demonstration purposes. And again, both the Fisher scores and the Z-scores will have different interpretations depending on the size of your list. The general default is that if you're looking at 20 to 50 genes, Z-scores over 10 and Fisher scores under 0.01 are interesting. Sure — it's also shown in that figure with the empirical plot with the dashed lines on it, a few slides back. Just up here — that one. In general there seems to be a pretty good separation between the background rate of predictions and the known regulators, which tend to fall above 10 for the Z-scores and below 0.01 for the Fisher scores. But those are sets of roughly 20 to 50 genes. Yeah, we don't transform the Z-score because there are some issues around the distribution that we're not completely comfortable with, so we weren't quite ready to convert it to a p-value. We use it as a scoring step, but we didn't want to convey a sense of a p-value for that particular score, whereas with the Fisher score we feel a little more comfortable calling it a p-value. It's not corrected, though, the Fisher p-value — there's no Bonferroni correction or anything else applied. So hopefully you're in there giving it a try. I'm going to give you about 10 or 15 minutes to try oPOSSUM on your own set, or use it on the sample set; the alternative would be to look at your favorite gene and try it out in some of the promoter analysis tools. So take about 10 or 15 minutes and work away. I'll be around the room, so feel free to grab me.

Now I'm going to try to give you the briefest of introductions to how these pattern discovery tools work. It's a good thing Quaid's not in the room, because he'd probably object to some of the terminology I'm going to use here, but I'll give you the main idea of what's going on. Same idea: a set of genes goes in, and patterns come out. And this is why I'm normally on a Mac and not a PC, because I just ate my slide. So there are loosely two ways of doing this: one is built around strings, and one is built around quantitative matrices. Same as before, you can think about searching for a consensus sequence versus matrix profiles of patterns. I admit to a very strong bias toward the latter, because I'm interested in quantitative patterns and I don't think strings get me very far. That said, strings have been enormously successful in yeast and in prokaryotes, where you have short patterns to look for — if you're in those worlds, string-based methods are probably the quickest. The string-based method of choice right now seems to be Weeder, so that's the one that's been performing well, and there's a link on the wiki to the Weeder system. And for the profile-based methods, there's a variety of them.
For those, I'm going to point to MEME in particular as the convenient example; MEME is an oldie but a goodie that's been around for a long, long time, and it uses a slightly different method than the one I'm going to describe to you. So I'm going to give you overviews of both of those approaches in simple terms, and I want you to recognize that there are lots of extra refinements in the real tools beyond what I describe. String-based methods are essentially the same as what we've been doing: instead of looking at transcription factor binding sites or GO terms, string-based methods look at strings. They count how many times each possible pattern occurs in your gene set versus in the background. So the method determines whether finding X occurrences of a word in a set of sequences is significant, given the background sequence — exactly the same statistical problem as before with GO terms and with binding sites. The way this works is that it builds a lookup table: for each possible pattern, how frequently does that pattern occur in the background set? For instance, the pattern AAACCTTT occurs 456 times in the background promoters that have been analyzed, and the pattern TTTTTTTT occurs 57,788 times. Now you take your foreground set — the group of genes you're analyzing — and say: OK, I found a run of T's; is that significant? You compare it to the 57,788 occurrences in the background and determine that it's not significant. Or you find the same number of AAACCTTT occurrences and apply exactly the same method. It's just doing that for every possible string. Now, in the old days I used to say it worked for strings of length about 6, because computers were slow; now I can say it works on strings of length about 12, with degeneracy — computers are no longer the problem. These methods are fast. They look not just for single exact strings like this; they usually have degeneracy codes to look for how frequent each degenerate pattern is, so they're doing quite elegant analyses for consensus sequences. But how do they know where to look? They just take every possible string and scan across your whole sequence. So when I say it works well for yeast, that's because you only analyze 1,000 base pairs or so per gene; as soon as you plug in something large, you can imagine it takes a long time to go through all those strings. Question: why is the practical limit about 12? Once you get much over 12, the computer processing time takes forever — it's an exponential explosion. You're limited by time, by cycles, by the speed of the computers; it's not RAM, this doesn't actually take much memory. The problem is that when you consider all possible combinations, each extra position multiplies the number of strings by a factor of 4, so it's an exponential increase, and by the time you get up to about 4 to the 13th it's going to take a long time. 4 to the 12th still works reasonably well, and I'm sure with supercomputers some people have pushed it a little bit further, but it does become a problem. OK, so it's just doing the same thing: it gives you a score for each string — the same kind of statistic you learned yesterday, corrected for the mean and the variance — and reports the score back. So there are some limitations on word length, though not as much as there used to be.
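Here's a minimal sketch of that word-counting idea — not Weeder's actual algorithm — where every k-mer in the foreground is counted, compared against a background count table, and given a Z-like enrichment score. The sequences and counts are made up for illustration.

```python
# Minimal sketch of the string-counting idea behind tools like Weeder
# (not their actual algorithm): count every k-mer in the foreground,
# compare to a background count table, and score the enrichment.
from collections import Counter
from math import sqrt

def kmer_counts(seqs, k):
    """Count every exact k-mer occurrence (no degeneracy) in a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def enrichment(fg_seqs, bg_counts, bg_total_positions, k):
    """Z-like score per word: observed foreground count vs. expectation from the background rate."""
    fg_counts = kmer_counts(fg_seqs, k)
    fg_positions = sum(len(s) - k + 1 for s in fg_seqs)
    scores = {}
    for word, obs in fg_counts.items():
        p = bg_counts.get(word, 1) / bg_total_positions   # background rate of this word
        exp = p * fg_positions
        scores[word] = (obs - exp) / sqrt(exp * (1 - p))
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy usage with made-up sequences; a real run would use genome-wide promoter counts.
background = ["ACGTACGTTTTTTTGCGC" * 50, "TTTTTGGGGCCCCAAAAA" * 50]
foreground = ["GGAAGTTTAAGGAAGT", "CCGGAAGTACGGAAGT", "TTGGAAGTCAGGAAGT"]
bg = kmer_counts(background, 6)
bg_total = sum(len(s) - 6 + 1 for s in background)
print(enrichment(foreground, bg, bg_total, 6)[:5])
```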
I have a personal bias that I'm not really fond of degeneracy codes, because I don't think they quantitatively reflect the patterns that are there. String-based methods, though, have been extremely successful for microRNA target sequences — microRNAs seem to be much more amenable to string-based methods than the other approaches. So if you have a bunch of UTR sequences you want to analyze, I'd go use a string-based method. MicroRNAs don't have much variability in their target sites, so they tend to have the same nucleotide at each position, and a string-based method is very much in line with the nature of the hybridization involved. But I'm biased toward promoters and transcription factors, so I'm going to be biased toward a slightly different way of doing things — that's just fair warning about my bias. OK, so now I'm going to talk about how to look for patterns that are profiles, or matrices, and I'm going to introduce you to what's called the Gibbs sampling-based procedure for pattern discovery. Gibbs sampling is used for lots of different problems, but for motif discovery there's a specific way of applying it, and I'm going to hide a lot of the layers of complexity and just give you the essence of what it's doing. Same idea: we want to find a local alignment — a matrix can be thought of as a representation of an alignment — so we're going to find a local alignment of some fixed width that maximizes the information content. There are a few different ways of doing this, and we've talked about the motivation for it. We're going to do this in a probabilistic framework for the example I'm going to give you, which means you're going to get somewhat different answers each time you run the program, because there's a stochastic element to it, but most of those patterns should end up being similar. And what this ultimately means is that we're going to guess our way to the answer. Are you ready for some guessing? OK, let's see if I've got the next slide. This is not my favorite slide. Here we go. So here's Gibbs sampling in a nutshell — you can go to some of the other sites for more information. The first thing we do is guess where the binding sites are located, and I really do mean guess. Let's say the sites are located here, here, here, and here. The chance of getting the right ones is pretty much zero. What we're going to do is take those guessed sites — we might wiggle them a little bit — make an alignment out of them, and build a profile. The profile is going to be a bunch of noise. But now what we do is throw out one of those guesses: we take one sequence out, and we remove its site from the profile too. Now we try to figure out where the pattern is located in that particular sequence, and the way we do that is to take the profile and score each position in the sequence for how well it matches the profile we have. Most of the time this is meaningless — we've got a noisy profile, it doesn't mean anything. But what happens is that once you get one genuinely relevant site into the profile, it biases you toward finding similar sites in the other sequences. It's not a very strong bias — it biases you a little bit. And once you get two real sites contributing to that profile, it biases you even more. So what happens is that very quickly it starts zooming in on the pattern that is actually there.
In a true Gibbs sampling methodology, you usually sample the new position in proportion to the scores; in an EM method like MEME, you take the maximum, the best-scoring site. So now you replace that sequence's site, placing it at that high-scoring position, and you repeat the process over and over and over again, maybe 5,000 times, until the system doesn't really change the sites anymore — it stabilizes on a set of sites, or at least it stabilizes on the quality of the pattern that it has. Now, there's a problem with this. The problem is that in that first stochastic step, you can lock onto something that is a pattern, but isn't the best pattern in the set. So not only do you want to cycle as many times as I described, you also want to repeat the whole process from the beginning many thousands of times. You let the computer do the guessing 5,000 times and let it do the searching 5,000 times for each guess, and then you report back, perhaps, the strongest pattern, or the few strongest patterns, that emerge from that analysis — where the strength of a pattern is essentially how striking it is, its information content. That's what most pattern discovery tools are doing underneath, and it has by and large been shown statistically that if you do this enough times, you will find the optimal pattern. Are there bounds? There are issues around the detectability of the pattern that's there. Technically, the program will often tolerate quite long sequences, but you have to recognize that you're trying to find a pattern against a background of noise: the more sequence you give it, the more noise you're giving it, and the more patterns will arise by chance. So I'm going to give you an example. Take a simple case: just give it a bunch of binding sites — this is the case for MEF2, say — and let the pattern discovery procedure run on it. It should do really well: there's nothing else there, so it's going to find the right pattern. Now we tack some randomly selected promoter sequence onto the edges of those sites, more and more, until we lose the ability to recover the pattern we're looking for. So the question is: how much sequence can you add onto the edges before you lose the capacity to pull out the pattern you want? It depends on the information content of the pattern you're looking for. For a very strong pattern you can have a lot more flanking sequence; a very weak pattern is going to disappear with even short flanks. You have the slides. It's really not much of a guessing game — most people think you'll be able to pull these patterns out of much more sequence than you actually can. The reality is that MEF2 is a strong pattern, and by the time you've added 500 base pairs of flanking sequence, you've lost your ability to recover MEF2 relative to the patterns found in random controls. So blue is with the binding site present, pink is with a random set of sequences, and what you see is that up to 100 base pairs you're basically always hitting the right site; then it starts bringing in some less-good sites, and by the time you've added 500 base pairs of flanking sequence, you're not recovering the pattern. So that's pretty limiting — you have to be pretty focused about what you're working on. It works pretty well for ChIP-seq regions, because those regions can be relatively short.
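To make the guess-and-refine loop concrete, here's a heavily simplified Gibbs-sampler-style motif finder in Python. It's a sketch of the procedure described above, not the actual algorithm in any published tool: it fixes the motif width, uses crude pseudocounts, and skips the many restarts and the background modeling a real implementation needs.

```python
# Simplified Gibbs-sampling motif search: guess site positions, build a profile
# with one sequence held out, rescore that sequence, resample its site, repeat.
import random

BASES = "ACGT"

def build_profile(sites, width):
    """Column-wise base probabilities from the current guessed sites (with pseudocounts)."""
    profile = []
    for i in range(width):
        counts = {b: 1.0 for b in BASES}          # pseudocount of 1 per base
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        profile.append({b: counts[b] / total for b in BASES})
    return profile

def window_scores(seq, profile, width):
    """Score of every window in seq under the profile (product of column probabilities)."""
    scores = []
    for i in range(len(seq) - width + 1):
        p = 1.0
        for j, base in enumerate(seq[i:i + width]):
            p *= profile[j][base]
        scores.append(p)
    return scores

def gibbs_motif(seqs, width, iterations=2000, seed=0):
    rng = random.Random(seed)
    # Step 1: random guess for the site position in every sequence.
    positions = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        # Step 2: hold one sequence out and build the profile from the rest.
        held = rng.randrange(len(seqs))
        others = [seqs[k][positions[k]:positions[k] + width]
                  for k in range(len(seqs)) if k != held]
        profile = build_profile(others, width)
        # Step 3: rescore the held-out sequence and sample a new site position
        # in proportion to the scores (an EM method would take the maximum instead).
        scores = window_scores(seqs[held], profile, width)
        positions[held] = rng.choices(range(len(scores)), weights=scores, k=1)[0]
    return [s[p:p + width] for s, p in zip(seqs, positions)]

# Toy usage: an 8-mer planted inside random flanking sequence.
def random_dna(n, rng):
    return "".join(rng.choice(BASES) for _ in range(n))

rng = random.Random(1)
planted = "TTAATTAA"
seqs = [random_dna(40, rng) + planted + random_dna(40, rng) for _ in range(10)]
print(gibbs_motif(seqs, width=8))
```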
You have a couple of hundred base pairs in a ChIP-seq region, so it works OK there. Another thing that helps with transcription factor binding sites is that oftentimes there are multiple sites for the same transcription factor in a sequence, so the patterns are a little stronger than in this example, which had one site per sequence. But it is not great — that said, oftentimes people want to give it a try. So if you're set to give it a try, let's go and do it. One thing I'll mention quickly: the score on the y-axis, the one going away from you, is the similarity score between the profile recovered by the pattern discovery tool and the known profile. And this is just like doing a BLAST analysis: you've recovered some pattern, and you'd like to see which pattern in the database is most similar to the one you recovered. There are a couple of different tools for doing this type of comparison; the one I've recommended to you is called Tomtom, and Tomtom is part of the MEME suite. You can go to Tomtom with a frequency matrix, paste it in, and it will do a comparison against the JASPAR database to see which profile it most resembles. So if you want to give it a try — it does take MEME a little while to come back, so let's just set it up quickly here. In the wiki there's a gene list with sequences; I pulled out the promoter sequences for you. For your own set, you would need to go to BioMart and use it to pull out, for instance, 500 or 1,000 base pairs adjacent to your genes of interest. Whether you use 1,000 or 500 depends on the strength of the pattern and on the number of occurrences of the pattern; 500 is a good first pass. So here are some gene lists, and here are the sequences I've given you. If you want to try it on a set, just copy that. Then in the wiki there's a link to MEME, and you can go off to there; there's also a link to Weeder if you want to try it in Weeder. The MEME suite has a bunch of tools; we're just going to start with MEME itself for our analysis. It's kind of a slow process, so it's an email-based response. You can either upload a file or just paste in your sequences, and then there are some settings. Generally I'm going to say just take the defaults, but I'll tell you what they're doing. This first setting controls how many occurrences of the pattern to expect per sequence: one per sequence requires that there be at least one occurrence of the pattern in every sequence it analyzes; zero or one says it's OK for a sequence not to have the pattern; and then there's any number of repetitions. MEME was originally designed for protein analysis, and protein motifs tend to be longer, so if you don't change the width setting, it will work pretty hard to look for long patterns. For binding sites, you can probably limit yourself to something on the order of 15 long and you'll recover most of what's there. Down below, you can give the job a name so that if you're doing multiple runs, you know which one you're getting back. You can put in your own background model — that's beyond what you're going to want to do, but you can try to correct for background characteristics. And then you choose how many motifs it will look for; just take the defaults. So you start the search, it goes off and works on it, and it will send you an email when it comes back.
I'm just going to do one more thing with MEME while this is running. Go back to the wiki page and take a profile that's come back — actually, this is a STAT binding profile. This is a position frequency matrix, with A, C, G, T going down the rows. I'm just going to compare it to the database of profiles, so you can see how to do a comparison once you get a profile back in your MEME report. So go back to MEME; in the MEME suite there's the Tomtom program. You go to Tomtom, paste your pattern into the box, and it compares it against JASPAR. You can choose different scoring systems; I have a personal preference for this one, but you can use the Pearson correlation coefficient, which is what most people do. Then you start your search and it runs the comparison. This is the pattern I gave it, down at the bottom, and this is the JASPAR profile it found to be most similar. This was supposedly a STAT transcription factor binding profile, though I didn't confirm that, so it could be an annotation problem — and the match is a profile in JASPAR called MA0051, which is labeled IRF2. This is reporting all motif matches with a score better than 0.5 on their scoring system, so it's a thresholded procedure; I didn't notice in the input parameters whether there was a way to control that threshold, but if there's more than one hit, it will report them. Let's just click back here and see if there was a threshold you could set on that. It doesn't look like it, so you pretty much get whatever passes their threshold. Anyway, when you have a profile that comes up — whether it's from your ChIP data or from a promoter analysis like this — Tomtom is a convenient tool for comparing it to the database and seeing if it looks like anything else. So go ahead and give that a try. If you've got a gene set, try to pull out the promoter sequences and feed them into MEME, and you can try Weeder as well if you want. I think I'll go around helping people rather than doing a Weeder example, but just to point you in the right direction: this is Weeder version 1.3, about midway down the page, and you go here to start inputting data. It's basically the same input — FASTA sequences, and you choose a species, because Weeder is a string-based method, so it has to have a pre-computed database of counts to work with. Then let it chew away, and it will send you an email. So that's pretty much it for what I was trying to walk you through. I hope you're having some success with the promoter analysis. I'll try to hang around for a while during the lunch break, so if you have questions, I'm happy to help you out.
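Since the matrix-comparison idea came up a couple of times, here's a minimal sketch of comparing two position frequency matrices by the Pearson correlation of their aligned columns — the same basic notion Tomtom offers as one of its scoring options, though this is not Tomtom's implementation and it ignores offsets, reverse complements, and statistical significance. The matrices are made up.

```python
# Illustrative column-wise Pearson correlation between two equal-width
# position frequency matrices (not Tomtom's actual algorithm).
from math import sqrt

def normalize(pfm):
    """Convert a count matrix (rows A/C/G/T) to per-column base probabilities."""
    width = len(pfm["A"])
    return [[pfm[b][i] / sum(pfm[x][i] for x in "ACGT") for b in "ACGT"]
            for i in range(width)]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def pfm_similarity(pfm1, pfm2):
    """Mean per-column Pearson correlation between two matrices of the same width."""
    cols1, cols2 = normalize(pfm1), normalize(pfm2)
    return sum(pearson(c1, c2) for c1, c2 in zip(cols1, cols2)) / len(cols1)

# Toy matrices: a GGAA-like profile and a slightly noisier version of it.
query  = {"A": [1, 1, 18, 19], "C": [1, 1, 1, 0], "G": [17, 18, 0, 0], "T": [1, 0, 1, 1]}
target = {"A": [2, 3, 15, 16], "C": [2, 1, 2, 1], "G": [14, 15, 1, 1], "T": [2, 1, 2, 2]}
print(f"similarity ≈ {pfm_similarity(query, target):.2f}")
```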