Hi, I'm Michael Hoffman, and I'm going to talk to you about gene regulation and motif analysis. By the end of the lecture today, you're going to understand the challenges in predicting transcription factor binding, be able to identify binding sites for known transcription factors, and be able to discover binding motifs using various software. We'll focus on eukaryotic transcription, and I'll cover a number of topics related to it. So here's an oversimplified model of what transcription is. You have a stretch of DNA, and there is a region in the DNA that is a transcription factor binding site. The transcription factor binds to the DNA at the binding site. It recruits RNA polymerase II. RNA polymerase II produces RNA, and downstream of that you get function of some sort: it will change a cellular activity, change the way the cell acts, all sorts of possible things. In reality, of course, this is much more complicated. In humans, there are more than 1,600 transcription factors. Sometimes they act in concert, they cooperate, they compete. Understanding how this works for even one transcription factor is still quite a research and computational challenge. There's a lot more that happens than just having, say, a single transcription factor binding site. If you have a particular gene, let's have a transcription start site of the gene right here. Sorry, this isn't working. There we go. So let's say you have the transcription start site right here. There are a variety of transcription factor binding sites that might be nearby, and things can also be affected by distal transcription factor binding sites. The way distal transcription factor binding sites are connected to transcription start sites that are far away is often through the 3D organization of the genome.
Often things that are considered distal and are far away along the length of a chromosome can actually be pretty close in three dimensions. I keep hearing a little echo whenever anyone comes into the room. So people have started looking at various other factors that can affect whether a transcription factor is going to bind at a particular position. Here's a diagram of a chromosome, zooming in through several orders of magnitude until you get down to the level of individual nucleosomes, individual DNA base pairs, and so on. People have developed a lot of different assays where they can experimentally assess different aspects of the chemical and physical properties of the genome at particular places. For example, you have techniques like DNase-seq, ATAC-seq, and FAIRE-seq that can tell you where regions of open chromatin are. These are often the regions where transcription factors can actually bind, which can be very important. Techniques like ChIP-seq will tell you which parts of particular histones are modified in particular ways, like methylation and various other modifications that can happen. There's also ChIP-seq for finding the transcription factors themselves. And then RNA-seq will tell you where all of the genes that are being transcribed are. You can also use RNA-seq in a variety of different ways that will tell you about what's happening in different subcellular compartments and look at things other than just messenger RNAs. The ENCODE project was something that started 13 years ago or so. NIH decided to look at a number of human and mouse cell types, and there were also some extensions to worm and fly, to see how much we could figure out about the context of the genome in particular regions.
So, in particular cell types, can we see the difference in which transcription factors are bound and which regions of chromatin are open? From that, we can create a model of gene regulation that includes not just where the genes are, but also things like where cis-regulatory elements are and where long-range regulatory elements, things like enhancers, are. ENCODE published their results across a couple of different phases. I'm showing you here journal covers from both the first and second phases of ENCODE. There have now been two other phases of ENCODE that have produced ever more information on what is going on, mostly with transcriptional regulation in the cell, and also some translational regulation. You can find all of the data that they've produced free on the web at encodeproject.org, or you can use things like the UCSC Genome Browser or Ensembl, which all have access to these various sorts of data. So that gives you an overview of transcription and transcriptional regulation overall. Now I'm going to delve quite deeply into one part, the simplest part, going back to my oversimplified version of things at the beginning, which is: how do you model the binding of a transcription factor to individual transcription factor binding sites? Before I go on, does anyone have any questions? OK, and by the way, if you have any questions at any time, feel free to either speak up or use the Zoom chat, which I'm trying to keep available over here on the right. I can't also do the Slack chat at the same time, unfortunately. So if you have any questions you want me to answer in the middle, please feel free to type them into the Zoom chat. Hi, Michael, Francis here. So we've actually disabled, or are not using, the Zoom chat. We are following the Slack chat and will interrupt you.
The instructors will either answer the question directly or interrupt you and ask you. All right. And then at the end of the class, you can check the Slack to see if we got anything wrong or forgot to answer something. Yeah, unfortunately, I don't think I have the bioinformatics.ca Slack on my computer here. OK, anyway, we're there to cover for you. OK, awesome. Yeah, go ahead. Sorry. So, teaching computers to find transcription factor binding sites. Let's talk about what transcription factors recognize. You might find here an example of a single site for a transcription factor, and there are a variety of both in vivo methods, like ChIP-seq, and in vitro methods you can use to figure out what sort of site a transcription factor likes. The thing about transcription factors is that most of them bind to degenerate patterns. There won't be just one sequence of DNA that a transcription factor binds to. There's going to be a whole set of binding sites that you might identify through an experiment. You can see here an example of some of these. In these 30 or so binding sites that have been identified for this transcription factor, there are places where things are the same. For example, column four here is always a T. Column five is almost always a T; there's one case where it's not a T. Then there are other columns, like column one, which is all over the place: A, C, G, C, G, C. I don't think it's ever T, though. So these are quite complicated patterns that a transcription factor will recognize, and there are a variety of ways of representing them. One way you can represent them is by using a consensus sequence.
The International Union of Pure and Applied Chemistry came up with an extended alphabet that allows you to represent any combination of DNA bases. You might have seen that R means either A or G, and Y means either C or T. There's a whole series of these for all sorts of other combinations of bases, and even little mnemonics that you can use to remember them. So V is what comes after T slash U in the alphabet. That means not T slash U, so it actually means A or C or G, and so on. There are other mnemonics you can use to learn different parts of these. This does not really summarize what we have over here on the right very well. For example, in this first column you have V, which means either A, C, or G. Whereas this fifth column we have represented as W. Technology, right? OK, let's try this again. This fifth column is represented as W. W means weak, A or T. But this column, as I mentioned earlier, is only A once. It's almost always T, and you're treating it kind of the same as this one, which is A, C, or G all over the place. So people have come up with more complex models to represent how a transcription factor works, and most of these are based on something called a position frequency matrix. What you do to create a position frequency matrix from a set of binding sites is fairly simple. For each column in your aligned binding sites, you simply count up the number of times you see each base. So you see C one, two, three. OK, it looks like this PFM doesn't actually correspond exactly to the set of binding sites, because we actually see C four times here. So this should be four. I think the G here we see three times. No, we see it four times also. G we see four times, and so on and so forth.
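As a rough illustration of the counting just described, here is a minimal Python sketch that builds a position frequency matrix from aligned sites and derives an IUPAC consensus from it. The five example sites and the function names are hypothetical, invented for illustration; they are not the sites on the slide. The consensus here simply records every base observed in a column, which is one possible convention.

```python
BASES = "ACGT"

# IUPAC nucleotide codes, keyed by the set of bases each code stands for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def position_frequency_matrix(sites):
    """Count how often each base occurs in each column of aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("aligned sites must all have the same length")
    pfm = {base: [0] * width for base in BASES}
    for site in sites:
        for column, base in enumerate(site.upper()):
            pfm[base][column] += 1
    return pfm


def consensus(pfm):
    """One IUPAC code per column, covering every base seen in that column."""
    width = len(pfm["A"])
    return "".join(
        IUPAC[frozenset(b for b in BASES if pfm[b][column] > 0)]
        for column in range(width)
    )


# Hypothetical aligned binding sites (not the ones from the lecture slide).
sites = ["CATTT", "GCTTT", "ACTTT", "CATTA", "GCTTT"]
pfm = position_frequency_matrix(sites)
for base in BASES:
    print(base, pfm[base])
print(consensus(pfm))
```

In this toy example the last column is T four times out of five and A once, yet the consensus still writes it as W, the same loss of information the lecture points out. The counts in the matrix keep that distinction.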
You can go to column four here, which you can see is always T, each of these 21 times, and column five, which is T all but one time, and so on and so forth. So this is a pretty good way of taking any set of aligned binding sites and turning them into a very simple model. The model is a little too simple, though. I'll tell you why that is in a second. But one other thing you might want to consider is that this is not very easy for a human to look at and interpret. So people often represent what a position frequency matrix or similar matrix looks like using these sequence logos. Many of you have probably seen these before. Probably fewer of you know that under the hood, this is just a representation of some sort of matrix model. You can see things like the fact that column four is always T. Column five is T most of the time; it's A a little bit of the time. This sort of representation makes it much easier for us as human scientists to see which are the important parts of the motif of the transcription factor and which parts are not very important, because you can see that they are smaller, which means that there's more variation in those positions. In reality, the sequence logos you'll see in a paper usually are not directly from a position frequency matrix. We convert them into something called a position weight matrix first, because that's something that is more generalizable to many other contexts. I'll show you how you do that here with a much smaller PFM as an example. This is one that was derived from only five counts, and it has only five columns as well. You start with the position frequency matrix for any given base and given position within the motif, and you apply a number of corrections to it.
The first thing you do is correct for the nucleotide frequencies in the genome. Various genomes are either AT rich or GC rich. If you are looking in an AT-rich genome, it will be less surprising for you to see a string of A's and T's. Dividing by the frequency of that base is a way of correcting for that. The second thing that we do is weight for the sample size of the position frequency matrix. Essentially, we want to account for the fact that if we had, say, 30 observations and got particular frequencies from them, that would give us more information than only the five we have here. That's the intuition we want to add in. The other aspect is that we're creating a probabilistic model with these. And the thing about probabilistic models is that when you have any zeros in them, as you know, zero multiplied by anything is going to be equal to zero. So you would eliminate ever seeing C, G, or T at that position in any match of this motif if you left these at zero. We don't actually want that. We want to indicate that, say, an A is much more likely and the other three are less likely, and we can do that just by taking these zeros and adding a pseudocount. So how do you know which pseudocount to add? I will say usually people add one. That's the minimum you can add that is a nice integer, and it makes nothing zero within the matrix. If you added that here, you would get six, one, one, one, three, four, one, two, three, two, two, and so on. You just add one to every cell in the matrix. That one is weighted against the number of initial observations you had. So again, if you had 100 observations to start with, that pseudocount of one you're adding on isn't going to change things very much.
If you have five, you should be way less confident in the model you're generating from just those five, and it will change things more. The third thing that we're going to do is take the log of the results of steps one and two. That converts the position frequency matrix into something called a PWM, a position weight matrix, which can also be called a position-specific scoring matrix. Taking the log mainly makes it easier for computers to deal with. You might think that computers are very fast at multiplying, but they're actually way better at adding. If you are scanning a genome of three billion base pairs for some motif, it is a big advantage if you can make sure that you are just adding. You do the log transform in advance, and then you just have to do a bunch of adding. You also don't have to worry about various sorts of underflow errors, in case anyone here is concerned about that. So here's an example of how you'd use a position weight matrix to score a particular region of the genome against the motif for a transcription factor. We want to score T C T G C T G. All we do is look at the appropriate column in the position weight matrix and find the right row for each column: T, C, T, G, C, T, G. You take all of those and add them up, and then you get a score for the sequence's match to this position weight matrix. Very easy. Of course, then we have the question of what a score of 0.9 actually means. So there are additional layers on top of this that make it easier to scale a particular match. One thing that we will do is take the sort of raw score that I showed you in the previous slide. Here you can see another example: G G G G G G C, etc.
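The three corrections described here (background frequency, pseudocount, log) and the add-up-the-cells scoring can be sketched as follows. This is a minimal sketch under stated assumptions: a uniform background of 0.25 per base, a pseudocount of one in every cell, and log base 2. The small matrix is hypothetical and only loosely modeled on the five-site example; the function names are invented for illustration.

```python
import math

BASES = "ACGT"


def position_weight_matrix(pfm, background=None, pseudocount=1.0):
    """Convert counts to log2 odds against background, with a pseudocount."""
    if background is None:
        # Assumption: uniform base composition. A real genome would differ.
        background = {base: 0.25 for base in BASES}
    width = len(pfm["A"])
    pwm = {base: [0.0] * width for base in BASES}
    for column in range(width):
        # Total observations in this column plus one pseudocount per base.
        total = sum(pfm[b][column] for b in BASES) + pseudocount * len(BASES)
        for base in BASES:
            probability = (pfm[base][column] + pseudocount) / total
            pwm[base][column] = math.log2(probability / background[base])
    return pwm


def score(pwm, sequence):
    """Sum the matrix cell for the observed base at each position."""
    width = len(pwm["A"])
    if len(sequence) != width:
        raise ValueError("sequence must be as long as the motif")
    return sum(pwm[base][column] for column, base in enumerate(sequence.upper()))


# Hypothetical PFM built from five sites, five columns wide.
pfm = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}
pwm = position_weight_matrix(pfm)
print(round(score(pwm, "AGCCG"), 2))
```

Because of the pseudocount, no cell is minus infinity, so a sequence with an unseen base still gets a finite (low) score rather than being ruled out entirely.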
You get some score out of that just by adding up what's in the right row and the right column of the position weight matrix. If you want to know how good that score is compared to a really good match to the matrix versus a really bad match to the matrix, the easiest way to do it is to find out what the score is for the best match, which for this matrix will be 15.2, and the score for the worst match, which will be minus 10.3. Then you can put the absolute score you got into context by subtracting the minimum score from the absolute score and dividing that by the range of scores. So you can see that this particular sequence matches the SP1 motif with a score that is 93 percent as good as is possible with this model. That just tells you where you are over the space of possible matches to the model. Something you might actually be interested in doing is comparing that against all of the matches you might find in the genome, or whatever other search space you're looking at. You can look at all of the matches you might find genome-wide to that motif, and you can compare where you are versus what is to the right on the curve. So how often do you find a relative score that is better? You can convert that to an empirical p-value just by taking the frequency of what is under the curve to the right. All right. So where do transcription factor motifs come from? There are a variety of different databases you can use. The one that I think is most often used these days is called JASPAR. It is open and free. You can download a lot of individual motifs that have been curated from the literature. There are various other resources you can use as well, but I like using JASPAR.
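The relative score and the empirical p-value described earlier in this passage can be sketched in a few lines of Python. The best and worst scores (15.2 and minus 10.3) are the ones quoted in the lecture; the raw score of 13.4, the toy matrix, and the list of background scores are hypothetical values chosen for illustration. The p-value here counts scores at least as good as the observed one, which is one common convention.

```python
def score_range(pwm):
    """Worst and best achievable scores: sum the column minima and maxima."""
    width = len(pwm["A"])
    best = sum(max(pwm[b][column] for b in "ACGT") for column in range(width))
    worst = sum(min(pwm[b][column] for b in "ACGT") for column in range(width))
    return worst, best


def relative_score(raw, worst, best):
    """Place a raw score on a 0-to-1 scale between worst and best match."""
    return (raw - worst) / (best - worst)


def empirical_p_value(observed, background_scores):
    """Fraction of background scores at least as good as the observed one."""
    at_least_as_good = sum(1 for s in background_scores if s >= observed)
    return at_least_as_good / len(background_scores)


# Toy two-column weight matrix (hypothetical values).
toy_pwm = {
    "A": [1.0, -2.0],
    "C": [-1.0, 0.5],
    "G": [0.0, 1.5],
    "T": [-3.0, -0.5],
}
print(score_range(toy_pwm))

# Best and worst from the lecture; 13.4 is a hypothetical raw score.
print(round(relative_score(13.4, worst=-10.3, best=15.2), 2))

# Hypothetical relative scores from scanning some background sequence.
print(empirical_p_value(0.9, [0.1, 0.5, 0.9, 0.95]))
```

In practice the background list would hold the scores of every window in the genome or other search space, so the p-value reflects how often a match this good turns up by scanning alone.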
If you use any of the standard motif analysis software out there, especially things that use some sort of web search, JASPAR will usually be an option. Any questions on this part? All right, let's move on. So the next part is: how do we get these motifs in the first place? I started with a set of motifs from JASPAR. That came from somewhere. You can take the analysis I showed you of the set of binding sites in the beginning and create a position frequency matrix from that. But unfortunately, things aren't going to be aligned at the beginning. Francis, is there a question or something? No, no, sorry. I just saw you pop up. Oh, yeah, I just wanted to be present more, I guess, so you knew somebody was watching your talk. I suppose it's a bunch of closed cameras. I see. So I'm grateful to the nine other people who I think have turned their cameras on. You might encourage people to do the same. So, yeah. The motif discovery problem is: given a bunch of sequences that we suspect might have some sort of motif in common, how do we find that motif? We want to find that motif, or even a number of motifs. We don't know the width of the motif. We don't know where the motifs are. It's not all given to us at the beginning, like I showed you at the very beginning with everything nice and aligned in columns. You have to do that alignment yourself. This is hard because the inputs can be really long. They can be thousands or millions of base pairs. The motif instances can be very short, and they might only be slightly similar, because the motif can be highly degenerate.
So I'll take that slightly abstract task and give you a somewhat more concrete example. Let's say you do a gene expression experiment where you knock out some transcription factor, for example, and you do RNA-seq before and after. Then you look at a set of genes that have changed between the two versions of the experiment, and you want to figure out the motif of something that might be responsible for changing which genes are expressed. Did I? So actually, yeah, I was sighing. Sorry, but it was a prelude to a question. In the previous slide, you said the input you use could be quite big. Is that because experimentally it's large? Or is it because you don't know where to limit your search space, so you're taking, I don't know, let's say a hundred kb in front of every gene, or something like that? Yeah. I mean, basically it's not because you've done experiments that point you to a million bases. It's because you've decided to look at a million bases, right? Yeah, I mean, usually millions is kind of an outer limit. Yes, yes. But it's actually an important concept, too. So most transcription factors are within a hundred kb or within 10 kb, right? Yeah, I mean, if you were going to do exactly what I'm talking about here, 10 kb is probably the outer limit of what you would look at upstream of a gene. I'm going to say human gene. We're talking about human genes. Yes. That's not to say that a relevant transcription factor might not be a hundred kb away. That is quite likely. Yeah, I think it is still an answer.
But at that point, the haystack gets a little too big and it gets harder to pick that needle out. So usually when people are doing the sort of analysis I'm describing here, they might be looking at 10 kb upstream of the gene at most, and more often probably five thousand or two thousand base pairs or something like that. There are other settings in which you can do this motif discovery problem. Instead of the example I gave, you can, say, use a lot of ChIP-seq data and just look at all of your ChIP-seq peaks. Or you can do differential open chromatin analysis as well. With any sort of experiment where you end up with a set of regions of the genome that you think have something to do with the transcription factor that is bound, you can search within them. But the challenge in those is relating it back to the gene of interest. In the analysis I'm describing here, you kind of have a model that there is one transcription factor involved. But really, this is free of the information of which gene might actually be causing the change in transcription. You're just trying to look for little DNA words that might be involved. I don't want to take too much time here, but the point you're making is that you're always starting from known transcripts and then looking upstream of those at regions. So you do know the gene, right? You're targeting that stretch of DNA because it's upstream of a gene of interest. Yes, in this case. Well, in this case, it's a number of different genes. But yeah. I think I was interpreting "gene of interest" slightly differently than you were using it. OK. But these are all really good questions, so thank you for this.
OK. Another challenge in this problem is that transcription factors don't see the genome the way we do. Often things look the same to them forwards and backwards. What we think of as being forwards and backwards is really quite arbitrary in a number of ways. So you need to look both for some motif and for the reverse complement of the motif. But there's also actually a three-dimensional space, too. I'm not even getting into that. Yeah, but that is an issue. Yes, it is an issue. That is not part of this particular problem, but it is a part of the problem generally. So our problem here is to discover the sites, or just the motif, given just the sequences that we think are carrying some common motif in them. How do we do that? We use a common approach in modeling, which is called an alternating approach. The first step in the alternating approach is that we guess. We have some method that gives us an initial position weight matrix. We can construct it randomly, and we use that to find instances of the motif in the input sequences. Then we use those instances to identify a new weight matrix, and we repeat this process over and over again. This is an approach that is often used in various sorts of modeling problems and machine learning problems where you don't know the parameters of your model, but you have a lot of examples that you might train them from. Usually for motif elucidation, people use something called the expectation maximization algorithm, which is an alternating approach. I'm going to show you a slightly simpler alternating approach, which is called the Gibbs sampler. But for your purposes, it's pretty similar in conception.
So here is how we will exemplify the steps from the little algorithm I told you about on the last slide. We have a bunch of sequences. Instead of just randomly coming up with a matrix, we'll randomly pick little subsequences within each of these sequences. Then we will turn those into a position frequency matrix, and we'll turn the position frequency matrix into a position weight matrix. Then we will use that. We will, for example, take out one of the sequences. Let's take out sequence four and build a position weight matrix based on the sites we've identified for one, two, three, and five. And let's score sequence four using this matrix. At the bottom here, you can see what the scores are for sequence four, using the matrix generated with all of the other sequences except for sequence four. And this is a key part. We don't just go for whatever has the best score within the sequence. We will proceed probabilistically. We will turn our scoring landscape into probabilities, and then we will pick, with the probability of whatever is available at different positions, a subsequence to go with for the next round of our sampling. So even though this is 52%, this right here has a score of 20%. If we repeated this randomly, about half the time you would choose this subsequence, but about a fifth of the time we'll choose this one. Then we'll cross out sequence five, generate a new position weight matrix, and repeat the same sort of thing on sequence five, and do this over and over again. The fact that we don't just go for whatever has the maximum score here keeps us from getting stuck in local maxima.
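The sampling loop described above can be sketched in Python as follows. This is a minimal illustration, not the implementation used by any particular tool. It assumes a fixed motif width, a uniform background, a pseudocount of one, forward-strand sites only, and exactly one site per sequence. The input sequences and all function names are hypothetical.

```python
import math
import random

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log2-odds weight matrix (one dict per column) from aligned sites."""
    width = len(sites[0])
    pwm = []
    for column in range(width):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[column]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pwm


def window_scores(pwm, sequence):
    """Score every window of motif width along the sequence."""
    width = len(pwm)
    return [
        sum(pwm[i][sequence[start + i]] for i in range(width))
        for start in range(len(sequence) - width + 1)
    ]


def gibbs_sample(sequences, width, iterations=200, seed=0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    # Step 1: guess a random site in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in sequences]
    for step in range(iterations):
        held_out = step % len(sequences)
        # Step 2: build the matrix from every site except the held-out one.
        sites = [
            s[p:p + width]
            for i, (s, p) in enumerate(zip(sequences, starts))
            if i != held_out
        ]
        pwm = build_pwm(sites)
        # Step 3: turn the score landscape into sampling weights
        # (2**score undoes the log2, giving a likelihood ratio per window).
        weights = [2.0 ** s for s in window_scores(pwm, sequences[held_out])]
        # Step 4: sample a new site in proportion to those weights,
        # rather than always taking the maximum.
        starts[held_out] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


# Hypothetical input sequences for illustration.
sequences = [
    "ACGTTGACATCGA",
    "GGTTGACACCTAG",
    "CTAGCATTGACAG",
    "TTGACAGGCATCA",
    "CAGTCTTGACATT",
]
starts = gibbs_sample(sequences, width=6)
print([s[p:p + 6] for s, p in zip(sequences, starts)])
```

Fixing the seed makes a run repeatable, but different seeds can end at different sites, which is why samplers like this are often run several times from different starting guesses.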
This doesn't actually guarantee that you will get the best global motif for a set of sequences, but it works surprisingly well. So people will use this or, as I said, they'll use expectation maximization, which is harder to explain but has some nice properties, and they'll use that to generate various motifs. Once you generate one of these motifs, what do you do with it? One thing you can do, if you've discovered a new motif de novo, is use Tomtom, which is part of the MEME Suite. It has a website, and there's also software you can run yourself if you want. You can, say, search your query motif against all of JASPAR, and it'll give you a variety of stats. So for this query motif right here, it says, oh, it actually looks like motif 795 in JASPAR and has a decent E-value. Anyway, that's one thing that you can do. Instead of doing that sort of de novo motif discovery yourself, you can also scan the whole genome with a set of existing motifs like those from JASPAR. But I think it's important to know where they come from. I'll tell you a little bit more about scanning the whole genome in the next part. Are there any questions first? Any other questions? Okay. All right. How well does this model actually work? A variety of researchers have taken transcription factor binding site models that come from position weight matrices generated the way that I've described to you and have shown that, for example, most of the predicted sites are bound in vitro. Stormo and Fields found in biochemical studies that in vitro, the best weight matrices produce scores that are highly correlated with binding energy. So it's a probabilistic model, but it actually seems to fit quite well with what you might get in terms of an energy-based biochemical model. Does anyone see a problem with this? A key phrase that might indicate some problems with this model? In vitro?
Yes, in vitro. You've seen this before, Francis. No, no, sorry. I may have seen this before, but I had forgotten it. I believe that. It sounds like something I would do. Yes, that is the key phrase. So, this is an amazing model. It works super well in vitro. It's a really, really great model in vitro. Okay, so here's the bad. In vitro. I'm not gonna say anything anymore. No, I love this. This makes it more interactive. I think it's great. But you said it right. I was about to say, you know, very unique and Ruth can't answer because I know they've heard this before. But anyway, look at the MyoD transcription factor. In this paper, Fickett found that there's a predicted transcription factor binding site once every 500 base pairs of human DNA sequence. And here's the ugly. Once you have hundreds of different transcription factor binding site motifs and you scan just a single gene, you find that the whole gene is covered with matches to transcription factor binding sites. This leads to what Wyeth Wasserman calls the futility conjecture, which is that transcription factor binding site predictions are almost always wrong. There's a reason I present this lecture in the order in which I do. I come to the part where this is all wrong after I've shown you how everything works. Does anyone have any suggestions on what we might do to make our models more useful? That I do remember, but I'm not gonna say. Don't say. Any of the students, the participants? What's accessible? What's accessible. Okay, that's a great suggestion. You can see that times have really moved on, because that's a right answer. The first time I presented this many years ago, people would have given the wrong answer. Let's go to the wrong answer first. So one thing that people have thought of is: what if you just have a higher threshold on the model?
That doesn't actually help, because the ratio of predictions that actually represent real binding sites in vivo doesn't have much to do with the energetic factors of how the binding site works with the transcription factor, once you get beyond a particular point, that is. True binding sites are defined by properties not incorporated into the profile scores. And now that we have ChIP-seq data that tells us in vivo where hundreds of transcription factors and other chromatin regulators are, we find that there are a lot of transcription factors where the data indicates biochemical presence and yet there's not a good match to the motif at all. So we're kind of messed up going and coming: we make a lot of predictions that are invalid, and in a lot of the places where we know the transcription factor is present, we can't actually find a decent motif match. Let's skip that. So as Austin suggested, there's other biochemical information about the context of the region, which I told you about, that will tell you where binding sites are likely to actually occur, and which you can use in concert with the sort of sequence-based method that I told you about before. So where can you get this sort of information? One place you can get it is from software called Segway, which is from my lab. Segway was developed as part of the ENCODE project. It takes a lot of different information from things like ATAC-seq or ChIP-seq, does integration of data from multiple assays across the genome, and then allows you to define things like where transcription start sites are and where repressed regions of the genome are. Maybe you should be looking for transcription factor binding sites more in those regions that have signs of activity.
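One simple way to combine sequence-based predictions with this kind of context is to keep only the motif matches that overlap regions showing signs of activity. The sketch below shows that idea in Python. All coordinates, scores, and names are hypothetical; intervals are assumed to be half-open (start inclusive, end exclusive); and the active regions are assumed to come from some external annotation, such as open chromatin peaks.

```python
def overlaps_any(start, end, regions):
    """True if the half-open interval [start, end) overlaps any region."""
    return any(
        start < region_end and region_start < end
        for region_start, region_end in regions
    )


def filter_matches(matches, active_regions):
    """Keep motif matches that overlap an active region on the same chromosome.

    matches: list of (chromosome, start, end, score) tuples.
    active_regions: dict mapping chromosome to a list of (start, end) tuples.
    """
    return [
        match
        for match in matches
        if overlaps_any(match[1], match[2], active_regions.get(match[0], []))
    ]


# Hypothetical motif matches from a genome scan.
matches = [
    ("chr1", 100, 110, 0.95),
    ("chr1", 5000, 5010, 0.97),
    ("chr2", 100, 110, 0.99),
]
# Hypothetical active regions for one cell type.
active_regions = {"chr1": [(50, 200)], "chr2": [(1000, 2000)]}

kept = filter_matches(matches, active_regions)
print(kept)
```

Note that the highest-scoring match in this toy example is discarded because it falls outside every active region, which mirrors the point that raising the score threshold alone does not pick out the sites bound in vivo. The linear scan over regions is fine for a sketch; a real analysis would more likely use an interval tree or a tool built for genomic intervals.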
So maybe you should look at them here, right? Where Segway says regulatory, or you can look in this slightly broader region where it says transcription start site related or something like that. Anyway, if you go to segway.hoffmanlab.org you can load it into the UCSC Genome Browser. But there are other sorts of context that matter as well. People are starting to look more into the local shape of DNA at particular positions. So we like to have this rigid model of DNA. In reality, DNA is not so rigid, and DNA will change shape a little bit depending on what the context of nucleotides is. So you have what people call propeller twist. That's the thing in the middle here, where the two bases in a pair twist in slightly different directions. Things like helical twist work kind of like this. These are all things that can affect where transcription factor binding sites work. And you can kind of scan position weight matrices against things like helix twist and propeller twist and so on. There are also a variety of methods, things like CENTIPEDE or HINT, that will use open chromatin footprints, and methods like Virtual ChIP-seq from my own lab integrate a lot of different kinds of data. There are a lot of big challenges ahead, such as understanding all transcription factors across the developing organism, understanding how genetic variation affects the transcription factor binding site, integration of more complex models, and maybe a transition from these simple matrix-based models, which have worked so well but really don't model everything we know about the way transcription factors work, to models that can incorporate dependence of one position within the motif on the next, right?
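To make the last point concrete, here is a small Python sketch contrasting a simple matrix model, where every position is independent, with a first-order model where each position depends on the one before it. This is one simple way to add dependence, not the method of any particular tool. The training sites are invented so that positions 2 and 3 always co-vary ("AT" or "GC"), and the pseudocount of 0.5 is an arbitrary assumption.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def train_pwm(sites, pseudocount=0.5):
    """Per-position base probabilities, each position treated independently."""
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        pwm.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return pwm

def train_first_order(sites, pseudocount=0.5):
    """For each position i >= 1, estimate P(base at i | base at i - 1)."""
    tables = []
    for i in range(1, len(sites[0])):
        pair_counts = Counter((site[i - 1], site[i]) for site in sites)
        prev_counts = Counter(site[i - 1] for site in sites)
        table = {}
        for prev in ALPHABET:
            total = prev_counts[prev] + pseudocount * len(ALPHABET)
            for cur in ALPHABET:
                table[(prev, cur)] = (pair_counts[(prev, cur)] + pseudocount) / total
        tables.append(table)
    return tables

def log_prob_pwm(pwm, seq):
    return sum(math.log(column[base]) for column, base in zip(pwm, seq))

def log_prob_first_order(pwm, tables, seq):
    # First position uses the PWM column; later positions condition on the previous base.
    log_prob = math.log(pwm[0][seq[0]])
    for i in range(1, len(seq)):
        log_prob += math.log(tables[i - 1][(seq[i - 1], seq[i])])
    return log_prob

# Invented aligned sites: the middle two bases are always "AT" or "GC", never mixed.
sites = ["CATG", "CGCG", "CATG", "CGCG"]
pwm = train_pwm(sites)
tables = train_first_order(sites)

for candidate in ("CATG", "CACG"):
    print(candidate,
          round(log_prob_pwm(pwm, candidate), 3),
          round(log_prob_first_order(pwm, tables, candidate), 3))
```

The matrix model cannot tell "CATG" (seen in training) from "CACG" (a mix never seen), because each column on its own looks the same for both. The first-order model scores the mixed sequence lower, at the cost of needing more parameters and therefore more training sites.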
Finally, if you want to get a picture of how transcription works overall, you have to deal with this massive amount of complexity. It's not just an individual transcription factor binding like I showed you in the first slide, but many transcription factors interacting with each other across three-dimensional space, all simply to define whether an individual gene is transcribed, right? And then, of course, there's still all that downstream stuff. How does the gene get spliced? How does the RNA degrade or not? How does it get translated? So many things can affect whether the RNA sticks around or gets translated into protein and so on. So, Michael. Yes. Isabel from Wynne Hyde's lab asks a very good question: where does tissue specificity come into play? Yes, that's a good question. So the interesting thing is, of course, if you're using a purely sequence-based model, there is no tissue specificity, right? So you have to use things like the data from ENCODE that will tell you which parts of the genome are different in which tissue types. And if you use the latest generation of methods for examining transcription factor binding, like my own lab's Virtual ChIP-seq, you can incorporate that sort of information. But it's not really done necessarily when defining the motif. You can add that as a layer on top, where you look only in regions that you think are active in a particular cell type for the binding of a particular transcription factor. Good question. Are there other questions? Well, I had another one which I thought you were going to get to, I guess, sort of taking into account the evolutionary relatedness of transcription factors between closely related organisms. Yeah, so that's another one. There's a variety of other directions we can go in from here.
There's additional information that we need to bring to bear on the problem, right? So sequence isn't enough. One thing that you can use is evolutionary conservation, as Francis points out, which has pros and cons. On the one hand, when you look at non-coding regions that are conserved across different organisms, often you will find they are conserved for a reason, and that's a strong indicator that there's actually some transcription factor binding site that is important and under selective pressure there. So it can be a good way of eliminating false positives. But if you look only at those regions, you're going to get a lot of false negatives. You're going to miss a lot of things. So it's important to realize, if you're considering evolutionary conservation, that conservation and selection don't really work quite the same way in non-coding regions as they do in coding regions. Coding regions are very nice. You can make nice alignments of things. These are all A's, but these are alanines, right? Not adenines. So let's say you have some region of some protein. You might find that across many organisms things will often be very similar, and where they aren't, they'll at least be in the same order. Like, let's say you have some region in the middle of some protein that's disordered. This might not match across species, but the parts on the left and the right will match very well across species. This rule about order doesn't really hold when you're looking for alignments across species in non-coding regions. What I think is a better model is one that was developed by Duncan Odom and Michael Wilson and others, which is that there can be transcription factor binding site turnover.
So let's say that there's an important selective reason for there to be a transcription factor binding site for a particular transcription factor upstream of some gene, right? Let's say you need the star and you need the square and you need the circle. Because of the way things work in three dimensions, these don't necessarily need to be in the same order. So over evolutionary time, you might find a redundant star transcription factor binding site arises in the evolutionary lineage, and then this one isn't necessary anymore. So now the order has changed, which for a traditional alignment algorithm might look very different. It's not going to see these things as being related because they're on the wrong side of the circle and square. But for the transcriptional machinery in the cell, there might be no difference between these things, and it works just as well. But here you're assuming that the alignment tool you're using is looking for strings, the longest string, as opposed to the presence of three substrings in whatever order. Yeah. So if you are looking instead at which transcription factor binding sites you find in a particular region, that might be a much better approach than, say, just looking for a long string. Yeah. Let's see. I've already said much of the reflections here. Sorry. But you need to, no, no, no, it's fine. Like I said, I like these things to be interactive. It's a little more interactive usually when we meet in person. So I appreciate Francis adding this one.
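The contrast between order-sensitive comparison and "which sites are present" can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the three motif strings standing in for star, square, and circle are hypothetical, the two "species" sequences are invented, matching is exact and on one strand only, and the specific orders used are chosen for the example rather than taken from the slide.

```python
from collections import Counter

# Hypothetical stand-ins for the star, square, and circle binding sites.
MOTIFS = {"star": "TGACTCA", "square": "CACGTG", "circle": "GATAAG"}

def site_order(seq, motifs):
    """Return motif names in the order their exact matches occur in seq."""
    found = []
    for name, word in motifs.items():
        pos = seq.find(word)
        while pos != -1:
            found.append((pos, name))
            pos = seq.find(word, pos + 1)
    return [name for _, name in sorted(found)]

# Invented upstream regions: same three sites, but the star site has moved
# to the other side of the square and circle in species B.
species_a = "AAT" + "TGACTCA" + "CCTA" + "CACGTG" + "TTAGC" + "GATAAG" + "AT"
species_b = "AAT" + "CACGTG" + "TTAGC" + "GATAAG" + "GGA" + "TGACTCA" + "AT"

order_a = site_order(species_a, MOTIFS)
order_b = site_order(species_b, MOTIFS)

same_order = order_a == order_b                       # what an order-sensitive view sees
same_content = Counter(order_a) == Counter(order_b)   # which sites are present, any order

print(order_a, order_b)
print("same order:", same_order, "| same sites:", same_content)
```

An order-sensitive comparison reports these two regions as different, while comparing the collection of sites reports them as equivalent, which is closer to how the transcriptional machinery might treat them under the turnover model.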