Okay. Good morning, everyone. Welcome back to the third day. I was going to make a joke about how the weather is like this year-round, but I'm about 24 hours too late. The weather isn't like this year-round, and it isn't like it was yesterday year-round either. So welcome again. I'm going to talk about gene regulation and motif analysis, and I'm told that people are feeling very free to ask questions in the middle, and I hope you will continue to do that. Anything you have a question on, I'm happy to answer it, or at least attempt to answer it. So we're going to talk about a few things here. In general, we're going to talk about transcription factor binding. We're going to talk about what the challenges are in predicting transcription factor binding computationally. At the end of the workshop, you should be able to identify binding sites of known transcription factors, and at the end of the lab that Varinik will teach, you should be able to discover transcription factor binding motifs in genomic regions like, for example, ChIP-seq peaks or promoters, using iRegulon. The lecture part has several different components. We're going to talk about transcription in general, transcription factor binding sites, motifs, the models we use, and so on and so forth. But first, let's just start with what transcription is. So I run a lab that focuses on transcriptional regulation, and this is how we think about the way that it works in eukaryotes. You have some region of DNA, and then you have a transcription factor binding site, which we have here in green. There's a transcription factor, which is shaped so that it recognizes a particular transcription factor binding site. You have this one transcription factor, and it binds. RNA polymerase II comes along, and then RNA polymerase II produces RNA, and then that's it. We can go home, because that's the most important part of biology, according to me anyway.
All that phenotype stuff, the protein, the morphology, organismal behavior, that is strictly someone else's problem and not my problem, and not something I'm going to talk about here today. Because this alone is quite a big problem. It's quite hard even to predict where and when this is going to occur, as I'm going to talk about today. So even if you're just trying to figure out when transcription is going to be initiated, this model is actually a little too oversimplified, as it says, because there are a lot of other things that go into transcription factor binding, transcription initiation, and transcriptional regulation. All right, so this transcription factor binding site is often part of a promoter region. There's a core promoter, which is very close to the transcription start site. The core promoter often has some sequences that are necessary for transcription initiation, and there are regulatory regions, both proximal and distal to the transcription start site. So the distance between a transcription factor binding site and the transcription start site has some influence on whether there will be any actual transcription at that location. Things are also influenced by a complement of different transcription factors that can potentially interact with each other. We also have sites that are more distal from the transcription start site, like enhancers. Enhancers can either enhance the transcription of a particular gene in a particular cell type or not. There are a lot of factors that, even at a single locus, go into deciding whether there's transcription or not. And things get a lot more complicated, because computational people in particular, over the previous decades, have liked to boil the genome down into some sort of string.
This is why it was attractive for a lot of computer science types to come into the field is you could just think of the genome as a string of letters. The genome is not a string of letters. The genome is a physical object and it's an object that folds back on itself in three dimensions and people are more and more understanding how to examine the three-dimensional nature of the genome and starting to understand how it influences transcriptional regulation. One thing about enhancers, for some time people were trying to understand why these distal enhancers so far away from transcription start sites can influence the transcription of those genes. It's because they're often either quite close in three dimensions or they're close to something else in the middle that's mediating the interaction. So there's a lot of different factors that can go into deciding whether transcription occurs. There's some other factors as well and these are things that people have started to measure over the last decade and a half or so under the category of functional genomics. Here is a diagram of a chromosome and we can zoom in through several orders of magnitude seeing sort of supercoiling of the chromosome. You zoom in even further and you see individual nucleosomes. You can zoom in further. You can see the DNA wrapped around the nucleosomes and the nucleosomes are made up of histone proteins. You zoom in further until you actually see individual DNA bases and RNA polymerase appearing and transcribing some of these DNA bases, which as I mentioned is the only thing that I care about. But if you care about other things they probably occur downstream from that somewhere. We can measure a lot of these things. So we can measure the structure. We can measure where the nucleosomes are actually positioned. So there are regions of open chromatin between nucleosomes. 
We can measure them: they show up as, say, hypersensitive sites in DNase-seq, and people can now also find them with methods like ATAC-seq. We can measure things about the histones. The histones have tails that carry a variety of post-translational modifications, so they can have things like methylation or acetylation. You can measure all of that genome-wide, with great specificity for individual cell types and so on. And we get a lot of specificity in measuring what the output of this machine looks like as well. You can see where RNA is transcribed, not just for individual protein-coding genes, but for all sorts of non-protein-coding genes too. You can see which introns are being transcribed at some point and then spliced out, or sometimes not spliced out. Sometimes there's alternative splicing, and that can affect downstream stuff as well. Using all of these different data, we can get a model not just of where the genes and the transcripts are, but also of where the cis-regulatory elements are, those nearby regulatory elements like promoters and the transcription factor binding sites near the promoters, and of the distal regulatory elements, things like enhancers or insulators. So in the terminology used here, people usually take cis to mean something that is close by rather than something that is distant. If you're alluding to the fact that the terminology is used otherwise in other sub-disciplines of biology, that is true. But people have become a little loose with things here. Cis generally means, in terms of describing transcriptional regulation, I would say within 5,000 base pairs of the transcription start site. So a lot of these things that you can measure were pioneered by the ENCODE Project in 2007.
There was a paper for the pilot phase of the ENCODE Project where they measured a lot of these attributes for 1% of the genome, and then 5 years later they measured more of them, across 100% of the genome, using what were then very new and exciting sequencing technologies, which I guess now, 10 years down the road, seem a little older. But the key difference between this and previous genomic studies was that instead of just trying to figure out what the genome for a species was, we were trying to figure out what the biochemical and biophysical state of the genome is in particular kinds of cells. So within a multicellular organism, each type of cell is going to have different patterns of open chromatin, different patterns of histone modifications, different patterns of where genes are transcribed, and so on. And this was one of the first attempts to comprehensively measure a lot of these things in a lot of different cell types. All right, so the ENCODE Project has a lot of data. If you want to get access to the ENCODE data, you can go to encodeproject.org. A lot of the data is also available through the UCSC Genome Browser or Ensembl. You can also get this sort of data elsewhere. Most journals insist that if you produce data of these types, you have to put it on GEO, so you can go there. But that's much more of a repository for raw data. If you want to actually examine these things, I would encourage you to look at, say, the ENCODE Project website, or go through one of the genome browsers that's set up for it instead. Later on, I'll show you some examples of what sorts of things you can find in the genome browser. Any other questions for this introductory part on what transcription is and why we should care? Then we move on to part two. So part two is, in other words, teaching a computer to find transcription factor binding sites. So let's say we have a way of measuring where a transcription factor binds. And there are a lot of different ways of doing this, in vivo or in vitro.
And you can look at a single site, and the single site will have a sequence that looks very much like this. For most transcription factors, finding something like that is not enough to give you a model that will allow you to predict where the transcription factor will bind, because a single transcription factor can actually bind a whole set of binding sites that are similar to each other but not identical. And the degree to which they are similar to each other changes from position to position. So if we use our method for finding a bunch of binding sites, and we find binding sites that look like this, even if you're just looking at it by eye, you can see some similarities between different parts of this list of sequences. For example, you can see the first column is usually A, but it's not always A. It's sometimes C or G. Or you look at the fourth and the fifth columns, which are almost always TT. If you look about two-thirds of the way down, you can see that somewhere in the fifth column there's not actually a T. So there's some amount of variation that's allowed. You'll still find that a transcription factor will bind to a site even if it's changed a little bit from what is seen as the consensus. So there are a number of ways you can represent the transcription factor's preference, or motif. One is that you can use an alphabet that gives you wildcards. So how many of you have seen the convention where you use R to mean A or G, or Y to mean C or T? You've all seen that? About half of you have seen something like this. This is something you can do. IUPAC has published a set of rules for what the different letters mean, and there are letters that you can use to mean not necessarily A, but maybe A or G. And it turns out they've actually created a whole alphabet that you can use to represent every possible combination of A, C, G, or T. So we have V here for the first column.
V means A or C or G, because you never see T in this first column. This fourth column is always T. But the fifth column, as I mentioned, is T except in one case where it is A. So they use W. W is A or T. The W stands for weak: A and T are weak, C and G are strong, so those are S. You don't need to write that down. I'm just telling you there is a logic behind where these letters come from. If you want to, you can look it up. This is somewhat inadequate, though, because we're representing this first column as V, which means A or C or G, and the fifth column as W, which means A or T. Can anyone tell me a qualitative difference between the first column and the fifth column? No looking at the notes where this is already explained, because that would be cheating. Anyone? What's that? Yeah, how often they appear, right? In the first column, it's usually A, but fairly regularly it isn't, versus this fifth column where it's almost always T. So you're a lot more surprised if you see something that's not T in the fifth column than something that's not A in the first column. We want a way to represent that as well, and the simplest method is called a position frequency matrix. In the position frequency matrix, we have a row for each letter in the alphabet, and we have a column for each column in our set of binding sites. And we simply count up the number of times we see each letter at each column. So in the first column, we have 14 A's, 3 C's, 4 G's, and 0 T's. In this fifth column, we have 1 A and 20 T's. In the fourth column, you have 0 A's, C's, or G's and 21 T's, and so on. To a first approximation, this will incorporate what is known, or certainly what we think are the most interesting features, of this set of binding sites. And once you have things in this sort of matrix form, you can turn it into a sequence logo. So if you look in the literature, most of the time when people are talking about motifs, they don't list a single site.
They don't list a consensus sequence with ambiguous letters, because of the problems we mentioned, or at least they haven't been doing that since the early 90s. If you do go and look in old papers, you'll often see a lot of use of R or N or W or S to describe different things. But since we've had widespread access to computers, people have mainly switched to the sequence logo. And this is a nice representation, because the human eye can intuitively understand parts of it in a way that you can't understand the matrix without staring at it and thinking about it a lot. So for example, you can see that in this fourth column the T is bigger than in the fifth column. That means that in the fourth column it matters more that it's a T. It's always a T in the fourth column, and anything that's not a T won't match. Whereas in the fifth column, there's a small chance that it might be A. So this is one way we can represent a motif. Any questions about this? Is that based on humans? Let's say you suspect a sequence to be a binding motif. Are you comparing it to other sites within the human genome, or also to other species? This is just the general idea, one version of the model. I'll talk about how you identify the sites using this model in a minute. Is there another question? Okay. So if you see something like this in the literature, I misled you guys a little. It usually won't actually mean this. The prevailing matrix model that people use is actually a little bit more complicated. This is the simplest version, the position frequency matrix. But what people have usually been using for the past three decades or so is something called a position weight matrix, which people also describe as a position-specific scoring matrix, or PSSM. These terms mean the same thing. And the position weight matrix, or PWM, is a transformation of the position frequency matrix. So let's start with a position frequency matrix for a simpler set of sites.
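The counting behind a position frequency matrix, the IUPAC consensus, and the letter heights in a sequence logo can be sketched in a few lines of Python. This is a minimal sketch, not the method behind any particular tool: the five aligned sites are made up for illustration (they are not the sites on the slide), the function names are hypothetical, and the logo height is taken as 2 bits minus the column entropy, assuming a uniform background and no small-sample correction.

```python
import math

ALPHABET = "ACGT"

# IUPAC nucleotide codes, keyed by the set of bases each code stands for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def position_frequency_matrix(sites):
    """Count each base at each column of equal-length, pre-aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    pfm = {base: [0] * width for base in ALPHABET}
    for site in sites:
        for col, base in enumerate(site.upper()):
            pfm[base][col] += 1
    return pfm


def iupac_consensus(pfm):
    """One IUPAC letter per column, covering every base seen at least once."""
    width = len(pfm["A"])
    return "".join(
        IUPAC[frozenset(b for b in ALPHABET if pfm[b][col] > 0)]
        for col in range(width)
    )


def information_content(pfm):
    """Bits per column: 2 minus the Shannon entropy of the column."""
    width = len(pfm["A"])
    bits = []
    for col in range(width):
        total = sum(pfm[b][col] for b in ALPHABET)
        entropy = 0.0
        for b in ALPHABET:
            p = pfm[b][col] / total
            if p > 0:
                entropy -= p * math.log2(p)
        bits.append(2.0 - entropy)
    return bits


# Hypothetical aligned binding sites, invented for this example.
SITES = ["ACGTT", "AGCTT", "CCGTA", "GAGTT", "ACGTT"]
PFM = position_frequency_matrix(SITES)
CONSENSUS = iupac_consensus(PFM)
BITS = information_content(PFM)
```

With these toy sites the consensus letters for the first and fifth columns are V and W, which hides exactly the difference the matrix keeps: the fourth column, which is always T, carries more information than the fifth, which in turn carries more than the first.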
We only have five binding sites, and the matrix is a lot smaller, to make it easier to see what's going on. And we're going to do three things to make this model more useful. The first thing we do is take each one of these numbers and correct for the nucleotide frequencies in the genome. So essentially we will divide by how often we see a particular nucleotide in the genome overall when we're scoring a match. So this is species-specific. But basically what we're trying to do with this model is measure how surprised we are to see a particular string of nucleotides. If you're using, say, an AT-rich genome, a genome where 70% of the base pairs are A or T, you're going to be a lot less surprised when you see a string of nucleotides that has A's or T's in it. So you get additional points in an AT-rich genome for having C or G in your motif, or vice versa in a CG-rich genome. The second thing we're going to do is weight for the number of samples that went into the position frequency matrix. We only had five samples going into this one, versus the 21 we had going into the previous matrix. You want to incorporate that, because it gives you some amount of confidence in whether the motif you've come up with is actually representative of what the transcription factor binds. So you only have five observations. You don't want to rule out something just because it has never been seen in these five observations. Maybe if you took six observations, you would find a T in the first position. Maybe you'd find an A in the second position, or something like that. The way we do this is we add what's called a pseudocount. A pseudocount is essentially a way of getting rid of any zeros in the matrix. And a common strategy for pseudocounts is that you just add one to everything. Adding one to everything smooths over any sort of sampling noise you might have had in the first place, right?
It'll turn a column like five, zero, zero, zero into six, one, one, one, which means that you don't get infinitely penalized for having something that hasn't appeared in what you've seen so far. Because basically the way we're going to score these things is by multiplying the frequencies by each other. So if you have a zero anywhere, the scoring will infinitely penalize anything that hits that zero. So we get rid of all of the zeros. Just adding one affects PFMs with smaller samples much more than those with bigger samples. So if we started with five here and we add one to everything, it changes the ratios from 100%, zero, zero, zero to, you know, six out of eight... no, I can't do math, six out of nine, sorry: six out of nine, one out of nine, one out of nine, one out of nine. If we had, say, 96 samples in the PFM, then instead of these other things that we never saw getting one out of nine, or 11% probability, they'd get one out of a hundred, or 1% probability. So this is a way of eliminating zeros and of letting the confidence of your matches depend on how many samples you had to start with. And the final thing we're going to do is take the log of all of this. That's simply to make the arithmetic easier for the computer. I know most of you probably think that computers are pretty fast at doing things like multiplication, which they are, but computers are much faster at adding than they are at multiplying. And if you're going to do something like this over and over and over again, it's often a lot faster to convert to log space first, and then you can just add things. You add things in log space and you exponentiate them out of log space, and it's the same thing as if you'd multiplied them together, right?
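The three steps just described (background correction, a pseudocount, and a log transform) can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `pfm_to_pwm` is hypothetical, the pseudocount is the "add one to everything" strategy from the lecture, the log base is 2 (the lecture does not specify a base), and the example counts and the AT-rich background frequencies are made up for illustration.

```python
import math

ALPHABET = "ACGT"


def pfm_to_pwm(pfm, background=None, pseudocount=1.0):
    """Turn counts into log-odds scores: log2(smoothed frequency / background)."""
    if background is None:
        # Assumption: uniform base composition unless told otherwise.
        background = {base: 0.25 for base in ALPHABET}
    width = len(pfm["A"])
    pwm = {base: [0.0] * width for base in ALPHABET}
    for col in range(width):
        # The pseudocount removes zeros, so the log is always defined.
        total = sum(pfm[b][col] + pseudocount for b in ALPHABET)
        for b in ALPHABET:
            prob = (pfm[b][col] + pseudocount) / total
            pwm[b][col] = math.log2(prob / background[b])
    return pwm


# Hypothetical two-column PFM built from five sites.
# The first column is the 5, 0, 0, 0 case: 6/9, 1/9, 1/9, 1/9 after smoothing.
TOY_PFM = {"A": [5, 0], "C": [0, 2], "G": [0, 3], "T": [0, 0]}

PWM_UNIFORM = pfm_to_pwm(TOY_PFM)

# Made-up AT-rich background: seeing an A is less surprising, so it scores lower.
AT_RICH = {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}
PWM_AT_RICH = pfm_to_pwm(TOY_PFM, background=AT_RICH)

# Adding in log space gives the same answer as multiplying probabilities.
PROBS = [6 / 9, 4 / 9]
LOG_SUM = sum(math.log2(p) for p in PROBS)
LOG_OF_PRODUCT = math.log2(math.prod(PROBS))
```

Bases that were never observed in a column end up with negative scores rather than an infinite penalty, and the same A scores lower against the AT-rich background than against the uniform one.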
So if we do all of this with this PFM, we get this position weight matrix, or position-specific scoring matrix. And you can see it's quite interesting: now the numbers aren't just zero or nonzero. The numbers are actually negative or positive. So having something that matches the motif well will give you a positive score at that column, and having something that maybe you never saw before gives you a negative score, and so on. So here's how you would score the site TGCTG against this position weight matrix. You simply look at each one of these columns, T, G, C, T, G, you find the appropriate row at that column, you get the number, and you add them all together. That's your score for a potential binding site. Yes? You get the minus, sorry? How did I get the minus? Okay, so basically if we take these numbers, add the pseudocount, divide by the genome-wide probability, and take the log, anything where the ratio is less than one is going to be negative. So if you take all this, divide by the genome-wide probability, and the ratio is exactly one, then we get a score of zero. If it's less, we get a score of negative something. So you can get a sequence with a negative score, like a whole string? That's right, that's right. So if we had, say, TGCTG, that would get a negative score. Other questions about that? So you have to somehow put that score range into context, because this particular matrix has one score that is most negative and a different score that is most positive, and that's going to change from matrix to matrix. What we can do is start with a raw score. The raw score is something you measure the way I just told you. So here we have a site: we're scanning the genome and looking for matches to this SP1 matrix. And so we'll examine every single window of the motif width, right?
And here's a window that starts GGG something, and you count along, GGGGGGC. We take all of these numbers, we add them together, and we get a raw score of 13.4. We want to convert this to a score that puts it in the context of how well or how poorly you could do with this motif in general. And the way we do this is we find the maximum possible score for the motif. So what's the best possible score at every column? Here it would be something like GGGGC, GGGGT. And then we can do the same thing with the worst possible score, which is minus 10.3. All we have to do after that point is take the raw score that we had in the first part, subtract the minimum score from it, and then divide that by the distance between the maximum and the minimum score. That gives us a fraction or a percentage for how close the raw score we got here is to the top score possible with this particular motif, which in this case is 93%. You see all these G's at the start, you see a T at the end, and those are going to make things pretty good. Next time I need to work on the highlighting for this, because that's kind of messed up. Any questions about this? Yes? Yeah, good question. How big an area are we searching? So you can search an entire genome for something like this. Now that we've converted things to log scale, it makes it a lot faster to do this sort of scoring. So if you use the human genome, which is about three billion base pairs, then you have to repeat this on the order of three billion times. So you start here, start here, start here, start here, sliding over by one each time, repeated over and over again. There are other ways of doing this where you might limit yourself to particular parts of the genome. I'll talk about that a bit more later. Oh, in this case, that is a good question. We'll get into that a bit later.
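The raw score, the relative score, and the sliding-window scan just described can be sketched as below. This is a minimal sketch: the three-column matrix `TOY_PWM` and its values are invented for illustration (it is not the SP1 matrix from the slide), the function names are hypothetical, and the sequence is assumed to contain only uppercase A, C, G, and T. The relative score is (raw - min) / (max - min), as in the lecture.

```python
ALPHABET = "ACGT"

# Hypothetical log-odds matrix: one list of column scores per base.
TOY_PWM = {
    "A": [1.2, -1.0, -2.0],
    "C": [-0.5, -1.0, 1.5],
    "G": [-1.0, 1.4, -2.0],
    "T": [-1.5, -0.2, 0.3],
}


def score_window(pwm, window):
    """Raw score: look up each base at its column and add the numbers."""
    return sum(pwm[base][col] for col, base in enumerate(window))


def score_range(pwm):
    """Worst and best possible raw scores, taken column by column."""
    width = len(pwm["A"])
    worst = sum(min(pwm[b][col] for b in ALPHABET) for col in range(width))
    best = sum(max(pwm[b][col] for b in ALPHABET) for col in range(width))
    return worst, best


def relative_score(raw, worst, best):
    """Where the raw score falls between the worst (0.0) and best (1.0)."""
    return (raw - worst) / (best - worst)


def scan(pwm, sequence):
    """Score every window of the motif width, sliding over one base at a time."""
    width = len(pwm["A"])
    worst, best = score_range(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        raw = score_window(pwm, window)
        hits.append((start, window, raw, relative_score(raw, worst, best)))
    return hits


HITS = scan(TOY_PWM, "TAGCA")
for start, window, raw, rel in HITS:
    print(f"{start}\t{window}\traw={raw:.2f}\trelative={rel:.0%}")
```

A sequence of length n yields n - w + 1 windows for a motif of width w, which is why a genome-wide scan repeats the same handful of additions on the order of three billion times for the human genome.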
But the key thing is that we can score everything. One way of representing this, and the way I usually tend to represent things in functional genomics in general, is in terms of a signal. At every base pair you can write down the signal, or plot the signal, for how good the match is at that point. And then if you want to limit things to, say, the top 100 sites, you can pick the top 100 from there. Another way you can do it is by turning this into a P value. Then you can take some sort of preset threshold, like maybe you want to look at things where the P value is less than 0.05. The way we do this is by looking across, say, the whole genome at the different relative scores we get and how frequent they are. To get this empirical P value, for every position we basically ask how big the area under the curve is to the right of the score at that position, adding up the frequencies of everything that scores at least that well. And so you can limit yourself to the top 5% with this sort of empirical P value approach. But we can also do more complex things, which I'll talk about in a little bit. Any other questions? Yes? So usually the way I think about this is in terms of not just looking at one sequence, but scanning across a whole range of sequences. So instead of interrogating one sequence, you might say, find all of the potential binding sites in the genome and rank them. There are on the order of 3 billion different possible binding sites, so rank them all from 1 to 3 billion. And which are the top 5%? So it's the entire genome, basically. What's that? It's kind of the entire genome. You can scan the entire genome. Yeah. Do you have a question? So how is this different from global alignment?
Yeah, so in general, one thing about the way this is usually done is that when people do alignment, in most embodiments, they allow gaps in the sequence. So you can have some sequence in one species and align it to another one, and the alignment algorithm will be able to deal with the fact that sometimes you'll have little insertions or deletions in one species or the other. You mentioned profiles. This is much less of a pairwise problem and much more of a profile problem, because essentially what the PWM is, is a profile that originally came from a set of binding sites. So this is a particular kind of profiling and scoring. There are other ways of constructing models for the particular kinds of motifs that you expect to see. There are things like profile HMMs, which will find particular domains within proteins from the amino acid sequence. For various reasons, people have found that this much simpler model seems to work well enough, or as well as can be expected, compared with using something more complex like a profile HMM. People have investigated more complex models for how the sequence predicts a binding site, but it's believed by many, including myself, that that's not the biggest problem in terms of understanding where transcription factors bind. Are there other questions? Yes? Yeah, you can do it both ways, definitely. The thing is, well, let me talk about this and I'll get back to your question. So I've shown you guys a bunch of these things, and it's relatively simple. Given that you have this sort of model, if you have this matrix, you can score any genomic sequence. If I give you a sequence, you can score it with this matrix, even if you only have pencil and paper. You definitely can.
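Once every window has a score, the two ways of thresholding mentioned earlier (keep the top N sites, or keep everything under an empirical P value cut-off) can be sketched as below. This is a minimal sketch with hypothetical function names and made-up scores: the empirical P value of a score is taken here to be the fraction of all scanned windows that score at least as well, which is one common reading of "the area to the right of the curve".

```python
from bisect import bisect_left


def empirical_p_values(scores):
    """For each score, the fraction of all scores that are >= that score."""
    ordered = sorted(scores)
    n = len(ordered)
    # bisect_left gives the index of the first element >= s in the sorted list,
    # so n minus that index counts the windows scoring at least as well.
    return [(n - bisect_left(ordered, s)) / n for s in scores]


def top_hits(scores, k):
    """Indices of the k best-scoring windows, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def passing(scores, alpha=0.05):
    """Indices of windows whose empirical P value is at most alpha."""
    return [i for i, p in enumerate(empirical_p_values(scores)) if p <= alpha]


# Made-up relative scores for eight windows.
TOY_SCORES = [0.10, 0.93, 0.40, 0.55, 0.93, 0.20, 0.35, 0.60]
P_VALUES = empirical_p_values(TOY_SCORES)
BEST_TWO = top_hits(TOY_SCORES, 2)
```

Note that with a 0.05 cut-off on 3 billion windows, roughly 150 million positions would still pass, so in practice the threshold has to be chosen with the size of the search space in mind.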
But where did this come from? I mean, I told you how you get these position frequency matrices, but if you're doing this experiment, where do you get your position frequency matrix from to start with? There are a variety of databases, and the most widely used one is called JASPAR. JASPAR is a collection of position weight matrices, so they've already been transformed from PFM to PWM. And these come from different kinds of data. There's a variety of in vitro data, things like protein binding microarrays, SELEX, and high-throughput SELEX, and also some in vivo data types as well, things like ChIP-seq. So if it's important to you whether the data you're getting is derived from in vitro or in vivo sources, make sure you check this when you use JASPAR. But JASPAR has hundreds of transcription factors and the position weight matrices associated with them. And so you can download JASPAR and use it. If you're interested in a gene, you can use it to scan that gene's promoter and see what potential binding sites you might have. If you want to go the other way around, if you have a transcription factor gene you're interested in and you want to know where it's binding, it's a little less simple. If your gene is amongst those 200 or so transcription factors that are in JASPAR, then sure, you can do the same sort of process genome-wide. But if you're among the 1200 or so other transcription factors that aren't there, then you have to actually do the sorts of experiments that I talked about, either high-throughput SELEX or protein binding microarrays or ChIP-seq, because then you will have to get that information yourself and hopefully publish it in a way that at some point someone will put back into JASPAR, and then other people will use it and cite your paper. Did that answer your question? Okay. All right.
So I'm going to return to this again. I've said it's relatively easy to calculate this if you have a set of sites: you just put them all on top of each other and then calculate how often you see things in particular columns. And I told you the different kinds of experiments that you can use to get them, but it's actually a little more complicated than that. One thing that we glossed over is the fact that whatever experiments we're using, they're unlikely to give us a set of binding sites lined up right on top of each other in the exact way that we want. So here let's switch briefly to the protein amino acid motif problem instead. It's basically the same problem, and we use position weight matrices, but you have an alphabet of 20 symbols instead of four, one for each of the amino acids. So we have a problem, which is: if we're given three sequences and we hypothesize that there is a motif in common, a short motif, in these three sequences, how do we find it? And let's say this is the answer: we have these three sequences, and they all contain stretches that are very similar to each other, but each of them sits within a much bigger sequence. So we have to find this needle in this haystack somehow. We can rephrase this a little more formally. If we're given a sequence or a family of sequences, we want to be able to find any motifs in common between them, and we want to be able to find out how many motifs there are. We might not know the width of the motif, so we want to know the width of each of those motifs, and we want to know the locations of all the motif occurrences. And this is actually pretty difficult, because the input sequences can be really long, I mean thousands or millions of base pairs or residues, and a motif might be fairly subtle, right?
So if we wanted to actually find exact matches, if we wanted to find all of, say, the words eight characters long that are exactly the same between several sequences, this is actually fairly, I won't say it's an easy problem, but it's a straightforward problem. You can go into the computer science literature and find solutions to this problem that are very, very fast. It's the degeneracy of biological motifs that, in part, makes this challenging: the fact that you have these sequences that are similar, but they aren't exactly the same, and how much change you're going to permit differs from column to column. So to switch back to transcription factor binding sites, I'll give you a little short example here. You were asking about the sorts of settings in which we might use this. Actually, more likely than looking genome-wide is that we have done some sort of experiment where we know a set of genes that are tied together in some way. For example, you've done a gene expression experiment, and you know that these genes are upregulated and these genes are downregulated. Can we find a set of motifs in common amongst, say, the co-regulated gene promoters? So we can quickly move ahead: I'll show you the answer first, and then we'll go back and talk about how we can get it. There's a motif here, which is AAAGAGTCA, and you find it a few times, and you also find it a couple of times in reverse. As the reverse complement, starting from this side, it's going to be TGACTCTTT, something like that. And there are a few cases where you can see things are slightly different from each other. We need to find all of these things without having them pre-specified for us as I've done here. In the real world, they aren't pre-marked in blue.
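The point about finding the motif "in reverse" can be made concrete with a small sketch: a site on the opposite strand shows up in the sequence as the reverse complement of the motif. The helper names and the test sequence below are hypothetical, and the search is for exact matches only, which is a deliberate simplification; real motif occurrences are degenerate and would be scored with a matrix instead.

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then read the result backwards."""
    return seq.translate(COMPLEMENT)[::-1]


def find_both_strands(sequence, motif):
    """Exact occurrences of a motif on either strand as (start, strand) pairs."""
    rc = reverse_complement(motif)
    width = len(motif)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if window == motif:
            hits.append((start, "+"))
        elif window == rc:
            hits.append((start, "-"))
    return hits


MOTIF = "AAAGAGTCA"  # the motif read out in the lecture
# Made-up promoter fragment with one forward and one reverse-strand copy.
FRAGMENT = "CCAAAGAGTCAGGTGACTCTTTCC"
HITS = find_both_strands(FRAGMENT, MOTIF)
```

The same idea carries over to matrix-based scanning: score each window against the matrix and against the matrix applied to the reverse complement of the window, and keep the better of the two.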
So the way that this is typically done is using something called an alternating approach. Alternating approaches in general are quite common for fitting probabilistic models, which a position weight matrix is, right? A position weight matrix is something that gives you a probability of binding, basically. And if you want to actually come up with the parameters for that model, one way we do that sort of thing is we simply guess. We guess and then we refine our guess. And we might do that a bunch of different times and see how good the guess is at the end. We use alternating approaches when the problem space is too big to search exhaustively. If we were using exact matches, or even things like the consensus sequences I showed you before, where the choices are, say, A or C or G versus A or C, we could consider these things exhaustively. But in this case, we're going to give a little score to every match of a symbol and a column. That score can change ever so slightly, and that can have a big effect on the probability, so we can't consider it exhaustively. So an alternating approach is used instead. Starting from our guess, we take a guessed weight matrix and use it to predict instances in our input sequences. So we see how well our guessed matrix matches the input sequences. Then we recalculate the weight matrix as if those instances were given to us as a pre-aligned set of sites, as I showed you in part two. And then we repeat this over and over and over again. There are a variety of different alternating approaches. I'll show you one in a little more detail here, called the Gibbs sampler, because it's easier to explain.
So here's one thing we might do with a Gibbs sampler. Instead of just guessing a weight matrix from whole cloth, for each one of our input sequences we randomly pick a subsequence and take those as an initial set of sites to calculate a position weight matrix from, right? We've picked these totally randomly, so there's basically very little chance that these randomly picked sites have anything to do with each other. But what we can do is refine this and repeat the process over and over again. And as we go through the process, if we have actually found a set of sites that match some sort of interesting motif, the score will continue increasing. So this set of sites, if we take out the fourth row, will give us this position frequency matrix. We'll skip the conversion to a position weight matrix just to make things a little simpler here. And then using this matrix, we can score sequence four, the one we've dropped out. If we do that, this is the part that best matches this weirdo matrix that we basically guessed, that came out of nowhere, that was random, right? And then there are some other places that might match it as well. So this has 52% probability, this has 20% probability, and so on. One thing you could do is simply take the place with the highest probability and stick to that. That will probably get you to some sort of local maximum, but it's not going to be the global maximum for a motif that can be scored in these sets. So instead, we will treat all of these as probabilities and pick a new subsequence among these places with weighted probability, right?
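The loop just described (random starts, hold one sequence out, build a matrix from the rest, score every window of the held-out sequence, then draw a new start with weighted probability) can be sketched in Python as below. This is a bare-bones illustration under assumptions, not any published tool: the pseudocount, the uniform background, cycling through the sequences in order, the fixed iteration count, and the example sequences are all choices made here.

```python
import random

ALPHABET = "ACGT"

def column_probabilities(sites, pseudocount=1.0):
    """Per-column symbol probabilities from aligned sites, smoothed with pseudocounts."""
    probs = []
    for i in range(len(sites[0])):
        counts = {symbol: pseudocount for symbol in ALPHABET}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        probs.append({symbol: counts[symbol] / total for symbol in ALPHABET})
    return probs

def window_weight(window, probs, background=0.25):
    """Likelihood ratio of a window under the motif model versus a uniform background."""
    ratio = 1.0
    for symbol, column in zip(window, probs):
        ratio *= column[symbol] / background
    return ratio

def gibbs_sample(sequences, width, iterations=2000, seed=0):
    rng = random.Random(seed)
    # Initial guess: one random start position per sequence.
    starts = [rng.randrange(len(seq) - width + 1) for seq in sequences]
    for step in range(iterations):
        held_out = step % len(sequences)  # cycle through the sequences in order
        others = [
            seq[pos:pos + width]
            for i, (seq, pos) in enumerate(zip(sequences, starts))
            if i != held_out
        ]
        probs = column_probabilities(others)
        seq = sequences[held_out]
        weights = [
            window_weight(seq[pos:pos + width], probs)
            for pos in range(len(seq) - width + 1)
        ]
        # Weighted draw instead of taking the maximum, so that lower-scoring
        # positions are still visited some of the time.
        starts[held_out] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Hypothetical input sequences, each containing a variant of AAAGAGTCA.
sequences = [
    "CCGTAAAGAGTCATTGCA",
    "TTAAAGAGTCAGGCCGTA",
    "GCGCGTAAGGAGTCATCG",
    "ATAAAGTGTCACCGGTTG",
    "CGGCTCAAAGAGTCAGTA",
]
starts = gibbs_sample(sequences, width=9)
print(starts)
```

Because the draw is random, different seeds can end in different places, which is why in practice you would run it several times and compare the scores.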
So there's a 50% chance we'll pick here, but there's a 20% chance we'll pick here, and there's still, you know, a 0.5% chance that we'll pick somewhere here on the margin. All of these different things are possible, which means we are going to consider a large number of paths, but those paths that seem to have the highest probability will be considered more. Does all of that make sense? So we do this. We rolled the die, the die came up six, so we picked this region instead of this region. We put this sequence back in here, and then we repeat the same process. You had a question? Yeah, that's a good question: why not start from the beginning and then scan the whole thing? The problem is that we don't have time to do that. You could start from the beginning here and then try every position here, here, and here, but let's say each one of these sequences is 5000 base pairs long, right? The number of different potential combinations of starting places you're going to get is 5000 times 5000 times 5000 times 5000 times 5000. Pretty soon that's a really, really big number and something that we can't do within the lifetime of the universe. So that's essentially why we have to try to make the problem easier by guessing, and hope that the way we weight probabilities means that we are more likely to get to the right answer than if we just randomly picked things without weighting in subsequent steps. Yes? So it could be random. You could do it randomly. You can also cycle through these, right? So you might do four and then five and then one and two and three, and repeat that every so often. You might do it randomly.
You might do it in a way where you have some sort of weighting as to which one you pick, or you might do the one that matches things the worst, or matches things the best, or second worst. So there are a variety of ways you can deal with it. At a certain point, this kind of becomes... sorry. While scoring a particular region against a motif is an exact science, the process of getting those motifs in the first place is a bit more of an art, and there have been a variety of different approaches that people have tried over the years. At some point, I think people stopped investigating some of these things and said, yeah, it probably doesn't matter very much exactly which site you take out, which one you pick, and so on. Other questions? Yes. No, let's talk about that later. Yes? You never know that. Yeah. So that's the problem with alternating approaches, right? You can never do it exhaustively. Basically, there's a huge sample space to consider, and you're trying to sample as much of that space as you can. You want to do it in a way where you will hopefully spend more time in the part of the sample space where the highest scoring motif is more likely to be. But there's always the possibility that instead of finding a global maximum, you will find some sort of local or regional maximum. And the most you can do is keep trying at this. However, if you do something like this approach and you set it off on your compute cluster and run it for a few hours, you'll find your highest scoring motif gets to something like this, right?
And then you run it for another 12 hours and your highest scoring motif goes up in score by something like 0.001% over the previous high scoring motif. You're probably okay to stop there. It probably doesn't matter to keep going further. Any questions? All right. So now that you have a motif, what do you do with it? If you're doing de novo motif discovery, as I've described here, the first thing I usually like to do is use a tool called Tomtom. Tomtom is a way of comparing a motif against an existing set of motifs, such as JASPAR, and it simply tells you which known motifs match your motif. It's very exciting to do these sorts of analyses, and you find something that looks like a really strong motif in your co-regulated gene promoters, right? And then you look through the database and it's like, okay, well, it matches some existing motif very, very well. Which isn't totally a letdown, because now you have a hypothesis about what might be involved in the co-regulation. But it's relatively rare for people to come up with truly novel motifs from this sort of thing, that is, from something like a gene regulation analysis. If you're actually doing something specifically targeted to a transcription factor, like ChIP-seq for that transcription factor or SELEX for that transcription factor, of course, it's not rare to find something new at all, if no one's done it before. Okay. So this is the point where we start to talk about, you know, I've given you this lovely mathematical model. How good is it? In many ways, it's very good. People did some experiments around 1997, 1998, when this stuff was being developed, right?
They used various transcription factor binding sites, various motifs that had been discovered using the methods I just told you about. And they found that most of the predicted sites were bound in vitro. They found that the PWM produces a score that's very highly correlated with in vitro binding energy, right? There's a phrase that I keep using over and over again. Watch out for scientists when they keep saying a phrase over and over again before all of their claims, and start wondering whether that phrase is essential for the claims. Can anyone name the phrase that I keep mentioning? Anyone? Did you hear it? Is there a phrase that you see in both of those bullet points that might qualify these results? In vitro. In vitro. I keep saying in vitro, right? So instead of looking in vitro, if you look at, say, a whole genome sequence, and you look for, say, MyoD transcription factor binding sites, you might find a good match once every 500 base pairs of human DNA sequence. Which means a lot of sites per gene, for a lot of different transcription factors. And then here's the ugly part. Here's one gene, actin. Scan it with a set of binding profiles, and you find that there's a transcription factor predicted to be bound pretty much everywhere, right? And this is with only 100 to 200 profiles for transcription factors, when there are actually something like 1400. So if you continued this, it would be totally saturated, and you would find that there are transcription factors absolutely everywhere. Which is not actually the case. So people say that transcription factor binding site predictions are almost always wrong when you look in vivo, right? There's a reason I do this lecture in this order. I tell you about all this stuff, and then I tell you that it's all wrong.
I promise, when I first started doing this, I didn't do it in that order, and people were very confused. So the first thing people think about is, okay, why don't we just turn the score threshold up? Let's say you were using a p-value threshold of 0.01 to decide whether something is a good binding site. Why don't you make that much more stringent, maybe look for a p-value of, I don't know, 10 to the minus 6? It doesn't help. Because the problem is not that the model is bad. The problem is not that we're not being stringent enough in deciding what should pass this model. The problem is that the model is designed for this simple in vitro case, where you have a piece of naked DNA and a single transcription factor, and it does a really good job of predicting where that transcription factor is going to bind on that naked DNA, right? But if you look in a eukaryote, it doesn't look like that. You've got lots of other stuff going on. All that stuff I told you at the very beginning of the lecture makes things more complicated. You have regions of open chromatin and closed chromatin. You have histones getting in the way. Worst of all, you have all of these other transcription factors that are cooperating and competing with each other. So if something is in a region of heterochromatin, most transcription factors aren't going to be able to bind it. They simply will not get to it. And so these are the things that you have to consider. This was kind of where people were on this problem a couple of years ago, and they came up with some solutions to it. But in the last two years, people have realized that it's actually even worse than that. So we have a futility conjecture: most of these predicted binding sites are wrong.
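To put rough numbers on the counts mentioned above, here is a back-of-the-envelope calculation in Python. The genome size of about 3.1 billion base pairs, the choice to scan both strands, and the reading of the p-value as a per-position threshold are assumptions made here for illustration. The arithmetic only counts matches expected by chance. It says nothing about chromatin or other factors, which is the real reason a stricter threshold does not rescue the predictions.

```python
GENOME_BP = 3.1e9  # assumed approximate size of the haploid human genome

def matches_at_rate(genome_bp, bp_per_match):
    """Number of matches if a good match turns up once every bp_per_match bases."""
    return genome_bp / bp_per_match

def expected_chance_hits(genome_bp, p_value, strands=2):
    """Expected number of positions passing a per-position p-value by chance alone."""
    return genome_bp * strands * p_value

one_per_500 = matches_at_rate(GENOME_BP, 500)
loose = expected_chance_hits(GENOME_BP, 1e-2)
strict = expected_chance_hits(GENOME_BP, 1e-6)

print(f"one match per 500 bp: about {one_per_500:,.0f} matches genome-wide")
print(f"p < 0.01:             about {loose:,.0f} chance hits")
print(f"p < 1e-6:             about {strict:,.0f} chance hits")
```

Even the strict threshold leaves thousands of candidate positions for a single factor under these assumptions, and sequence alone cannot say which of them are occupied in a given cell type.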
So this is from a recent preprint from my lab on something called Virtual ChIP-seq. You look at all of the biochemical data from ENCODE, so a couple of hundred transcription factors, and you look at how often biochemically identified events of presence of a transcription factor contain the motif of that transcription factor from a database, at quite a loose threshold. You find that for some of them, like CTCF, it's more than 77% of them, right? But it goes down across the transcription factors. Marked in green are the ones where less than half of the biochemically identified events, where there's a transcription factor present, actually have the motif. And it's even worse than that. Here's one of the examples, MafK. MafK has a pretty good match between the motif and the actual binding sites. So right here, here's MafK: almost 75% of them have it. If you look at the peaks and you aggregate over all of the peaks, generally you find the motif near the center, right? And there are other things like, what is this, ATF3, and ATF3 is basically a flat line. There's basically no connection between the motif and all the biochemically identified positions. So we call this the dual futility conjecture, because it's kind of like the futility conjecture. The futility conjecture says most of these transcription factor binding sites you predict aren't real. The dual futility conjecture says that a lot of the things that are real don't have a good match to a transcription factor binding site motif. So how are they real? That's what I don't understand. How are they real? Yeah. So how do I know that they're real binding sites? There's a limit to how much you, you know, now we start getting into philosophy and epistemology: how much do we really know about anything?
Which is an important thing to consider when you're doing stuff that is complicated and has a lot of places for noise to slip in, like, for example, the ChIP-seq data that I'm using to conclude what is real and what is not, right? It's not exactly the same thing as looking through a microscope. If I'm doing light microscopy and look at something, I have a high degree of confidence that what I'm looking at is probably actually there and not some sort of weird artifact. ChIP-seq, which is how we do this, has weird artifacts. Among other things, the biggest source of artifact here, and this is another one of those things I was warning you about, scientists switching their vocabulary and starting to use particular words: I stopped using the word binding and I started talking about transcription factor presence. That's because ChIP-seq doesn't really measure binding per se. It measures whether a transcription factor is there. You can have your piece of DNA with the factor directly on it, and I call this binding, right? But sometimes the transcription factor is just nearby. Maybe it's interacting with another transcription factor, which is interacting with another transcription factor, which is the one actually binding, and ChIP-seq might pick that up. The problem is we do not, as of yet, have data that allow us to clearly distinguish these things in most cases in vivo. And in most of these cases, this is the best biochemical data that we have on where transcription factors are actually binding. What's more, often it's the data that's used to define these motifs in general.
So I think that a lot of the limitations that we have today are not actually limitations of computational methods or of the way we do mathematical modeling. They're limitations in the sort of data that we use to build these models and to benchmark these models in the first place, right? They're very noisy data. People make assumptions about what the data mean, assumptions like 'this is binding', when in reality those assumptions aren't met. These are also experiments on bulk cells, right? You have millions of cells, so you can find things that are happening some of the time and not happening other times, and it's hard to decompose those things as well. So this is the depressing part. I don't know about for you, but it's depressing for me. I'm the one who said that this whole problem, the thing at the beginning, is what my lab focuses on, what I've dedicated my career to, and I've basically told you that it's impossible today. So what have we learned from the wreckage of my research program? Just kidding. PWMs can accurately reflect the in vitro binding properties of DNA-binding proteins, right? But matches occur far too frequently to reflect in vivo function. Binding can occur without a strong motif, and a strong motif doesn't necessarily mean that binding occurs in vivo. And here's why my research program, and those of everyone who works on similar problems, is not actually dead: we need to use additional information to help us distinguish between these cases. You aren't going to do it with a model of sequence alone. You need something else. So you mentioned genetic conservation. That might be one thing that we can incorporate. We can incorporate a bunch of other things too. We'll talk about this. Any questions? Okay, moving on. We will get to some of the other things that you can incorporate in a later section.
But let's talk about the simplest thing that you can do, which is to stop treating these individual potential binding sites alone and start seeing things in terms of a larger experiment. That way you can amplify the signal that you might have over a large number of regions. So we go back to this question that I posed earlier: how can you find out which motif is potentially responsible for a set of co-regulated genes? One thing you can do is look at how often you find that motif in the whole set of co-regulated genes, compared with a set of negative controls. I've told you that it's quite likely that you'll find all these binding sites by chance when you look at the genome alone, and it's hard to interpret one versus another, right? But if you find something like all of your genes have one or two good matches to this motif, and all of your negative controls have zero matches, that is actually kind of a phenomenal result. It's so phenomenal that you will never see it in real life. But you can set up a statistical method that will tell you how close you are to something like this, or how close you are to just getting what you'd expect by chance. iRegulon is a method that can do stuff like this. I'm not going to talk much more about iRegulon now because Veronique is going to teach it to you in a few minutes. But among other things, iRegulon addresses this problem across a whole set instead of looking at one site at a time. Another thing that you have to contend with is that sometimes you might have a good idea that some transcription factor is binding a set of motifs, but you have to deal with a lot of motifs that are actually very similar to each other, because there are specific families of transcription factors where the motifs are only very subtly different. And there are tools for that, like ToppGene.
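As one simple way to ask "how close are we to the phenomenal result" for the co-regulated set versus negative controls, here is a hypergeometric tail test sketched in Python. This is a generic enrichment statistic chosen for illustration and is not a description of how iRegulon works internally. The counts in the example are made up.

```python
from math import comb

def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k): chance of at least k motif-containing promoters among n
    co-regulated ones, when K of the N promoters overall contain the motif."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

# Hypothetical counts: 20 co-regulated promoters, 15 with a motif match;
# 180 negative-control promoters, 30 with a motif match.
coregulated, coregulated_with_motif = 20, 15
controls, controls_with_motif = 180, 30

p_value = hypergeom_upper_tail(
    k=coregulated_with_motif,
    K=coregulated_with_motif + controls_with_motif,
    n=coregulated,
    N=coregulated + controls,
)
print(f"enrichment p-value: {p_value:.3g}")
```

No single promoter here is interpretable on its own, but the aggregate over the whole set can still be far from what chance would produce.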
If you're experiencing this problem with ETS genes or, say, GATA genes, I would encourage you to look into ToppGene, which can be used for prioritizing things. Now we can talk about incorporating some of the other information that we might have. So moving beyond just sequence, what sort of other biochemical data do we have available to us? A lot of what we have is data from, say, the ENCODE Project and things like it, which you can add to what you have here. ENCODE collects data for a number of different cell types. Between ENCODE and Roadmap and other International Human Epigenome Consortium data sets, there are a few hundred different cell types that you can look at. You can download various kinds of data sets for the cell type that either matches what you're doing, if you're lucky, or something that is close to it, if you're less lucky. Or if you're very unlucky and you don't have anything close to it, then you might want to download a number of different data sets and see if you can find things that are constitutive biochemical events across a lot of different cell types. The problem is that ENCODE has on the order of tens of thousands of experiments and data sets to download. You probably can't download it all to your workstation, and if you try to download it all to your cluster, if you have one, your sysadmin will be very annoyed, right? So we need to reduce this data somewhat. One way you can do this is using methods for what is called semi-automated genome annotation. Segway is one of these methods, where you take a lot of data sets that are defined across the genome, say a ChIP-seq data set defined in terms of genomic signal, and find patterns in those. So, different tracks being high or low at different places in the genome, right?
And use that to automatically annotate the genome with functional categories such as gene start, gene middle, gene end, enhancer, and so on and so forth. Say you have some sort of prior knowledge from these experiments about a particular region, like this region is the middle of a gene and this region looks like it might be an enhancer, and you want to find out what might be responsible for control of that gene. Then you want to focus on the part that you already suspect is kind of like an enhancer, right? So this lets you do stuff like that. It also lets you make comparisons in aggregate, doing the same sorts of calculations across, say, all of the enhancer regions. So you can look for your motif matches in all of the enhancer regions and all of the gene middle regions for one cell type, compare that to some reference cell type, and again look for signal by comparing an aggregate set against some control in the other cell type. These are all things that you can do. So if you zoom in here, here's a little example of what Segway can give you. Here's a small region of the genome. This is the sort of signal that I've been talking about. These are several different histone modification ChIP-seq experiments, and open chromatin here. Basically, Segway takes these and a bunch of other experiments like them and boils them down to this one representation for this cell type, which says this is regulatory, this is TSS flanking, this is a potential enhancer, and so on. You can get a map like that for almost 200 different cell types. You can download it, you can go to segway.hoffmanlab.org, and there are ways to view this in the UCSC Genome Browser. So here you can see 14 different cell types, and you can see which regions are repressed and which ones are active in a particular cell type, right?
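Focusing a motif scan on the regions an annotation already flags can be as simple as an interval overlap filter. The Python sketch below assumes BED-style half-open coordinates. The label names, coordinates, and function names are hypothetical and do not reflect the actual label set of any particular annotation file.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """Half-open intervals [start, end) overlap if each starts before the other ends."""
    return a_start < b_end and b_start < a_end

def hits_in_label(hits, segments, label):
    """Keep motif hits that overlap at least one segment carrying the given label."""
    kept = []
    for chrom, start, end in hits:
        for seg_chrom, seg_start, seg_end, seg_label in segments:
            if (seg_label == label and seg_chrom == chrom
                    and overlaps(start, end, seg_start, seg_end)):
                kept.append((chrom, start, end))
                break  # one overlapping segment is enough
    return kept

# Hypothetical annotation segments: (chrom, start, end, label).
segments = [
    ("chr1", 0, 1000, "Quiescent"),
    ("chr1", 1000, 1400, "Enhancer"),
    ("chr1", 1400, 2000, "TSS"),
]
# Hypothetical motif hits: (chrom, start, end).
motif_hits = [
    ("chr1", 150, 159),
    ("chr1", 1200, 1209),
    ("chr1", 1395, 1404),
    ("chr1", 1700, 1709),
]
enhancer_hits = hits_in_label(motif_hits, segments, "Enhancer")
print(enhancer_hits)
```

For genome-scale files a dedicated interval tool would be the sensible choice. The nested loop here is only meant to show the logic.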
So this is the beta-globin locus, which, as you probably know, is very important in blood cells, right? You see this big red region for K562, which is one of the blood cell lines that have gone in here, whereas you see that the beta-globin locus, sorry, alpha-globin locus, is repressed in various other cell types. And what went into this was approximately 200 ChIP-seq experiments, and you can see it in a very, very compact way. It gives you an idea of what's going on in a whole range of different cell types. So that is something that you can use to pre-focus where you look for transcription factor binding sites. Another thing that you might be interested in is GREAT. If you have a set of genomic regions that aren't necessarily related to a gene and you want to find out what they have in common, you can do the very same sort of gene set enrichment analyses that I expect you talked about in the previous two days. But instead of working with a gene set, you can work with a set of genomic regions. So GREAT will take any BED file as input and give you enrichment measurements as output, and you can see, for a list of regions, what sort of molecular functions or biological processes they might have in common. GREAT has been useful for many years. My lab has actually come out with what we consider to be a successor to GREAT that you can now use instead, which is called BEHST. We wanted to skip over 'greater', because we thought someone would just one-up us further, so we went straight to BEHST. Anyway, you can try that at some point. I should add a slide on this, but it's brand new stuff and we don't have a paper out yet. So you basically provide the software with the BED files for each sample? Yeah. But it's for something where you have some sort of assay, right?
Like, let's say you're doing ChIP-seq, or you have some sort of enhancer assay, or some other assay that gives you a variety of genomic regions, and you want to find out what they are. One thing you could do is a motif scan, and another thing you can do is use something like GREAT to look for enrichment of various gene sets. But you have to be focused on the specific areas you want. You need a set of regions, right? If you have something that gives you one area, then you should probably scan for motifs, look at what the Segway annotations are in the genome browser, and so on. But if you have a thousand regions, it's harder to do that and get an accurate statistical summary at the end. Is this suitable for a GWAS study? You could try this. GWAS is complicated, of course, because when you get a signal in a GWAS, it doesn't necessarily mean that the region of the genome where you've gotten the signal is causally responsible for whatever phenomenon you're looking at. It's often something that's nearby, right? To some extent GREAT will smooth out that sort of positional uncertainty, and to some extent it won't. And the one thing it really won't do is look for 3D genome interactions. It only looks for things that are kind of upstream and downstream, and that's basically where BEHST comes in: BEHST adds 3D genome interactions to GREAT. That's right. If you had, say, an ATAC-seq experiment, GREAT or BEHST would be a much better thing to use for that. One other thing that I've talked about is transcription factor to transcription factor interaction. These things can cooperate or compete with each other, and there are tools like SpaMo. SpaMo will look for spaced motifs, so you can find, say, whether there are a couple of transcription factor motifs that tend to occur together, spaced by a particular amount, right?
So if you have two motifs and you always find them in your set of co-regulated genes 20 base pairs away from each other, that might indicate that you have these two transcription factors interacting with each other in such a way that there's a physical 20 base pair space in the DNA between where they each match, right? And that's a much stronger signal than, say, just finding that you have binding sites for those two transcription factors occurring in the set of co-regulated genes. But I've promised to get to more about the biochemical context, which really matters. Again, if we go back to the relative motif scores: if you look at a set of ChIP-seq binding sites that are biochemically validated, you'll find a wide range of motif scores for a particular kind of transcription factor, right? And this can change depending on the transcription factor. You might find different patterns of how close the transcription factor binding site motif is to the middle of the peak. These are things that people are starting to consider more and more. Another thing that people are starting to use is information from allele-specific binding sites. If you have ChIP-seq on a heterozygous genome, you can often get differential signal between the different strands, sorry, the different copies of the chromosome. And that gives you something that has some built-in control for noise. This can give you much more confidence that a particular binding site is actually real. It will also give you more confidence that a particular change in a motif has the sort of effect that you think it does, or that a particular position is important or not.
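The allele-specific idea can be illustrated with an exact binomial test on read counts at one heterozygous position: if both copies of the chromosome were occupied equally, reads should split about evenly between the two alleles. The Python sketch below is a deliberate simplification, with made-up read counts. Real analyses also have to deal with things like reference mapping bias and overdispersion, which this ignores.

```python
from math import comb

def binomial_two_sided(k, n, p=0.5):
    """Exact two-sided binomial p-value: total probability of all outcomes
    that are no more likely than observing k successes in n trials."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(value for value in pmf if value <= threshold))

# Hypothetical ChIP-seq read counts over a heterozygous SNP inside a peak.
ref_reads, alt_reads = 30, 10
p_imbalance = binomial_two_sided(ref_reads, ref_reads + alt_reads)
p_balanced = binomial_two_sided(20, 40)

print(f"30 vs 10 reads: p = {p_imbalance:.3g}")
print(f"20 vs 20 reads: p = {p_balanced:.3g}")
```

Both alleles sit in the same cells and the same experiment, which is what makes the comparison a built-in control.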
And there are a variety of tools that people are developing to examine, say, what sort of effects you might have from non-coding variation, whether in a population, or changes between species, or changes within a heterogeneous tumor. So there's a wide variety of these. DeepSEA is one, and they have a server that you can use. As part of a recent review on deep learning in biology and medicine, we discuss a lot of other methods like this, and you can try those as well. And finally, another thing that people have started to realize is an important part of this puzzle is DNA shape. We always like to look at these double helices. If you look at the models that people usually use when they're putting something up on a PowerPoint or in a textbook, it's a rigid double helix, and it looks identical all the way down with no little changes. Very much unlike, say, a protein or a peptide, where you have lots of changes and curving around, or even RNA. Well, that's not reality. In reality, even though a DNA double helix is fairly rigid compared to RNA or a peptide, it's going to have lots of little changes that you can see and that are brought on by different sequence patterns in the DNA. And since transcription factor DNA recognition is very much shape-related, these can have a big effect on whether the transcription factor is going to bind at a particular position. So, some of the things that we're talking about here: these are the four most important features, and there are about 14 others. There's minor groove width. There's roll. If you have stacked base pairs in the DNA double helix, roll is to what extent they are straight on top of each other like this, and to what extent they kind of turn this way. They roll together. And you have things like propeller twist. You have two bases opposing each other. Sometimes they'll turn a little this way. Sometimes they'll turn a little the other way, right?
These are things that people can measure, people can predict. And helix twist is when you have base pairs stacked on top of each other, how they move this way. There are a lot of different degrees of freedom. There are tools, and a lot of this has been developed by Remo Rohs's lab, that you can use to predict, say, what degree of propeller twist a particular position will have. And now people are developing models that incorporate this sort of DNA shape into their prediction of whether there's going to be a transcription factor bound at a particular position. So using not just information from raw sequence, but also this DNA shape information. So there are a lot of changes, sorry, challenges ahead. So, you know, first is that we've only really studied a few hundred of these transcription factors in humans. There are probably about 1400 to 1800 or so transcription factors. And it's hard for us to understand how all of gene regulation works when we don't even have all of these. All right. Another thing: people really want to understand what sort of effects non-coding variation is going to have on gene regulation, and that can be very challenging. Sometimes you'll have a lot of changes, right? And some of these can be quite subtle in terms of any phenotype you can measure. People are developing lots of methods to try to decouple these things a lot better. People want to do a lot more to integrate context, things like evolutionary conservation, things like DNA shape, things like biochemical context, things like where there's open chromatin and so on. And, you know, I think the biggest area right now for doing prediction of transcription factor binding sites is what sort of other data sets you have available from references like ENCODE, and how you can incorporate them. And the final thing is, you know, this matrix model, it's very simple, but it's also kind of dumb in some ways, right? 
So it doesn't let you incorporate the information that you might, say, never see two symbols next to each other, right? It only incorporates the frequency of each of these symbols at each column, right? But let's say you have a high chance of seeing T here, in column four, and a high chance of seeing T in column five, but you never ever see T in columns four and five together. These models can't capture that, but people have been publishing other models that can, and what remains to be discussed is how important that is. So, again, I'll come back to this. There's a lot of complexity. This complexity is what we're trying to understand. There are going to be a lot of people taking reductionist approaches to understanding different aspects of that complexity, and also people doing things like throwing a bunch of different data sets into machine learning methods and trying to see if the machine learning method will figure out what is most important and what isn't. I think the future is going to be a combination of those things and better data. So I'll just close with these reflections. So first is the futility conjecture. Second is that to get over the futility conjecture, we need to use additional data, and that's the biggest area of research in this field at the moment. So with that, I will take any other questions you have. Anything else? Other questions? I get very excited about transcription factor binding, so I'm sorry if I was too excited about it. It's not everyone's cup of tea. You know, I have proteomics colleagues who ask, "When are you going to talk about what's really important, just the proteins?" Well, I'm not. Okay, if there are no questions, then maybe we should go to our break.