So good morning everyone. Please call me Wyeth; I'm from the University of British Columbia, as Anne mentioned. I've been teaching in the Canadian Bioinformatics Workshops on and off since 2002, so I am a big fan of the CBW. I think it's one of the great national institutions, and lots of your peers across the country and around the world have benefited from CBW, so I hope you've had a good couple of days here and that you've picked things up along the way. One of the challenges that comes with the third day of CBW is that there's a little burnout. You've been working pretty hard, I gather you were here late last night doing some work, and so the third morning is often the low-energy morning, where you're sitting there struggling to hold on and stay attached. So I will do my best to be entertaining as we go along here and tell you a few stories. You're going to notice that the slide deck came around from Michael Hoffman's version of this. Don't be too scared that I don't know what's in there, because I actually created almost all of these slides over the years. In the materials you've been provided, I've made just a couple of minor changes to most of the slides, so you won't see much difference. I've added slide numbers on the screen so that you'll see them, and I've got about four slides at the back of the deck, so it should be pretty consistent with what you have; there won't be much deviation from what you see. Okay, so what are we going to try to learn today as we go along? Hopefully, by the end of this, you're going to understand a little bit about how hard it is to predict transcription factor binding sites. And if you don't know what those are, you'll understand what transcription factor binding sites are along the way.
You'll be able to identify binding sites for known transcription factors, so you'll be able to take predictive models, apply them, and scan sequences for binding sites. And you'll be able to discover motifs in DNA sequences, so you'll have some idea of how you discover the patterns that are present in a set of DNA sequences that you think are regulated by the same factor. Before I dive into all of these details about how you understand and do these things, I need to ask a question that probably every instructor in the course has asked you, but I just need to get a sense around the room and connect to faces a little bit of a sense of where you're coming from. So I'm going to ask you to place yourself on the biology-to-computing spectrum, at one edge or in the middle; think about where you fall on that spectrum. How many in the room feel like they fall on the biological end of the spectrum? Okay. How many feel like they fall on the computing end of the spectrum? Okay. And how many feel like they're closer to the middle than either end? Okay, so that's useful for me. Thank you very much. Okay, so there's a series of parts to the presentation. I'm going to take you through an introduction to eukaryotic transcription for the biologists. This is going to be really boring, but you'll at least get my perspective on how the computational methods have been developed along the way. You're going to learn about the prediction of transcription factor binding sites; some of these methods date back 30 years, so they're not all terribly new in the bioinformatics space. You're going to learn about discovering novel motifs. You're going to get a warning about the reliability of these predictions and the challenges that come with them. You're going to learn about finding enriched transcription factors in co-expressed genes, and really in any gene set that you have, in terms of understanding potential regulation.
And then you're going to learn about incorporating extra information into the process. So let's think about transcription for a moment, and we're going to start with the incredibly oversimplified version. You can imagine a DNA sequence, and you can imagine that there's a transcription factor binding site over here, a sequence that a DNA binding protein is going to recognize. The DNA binding protein lands from outer space onto this particular transcription factor binding site. Remember, this is the oversimplified version. And it casts off a signal to the mother ship in space, which sends down a polymerase complex that lands magically nearby, or at least in relation to that site. And that RNA polymerase complex will then be launched off along the DNA and start transcribing and producing RNA. Now, for much of what happens in bioinformatics, that's the model that a lot of people were using, and for the biologists in the room, that's part of the reason there are some flaws and limitations in the procedures that are used. Okay, now, if we think about some of the complexities of this process, we think about not one transcription factor binding site, but lots of different transcription factor binding sites. And we think about those proteins that are coming in and interacting with the DNA as having a variety of target sequences that they can recognize. We can think about a core promoter, and for me, a core promoter is a sequence that's involved in the positioning and initiation of transcription. Oftentimes you're going to hear the concept of a transcription start site, the position where the polymerase starts producing RNA. But we know, from much of the work that the FANTOM projects have done in Japan, that it's really much more of a transcription start region.
So for about a third of gene promoters, you have a well-defined transcription start site, and for about two thirds of genes, you have sort of a fuzzy zone where the polymerase tends to launch from. So it's not as crisp and precise as you would have it be. You've heard me say transcription factor binding sites; that's a recognition sequence for a specific, sequence-specific binding protein. And then in terms of the regulatory regions, you can have positively and negatively acting regulatory regions. These regulatory regions can be close to the promoter, and those are what we think of as proximal. And they can be far away: they can be far 5' of the start, they can be within introns, they can be 3' of the gene, they can even overlap exons, though that's rare. So you can have regulatory sequences all over the place. And the difference between a proximal and a distal regulatory region is completely arbitrary; it's whatever people feel like at the moment when they describe those proximal and distal sequences. If we start thinking about this as more than a linear strand of DNA, we think about DNA being, oops, might get this back, we think about DNA as being in three dimensions, where these target sequences and the proteins that bind to them will be interacting with additional proteins that don't bind the DNA; they'll be forming these interactive events. Modern thinking says that these things are not assembling one big machine here, so these proteins are not all present at the same time, but perhaps they're visiting at different moments in time, stopping by and working. So we're still trying to figure out that dynamic, temporal aspect of what's happening at these sequences.
And you're going to see these nice chromatin structures coming off of here, which we used to sort of wave our hands at, and increasingly we have better insights into the chromatin characteristics that inform where the regulatory sequences are going to be found. Within the functional genomics space, we've had an increasingly large set of tools for profiling and characterizing regulatory sequences, and you'll hear me mention some of these types of technologies and tools along the way. But just to give you a little bit of exposure: we can think about things like ChIP-seq, profiling the modifications of the histones or the DNA that have occurred along the way. We can think about DNase-seq, which profiles where the open regions are, the regions that are more accessible, which tend to be associated with things that are functional within a given context or cell. You have RNA-seq profiling, which you've heard about along the way here, in terms of profiling the activity of the RNA. And at the sort of linear level, those things define specific regions. And then, not shown on the slide, you increasingly have technologies like chromatin-capture measurements, which describe the relationships across longer distances, such that certain sequences are interacting with certain other sequences in that space. And within that, sorry for my sensitive fingers here, you start defining regions called TADs, or topologically associating domains. These TADs increasingly define zones within which you think the cis-regulatory regions are connected to promoter regions, so you can begin to use those TAD boundaries to define the zones that you pay attention to. The challenge that comes with working with that type of information is that it's still very large scale.
So you'll often be talking about megabases of sequence with those types of information. So the work of the field over the last 10 years or so has been generating boatloads of data with these types of profiles and characterizations. The ENCODE Project is the most prominent of these, though there are a number of other large-scale projects. From these projects, we now have sort of a compilation of resources, within some cells, that describe where regulatory regions are. You can access this information online; you've heard about some of these types of resources along the way and used them a little bit in the workshop. The ENCODE Project has a page, the UCSC Genome Browser is rich with most of this information, and the repositories have a lot of the sequence available to you. So you can get your hands on a variety of these types of data, and we'll come back later in the presentation and talk a little bit about some of the tools that incorporate this range of data within the tools that you work with. Okay, so now for the computer scientists in the room, you're going to get the insultingly simple version of things, just as I insulted the biologists with the simplicity of the regulatory sequence information. We're going to think about how we teach a computer to predict a transcription factor binding site. If you think about where binding sites come from, usually some poor student has worked through the lab and identified some sort of sequence that a transcription factor recognizes, or, increasingly now, some genomics lab has taken a transcription factor and done extensive profiling in an in vitro type assay where they identify sequences that a protein will bind to. So you can start with a single site; you can represent that as a DNA sequence. If you do enough, or you do one of these large-scale experiments, you might have a large set of such sequences that the protein will bind to.
In the ancient, ancient days, you might represent that using an IUPAC consensus sequence, which tries to convert all of that into some sort of string that you can use. But that doesn't capture the quantitative characteristics of the site. So most of what the field still uses, though there's increasing interest in moving beyond this, is to convert it to a matrix model. In the matrix model, all you're doing is taking an alignment of all those binding sites, counting up how many times you see A, C, G and T at each position, and putting that into the matrix. So if you look at the first column of the sites, hopefully it matches up to the counts in the first column of the matrix. That's the concept: you just count how many times you see each base. A frequency matrix is convenient; it's a convenient representation to spawn off a number of other representations of the information. The one that you're going to see most commonly is a sequence logo. A sequence logo is a visual presentation of the information content of that matrix that you saw. All it's doing is saying: okay, think about DNA sequences as having two bits of information at each position. A bit of information means that you get an answer to a yes/no question. So in an ACGT universe, if you can ask two yes/no questions, you can define what the base is. For instance: is it a purine? The answer is yes. Then you can say: is it an A? The answer is yes. So with two yes/no questions, you can determine which nucleotide is there. Quantitatively, you can compute information content measurements, and you can make a mathematical conversion of this frequency matrix into an information content matrix. Then you can just plot it according to height; you'll see these sequence logos all over the place, as many of you probably have along the way. But it's just a nice tool.
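The counting and the information content conversion described above can be sketched in a few lines. This is a minimal illustration, not any particular tool's implementation; the binding sites here are made up for the example.

```python
import math

# Sketch: build a nucleotide frequency matrix from aligned binding sites,
# then convert each column to information content (bits), the quantity
# plotted as letter heights in a sequence logo. Sites are illustrative.
sites = ["TGACTCA", "TGAGTCA", "TGACTCG", "TTACTCA"]
bases = "ACGT"
width = len(sites[0])

# Count how many times each base occurs at each position.
counts = [{b: 0 for b in bases} for _ in range(width)]
for site in sites:
    for pos, base in enumerate(site):
        counts[pos][base] += 1

# Information content per position: 2 bits minus the Shannon entropy
# of the observed frequencies (assuming a uniform background).
n = len(sites)
ic = []
for col in counts:
    entropy = -sum((c / n) * math.log2(c / n) for c in col.values() if c > 0)
    ic.append(2.0 - entropy)

print(counts[0])                      # all four sites have T first
print([round(x, 2) for x in ic])      # 2.0 bits at a perfectly conserved column
```

A perfectly conserved column (all T here) carries the full 2 bits; a column split across bases carries less, which is exactly what the logo's letter heights show.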
The other conversion that we do with these frequency matrices is that, when we actually want to use these matrices to score a potential binding site for a transcription factor, we mathematically prefer to work in a log-odds system, because it makes our mathematics a lot easier in terms of determining the reliability of scores. So what we usually do is convert a frequency matrix into a weight matrix. Sometimes they're called position-specific scoring matrices; a position-specific scoring matrix and a position weight matrix are exactly the same thing. A position-specific scoring matrix is often called a PSSM, so if you hear PSSM, it's the same as PWM. So we take the frequency that we observe in the matrix, for instance the five at the first position for A, and we adjust it for the background frequency of the bases. If you have a genome, most of the time we assume that it's something like 25% A, C, G and T. But if you have a genome that is particularly A-rich, you might adjust your background frequency to account for the fact that an A is not that surprising to you. People have done some nuanced things where they try to adjust across the genome and do all sorts of funny things with that, but basically, most of the time you'll see an assumption of 25%. Then we add a pseudocount value; that's the little s in the equation. The high-and-mighty language is that we're adjusting for our confidence in the data that we see: we don't think we have enough sites sampled, so we don't believe we've got a perfect representation. The not-high-and-mighty language is that we can't take a log of zero, and so we have to stick something in there for the zero values along the way. There are all sorts of different versions of how you add that pseudocount value, so different matrix databases will use slightly different forms of it. Then you take the log of that function, and that generates your PWM.
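One common form of that frequency-to-weights conversion looks like the sketch below. The counts, the pseudocount value, and the flat 25% background are all illustrative assumptions; as noted above, each matrix database does this slightly differently.

```python
import math

# Sketch: convert one column of a frequency matrix into log-odds weights
# (one column of a PWM / PSSM). Counts, pseudocount, and background are
# made-up example values, not from any particular database.
counts = {"A": 5, "C": 0, "G": 1, "T": 2}     # one matrix column, 8 sites
n = sum(counts.values())
pseudocount = 0.8                              # keeps log() away from zero
background = {b: 0.25 for b in "ACGT"}         # adjust for A-rich genomes etc.

weights = {}
for base, c in counts.items():
    # Smoothed probability, divided by background frequency, then log2.
    p = (c + pseudocount * background[base]) / (n + pseudocount)
    weights[base] = math.log2(p / background[base])

print({b: round(w, 2) for b, w in weights.items()})
```

Bases seen more often than background get positive weights; bases never seen get negative (but finite, thanks to the pseudocount) weights.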
And then from that weight matrix, you can now score any DNA sequence for its match to the pattern that you're looking for. For instance, that TGCTG down here can be given a score by summing up the TGCTG cells from that matrix. Is that reasonably clear? Okay, so you shouldn't be too scared of these matrices; that's all they're doing. You're going to see them used a lot now, and there are going to be scores coming out of them. Now, the types of scores you see are going to vary a lot, so let me tell you a little bit about how those scores might be given to you in different tools. You can generate a raw score; the top box up here generates a raw score exactly the same way as on the previous page. Oftentimes you'll hear about things given a relative score, and that's because the scoring range is matrix specific. Depending on the width of your transcription factor binding site and how many sites you've seen, the range of possible scores is different, and so people often prefer to think about scores as a percentage position across that range. You see that type of scoring a lot. More and more, what you'll see is some sort of p-value generated off of that. And I'm not a big fan of p-values in this type of situation, because usually our assumptions for generating those p-values are completely wrong. So I usually think about them more as p-scores: you're essentially using p-values as a score, as opposed to a reliable p-value. What usually happens is that you'll take some sort of long sequence of DNA, you'll scan it with your matrix, and you'll observe every score that you get along the way. You'll build a distribution of the observed scores, and then you'll get your p-value by saying: okay, given the score of a site, what proportion of sites fall below that score in the range? And you'll use that to adjust and calculate this p-value.
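Scoring by summing matrix cells can be sketched like this. The PWM values and the sequence are invented for the example; the point is only the mechanics of sliding a window and summing one cell per position.

```python
# Sketch: scan a DNA sequence with a PWM by summing, for each window,
# the weight of the observed base at each matrix position. The 3-column
# matrix below is a made-up toy favoring the pattern TGA.
pwm = [
    {"A": -1.5, "C": -1.5, "G": -1.5, "T": 1.2},
    {"A": -1.5, "C": -1.5, "G": 1.2, "T": -0.3},
    {"A": 1.2, "C": -1.5, "G": -1.5, "T": -1.5},
]

def score(window, pwm):
    # One cell per position: look up the observed base in that column.
    return sum(col[base] for col, base in zip(pwm, window))

seq = "CCTGACC"
width = len(pwm)
scores = [(i, score(seq[i:i + width], pwm)) for i in range(len(seq) - width + 1)]
best_pos, best_score = max(scores, key=lambda t: t[1])
print(best_pos, round(best_score, 1))  # the TGA window scores highest
```

Collecting every window's score, as in `scores` here, is also exactly the distribution you would use for the empirical "p-score" idea described above.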
You're going to notice that's an extreme value distribution that I've shown there, and the nature of the way these PWMs score is that you get extreme value distributions as the type of model that you see. So that tail coming off the right can be quite long, and you'll have a wide range. It's also true that different matrices will have slightly different curves and different characteristics, so when jumping from one matrix to another, p-value type scoring is generally the best way to make comparisons between them. Now, there are lots of different databases that have been created over the years for pooling together, collecting, and generating these matrices, and my lab happens to create one called JASPAR. The nice thing about JASPAR, in our opinion, is that it's free and open, so we try to make sure that it's available to everybody, and we've been doing it for a while. This is not a statement that JASPAR is the perfect thing; it's just the one we work with in my lab. For those of you who've been using JASPAR and really hate the interface, which really stinks: the interface will be all new in about a month's time, as we'll be releasing a new version to coincide with the NAR database issue for January. There are a few other, I seem to have lost my notes here, there are a few other, maybe just one moment, there we go, there are a few other collections that I can recommend if you don't want to use JASPAR. There's UniPROBE, which Martha Bulyk's lab at Harvard develops, which has a lot of protein-binding microarray type data. CisBP is one that was developed at the University of Toronto here, and it also coincides with a group in Ohio that sort of spun off of that; it has a nice collection that pools a lot of different databases together, so it pulls in a lot of resources. There's another one in Russia called HOCOMOCO.
And HOCOMOCO basically goes to all the different databases and tries to pick the optimal matrix for a given transcription factor, so HOCOMOCO has some nice functionality. The nice thing is that JASPAR often gets selected, so I get a little pride as we go along. But it's a nice collection just because you get that refined set. And there have been some comparisons and benchmarking papers looking across these types of systems, and the bottom line is that there's not much difference between the quality of the matrices that you get out of them; the differences are relatively subtle. So if you find one that's convenient, that you like, and that seems to be working for you, it's probably pretty much as good as any of the other ones. Okay, so I've told you that these databases exist and that they collect these pools of binding profiles. You understand roughly what these binding profiles look like now and how they're generated. But there was still a little bit of magic to that representation, because I showed you in that picture an alignment of all those binding sites on the right-hand side of the screen when we were counting up how many A's, C's, G's and T's, and that alignment came out of outer space again. So we have to think about how we actually find those types of patterns in our sequences. That's the problem of de novo discovery: it's basically a pattern discovery procedure, and there's a whole domain of bioinformatics that spends its life worrying about how you do pattern discovery. I'm going to give you one flavor of that along the way here, but most of the flavors are somewhat similar. As you think about those different methods, if you're into the field, then you know that you can get nuanced and do some pretty creative things. But you'll at least get the gist, and it won't be complete magic anymore. So here's the problem that we have: we have a given set of sequences.
And we think within those sequences there's some sort of common motif, find my cursor again, a common motif that's there. So a similar pattern. We know from our matrix models and those information content pictures that these are not identical sequences. If they were identical sequences, this would be an easy problem. But these are similar sequences, so they share some characteristics. So what we want to be able to do is, given a set of sequences that we think share a common characteristic, find these motifs. We don't know how many motifs are there, so we have to figure out how many are going to be present. We need to figure out how wide they're going to be, because we don't know the width of these patterns. And we need to figure out where they're located. So we've got a few different problems to solve in the same go. Now, if everything were short, if you had, say, 15-base-pair sequences, and you could get your data refined to that level, and some of the new ChIP-exo technologies actually start doing that, this would be easier. But most of the time, the problem you have is that the sequences you're looking for these patterns in are long. And the patterns that you're looking for can often be subtle. If you're looking for something that really has a lot of consistency in the sequence, it's again a relatively easy problem. But the more variable those binding sites are, the more difficult the problem of finding where they're located. So let's work through an example. Let's take a set of sequences from a set of co-regulated genes. You've done some sort of RNA-seq transcription profiling, across maybe a time course or some sort of treatment, and you say these are a cluster of genes that are coming up together at the same time, and we think maybe they're regulated by the same transcription factor. So this speaks to your network type of perspectives and pathway analysis.
So here's your set of genes, and we're going to go in and look for it. One of the challenges, and this is a common error on the computing side, is to recognize that these DNA sequences are double-stranded, and so you have to search for your patterns on both strands of the DNA. You're going to be looking both forward and backward along the way. The unknown transcription factor will bind at positions unknown to us on either DNA strand. So you find these patterns by a procedure, and once you've found these patterns, and I'm going to come back and tell you about the magic of finding these patterns in a second, you can describe each pattern as a matrix model and represent it as information content. So let's go through one way of doing this. There are generally two classes of motif discovery tools used here. There are string-based methods, where you count how many times each n-mer of nucleotides appears and use that as a tool to work through it. The other is a matrix-based way, where you're looking for a pattern as a whole, and I'm going to describe the matrix-based approach here. What we start with is a guess at an initial matrix, and that's going to sound really weird: we just randomly guess. Given our sequences, we randomly assign a bunch of positions and say this is where our binding sites are located. Almost all the time, this is noise. But every so often, in that guess, you're going to hit on some portion of the binding site that is more reliable. And those times that you hit on something a little more consistent with the pattern will nudge the further process in the right direction, and it's going to start zooming in on a refinement procedure that gets there.
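The double-stranded bookkeeping mentioned above can be sketched as follows. The pattern matching here is a plain string comparison just to keep the strand logic in focus; the sequence and pattern are invented for the example (the pattern happens to be palindromic, so it matches both strands at the same position).

```python
# Sketch: because binding sites can lie on either strand, check each
# window of the forward sequence AND its reverse complement.
COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    # Reverse complement: complement each base, then reverse the string.
    return seq.translate(COMP)[::-1]

def find_both_strands(seq, pattern):
    hits = []
    for i in range(len(seq) - len(pattern) + 1):
        window = seq[i:i + len(pattern)]
        if window == pattern:
            hits.append((i, "+"))          # forward-strand match
        if revcomp(window) == pattern:
            hits.append((i, "-"))          # reverse-strand match
    return hits

hits = find_both_strands("AATGACGTCATT", "TGACGTCA")
print(hits)
```

With a matrix model, the same structure applies: you score both the window and its reverse complement and keep whichever strand scores higher.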
So we guess that initial matrix: we construct a weight matrix out of those original positions. We then scan our sequences with that guessed matrix, use it to reselect new sites, construct a new matrix, scan the sequences, select sites, and you just keep doing this over and over again. The key here, though, is that when you scan with your initial matrix, your next guess is not random; it's biased by the matches to the sequence. So I'm going to walk you through that graphically here. We take these sequences and randomly guess positions within them, so we just throw down some sites. We take those sites, pile them together, count how many A's, C's, G's and T's at each position, and construct a PWM, or a frequency matrix in this case. We then take that matrix and scan through a sequence, and we see the scores on that sequence given the matrix we just scanned with: there's a slight match there, a better match there, a slight match there, and not a very good match across the rest of the sequences. So now what we do is say, for a given sequence, we're going to reselect the binding site, and you can do it in two ways; most of these methods will take one of two forms. They'll either take the best site, maximizing the score they get for the next site, or, in a Gibbs sampling procedure, you'll guess in proportion to the heights, so that your odds of landing on that one, for instance, might be 52%: 52% of the time your guess will pick that one, 20% of the time it might pick that one, and so on and so forth. And what you can see out of this type of model is that most of the time it's noise, most of the time it's going to be pretty flat in terms of the distribution, but as you get one or two real sites coming into your matrix, your odds of picking them start increasing.
So you reselect your site, and then you repeat this process over and over and over again until you get the best-scoring matrix that you can, which basically means maximizing the information content of the matrix that you're deriving. One of the challenges with these types of methods is that they have a tendency to get into what are called local maxima, and that means they'll find some pattern that may not be the best pattern, but it's a pattern, and they'll get stuck on it. So what you do is re-launch this pattern search a thousand or ten thousand times on your sequences and then just pick the best ones of those pattern searches, so that you don't get stuck on any one pattern but get a chance to pick them all up. Is that confusing, or do people have some sense of how these methods work? Of these types of methods, the one that most people use right now, though it doesn't work precisely this way, is called MEME. It's more of an expectation maximization method, which basically means it takes the best sites: as opposed to randomly guessing between the 52 percent and the 20 percent, it would always take the 52 percent.
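The guess-scan-reselect loop described above can be sketched in miniature. This is a heavily simplified, Gibbs-sampling-style toy, not MEME's actual algorithm: the sequences are made up, the motif width is assumed known, only one strand is scanned, and there is no restart strategy.

```python
import math
import random

# Sketch of matrix-based motif discovery, Gibbs-sampling flavor:
# guess positions, build a probability matrix, rescan each sequence,
# and reselect each site in proportion to how well it scores.
random.seed(1)
seqs = ["GGTGACTCAGG", "CCTGAGTCACC", "TTTGACTCGTT"]  # toy co-regulated set
W = 7                                                  # assumed motif width
positions = [random.randrange(len(s) - W + 1) for s in seqs]  # random start

def column_probs(sites):
    # Per-position base probabilities with a small pseudocount.
    cols = []
    for i in range(W):
        col = {b: 0.25 for b in "ACGT"}   # pseudocount of 0.25 per base
        for site in sites:
            col[site[i]] += 1
        total = sum(col.values())
        cols.append({b: c / total for b, c in col.items()})
    return cols

for _ in range(50):                        # iterate: build, scan, reselect
    sites = [s[p:p + W] for s, p in zip(seqs, positions)]
    probs = column_probs(sites)
    for k, s in enumerate(seqs):
        # Likelihood of the motif starting at each offset of sequence k.
        weights = [math.prod(probs[j][s[i + j]] for j in range(W))
                   for i in range(len(s) - W + 1)]
        # Sample the new site in proportion to its likelihood
        # (an EM-style method would take the argmax instead).
        positions[k] = random.choices(range(len(weights)), weights)[0]

print([s[p:p + W] for s, p in zip(seqs, positions)])
```

Swapping the `random.choices` line for `max(range(len(weights)), key=weights.__getitem__)` turns the sampler into the best-site (expectation-maximization-like) variant mentioned in the talk.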
Once you've got a pattern that comes out of this, then you have the question of what it looks like. You say, okay, I've got a pattern, but I'd sort of like to know: is it a pattern that I know something about, so a known transcription factor, or is it something new that's different? So there's a variety of tools out there for taking a motif, comparing it against databases of binding profiles, and reporting the similarity. Just like BLAST for sequence similarity, you're going to have this concept of a matrix similarity score to compare your matrix to a database. One of the better-known tools is called Tomtom; it's part of the MEME Suite, so if you're using MEME, you can access Tomtom within that system, or you can use it directly. So Tomtom does that database comparison and says: for the pattern that you gave, here's a similarity score comparing it against the database. I'll come back to a challenge with that in a little bit, which is basically that transcription factors within the same structural family, with one major exception, tend to bind very, very similar sequences, and so you have a hard time defining which specific transcription factor you're looking at. When you get a match, you might say it hits a family of transcription factors, but it won't necessarily give you the one. You can do the same type of thing with protein sequences, where you have a 20-letter alphabet versus a four-letter alphabet, but the underlying mathematics are basically the same. It's a good question. Any other questions? Okay, how are we doing on the last-day morning sleepiness, are we still with me?
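The rough idea behind a matrix similarity score can be sketched as below. This is only the simplest flavor, an averaged per-column Pearson correlation between two aligned probability matrices of the same width; real tools like Tomtom also handle offsets, reverse complements, and proper statistics. Both matrices here are made up.

```python
# Sketch: compare two motifs column by column and average a per-column
# Pearson correlation. Toy probability matrices, equal width, no offsets.
motif_a = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
           {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}]
motif_b = [{"A": 0.6, "C": 0.2, "G": 0.1, "T": 0.1},
           {"A": 0.1, "C": 0.2, "G": 0.6, "T": 0.1}]

def pearson(x, y):
    # Plain Pearson correlation coefficient of two equal-length vectors.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def similarity(m1, m2):
    scores = [pearson([c1[b] for b in "ACGT"], [c2[b] for b in "ACGT"])
              for c1, c2 in zip(m1, m2)]
    return sum(scores) / len(scores)

sim = similarity(motif_a, motif_b)
print(round(sim, 3))   # close to 1.0 for these two similar motifs
```

A high score against many members of one structural family, rather than one clear winner, is exactly the family-level ambiguity described above.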
Okay, let's dig in a little bit on the effectiveness of these weight matrix models, because I've told you that we can generate them, we can find these patterns, we can scan sequences with them. So can we actually find binding sites with these tools? We're going to go back into ancient history now, before some of you were born, so I'm getting old. Back in the 90s, a French group took 50 predicted HNF1 transcription factor binding sites, so they scanned sequences with a matrix model for this protein, HNF1. They took those sequences, did an in vitro, test-tube-based binding assay, and showed that in 96% of the cases, meaning 48 of 50, they observed binding of the protein to that sequence. So that's pretty good: they're scanning sequences, and when they test them in a lab, the protein will stick to those sequences. So that's promising. Also really good: Gary Stormo, who's like the greatest person in this field, so if anyone wants to read papers in this field, just read Gary Stormo's; if you read his papers now, you'll see what people will be doing five years in the future. Gary Stormo's lab, by doing very precise biochemical measurements of binding energy, so looking at proteins and DNA together, measuring their thermodynamic properties and getting their binding energies, related those to the scores that were coming out of the matrix models that he had, and he showed that there was an extremely good correlation between the scores generated by matrices and the binding energies. So that's reassuring: your matrix models are essentially giving you a representation of energy. They're not the same scores, so just to be clear, you're not going to get the actual binding energy out of them, but you're going to get a correlated measure of the same characteristic. So I've told you the good parts. The bad parts: Fickett, in '95, built a really good matrix model for MyoD and predicted binding sites once every 500
base pairs with that matrix model. Depending on how you count, if you think about a 10,000 base pair gene, that's about 20 sites for the MyoD protein per gene. This is a highly important protein for muscle regulation, and it's not going to be binding 20 times to every gene in the genome, so that's basically saying that predicted binding sites are all over the place. And if you take a database of binding profiles, so MyoD is one, you take a database of 800 profiles and you scan across the genome, you're going to get something that looks a little bit like that. You basically have the concept that these binding sites are being predicted far, far too often to be biologically relevant, and that led me to call it the futility theorem, as I originally said; mathematicians corrected me, so it's the futility conjecture, which basically says that the binding sites you predict are almost always wrong. That doesn't mean they're useless; it just means that you should not be taking individual predictions of individual binding sites and saying these are going to be biologically functional events. And this makes sense as we think about the biology we talked about at the beginning: we know that you have all of this additional information about the chromatin structure and how things are organized in the nucleus, so these proteins are not going to be landing from outer space onto a naked piece of DNA. They're going to be in a very complex three-dimensional nuclear space, they're going to have certain regions of the DNA that are open and accessible to them, so they're only going to be exposed to a certain subset of sequences. And the DNA sequence that you're analyzing with a matrix model is not telling you anything about what's going on around it: what are the openness characteristics, what's the chromatin state of the DNA? You are lacking that information. So if you have some reason to think a region is interesting in terms of its openness and
it seems to be important for your situation, then scanning within that region for a transcription factor binding site makes a lot more sense than just scanning along tens of thousands of bases. Now, the other place this becomes useful is if you have a set of genes that are co-regulated and you want to look for enrichment of a pattern, and we're going to come to that in a minute; you can use this type of matrix model effectively there. But be very careful if you're just scanning DNA and trying to say "this is a binding site": you need something more to get to that reliability. Okay, now one thing that people frequently try to do is say, well, I have a dial on these things, I can tune my tolerance and just make a more restrictive call on my sites. So, as we talked about, you can generate a score for your site, and you can use that as an empirical p-value, for instance, or just as a general score. In theory, like with most methods, you'd say, okay, I'll just make it more stringent and I'll get a better positive predictive value, so basically a better ratio of true to false predictions. That's true to an extent with transcription factor binding sites, but in general, once you get to the recommended thresholds for most of these methods, your positive predictive value tapers off, so you don't get much value out of turning up the stringency after a certain point. And why is that? Why can't you just say, I want a 99% score as opposed to an 80% score, for instance, on that range? It's because once the site is good enough for that protein to bind to, then whether that site is actually going to be accessed by that protein depends pretty much entirely on the context.
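To make the scoring and the empirical p-value idea concrete, here's a minimal sketch. Everything in it is made up for illustration: the toy count matrix, the pseudocount, and the shuffling scheme are assumptions, not any particular tool's implementation.

```python
import math
import random

# Hypothetical position frequency matrix for a 6 bp motif (counts per column).
PFM = {
    "A": [8, 1, 0, 9, 0, 2],
    "C": [1, 0, 9, 0, 1, 3],
    "G": [0, 9, 1, 1, 0, 3],
    "T": [1, 0, 0, 0, 9, 2],
}
N = 10        # number of sequences behind the counts
BG = 0.25     # uniform background base frequency
PSEUDO = 0.5  # pseudocount so zero counts don't give log(0)

def build_pwm():
    """Turn counts into log-odds weights: the usual PWM construction."""
    return {base: [math.log2(((c + PSEUDO) / (N + 4 * PSEUDO)) / BG) for c in col]
            for base, col in PFM.items()}

W = build_pwm()
WIDTH = len(PFM["A"])

def score(site):
    """Sum of per-position log-odds weights for one candidate site."""
    return sum(W[base][i] for i, base in enumerate(site))

def best_hit(seq):
    """(score, start) of the best-scoring window in a sequence."""
    return max((score(seq[i:i + WIDTH]), i)
               for i in range(len(seq) - WIDTH + 1))

def empirical_p(observed, seq, n_shuffles=200, seed=1):
    """Fraction of composition-preserving shuffles whose best hit
    scores at least as well as the observed best hit."""
    rng = random.Random(seed)
    chars, hits = list(seq), 0
    for _ in range(n_shuffles):
        rng.shuffle(chars)
        if best_hit("".join(chars))[0] >= observed:
            hits += 1
    return hits / n_shuffles
```

The consensus-containing window scores highest and lands where you'd expect, while the shuffled background rarely reaches that score, which is the intuition behind converting scores into empirical p-values.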
Another thing to recognize: Martha Bulyk's group a few years ago did a really nice study showing that certain genes have evolutionarily tuned their binding sites to get the right characteristics, and that tuning, for things that are expression dependent, may mean they've ended up with somewhat less than perfect binding sites along the way. So biologically there is selection saying we may not want the absolute best site within that region, and you can't just crank up your threshold and say that's better for what you're doing. Okay, so some things we've gotten along the way here, hopefully you'll look at this and say, ah, I think I get that. PWMs actually reflect in vitro binding properties, so if you take a site that you predicted and put it in a test tube, generally it's going to be bound by that protein. But binding sites, in terms of the patterns you're finding, occur far too often to be used as a reliable indicator of in vivo function, so you can't just take a prediction as truth; you really need to bring in that extra information in order to understand what the binding sites are doing, if you're trying to call real ones. Okay, now that I've told you there are limitations and that you should be careful, here's a method that actually helps for interpreting things using these matrix models, and that's for saying: given my set of genes that I think are co-regulated, and I want to figure out what's regulating them, let's scan them for binding sites. The core here is that you have a set of co-expressed genes and some sort of control set, some sort of appropriate background. There are a lot of nuances you can get into in terms of having appropriately matched control genes, in terms of sequence composition and the like, which some of the tools will do for you in the background.
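One way tools match controls on sequence composition is GC content. This is a hypothetical sketch of that idea, not any specific tool's procedure; the tolerance and the greedy nearest-match strategy are assumptions for illustration.

```python
# Sketch: build a background set whose GC content matches the foreground.
def gc(seq):
    """Fraction of G or C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def matched_background(foreground, candidates, tol=0.05):
    """For each foreground sequence, greedily take the unused candidate
    whose GC content is closest, provided it is within +/- tol."""
    pool = list(candidates)
    chosen = []
    for f in foreground:
        target = gc(f)
        best = min(pool, key=lambda c: abs(gc(c) - target), default=None)
        if best is not None and abs(gc(best) - target) <= tol:
            pool.remove(best)
            chosen.append(best)
    return chosen
```

The point of matching is that a motif which is simply GC-rich won't look spuriously enriched in a GC-rich foreground.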
You have a matrix model, or maybe you have a database of matrix models and you pull one out of that database to scan with. You scan the regulatory sequences of the positive set and the negative set, and you observe that you have an enrichment of your pattern in the positive set relative to the negative set. Now, I've shown you the overly simplified, insulting version of this, which shows all of the motifs sitting on the co-expressed genes; the reality is that you're going to have a quantitative difference, so you have some in the negative set and relatively more in the positive set. This type of enrichment analysis, much like you've seen in some of the earlier segments of the workshop, the same concept you've had in looking at GO term enrichments along the way, exactly the same type of methodology, is very powerful, because you can have relatively more noise in your system and still pick up that there's an enrichment. So in the lab this morning you're going to use a tool called iRegulon, which Stein Aerts' lab has produced. It's a relatively user-friendly tool for doing this, as opposed to some of the ones out there that are more at the command line level, and iRegulon lets you do these types of analyses effectively within the system. So you're going to spend some time in this space a little later on and figure out what's going on there.
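The statistical core of this kind of enrichment analysis can be sketched as a one-sided Fisher's exact test on a 2x2 table: genes with versus without a motif hit, in the co-expressed set versus the background. This is a generic illustration of the concept, not iRegulon's actual (considerably more sophisticated) ranking statistic.

```python
from math import comb

def fisher_exact_enrichment(pos_hits, pos_total, neg_hits, neg_total):
    """One-sided Fisher's exact test: probability of seeing at least
    pos_hits motif-positive genes in the positive set by chance,
    from the hypergeometric distribution."""
    hits = pos_hits + neg_hits      # motif-positive genes overall
    total = pos_total + neg_total   # all genes scanned
    p = 0.0
    for k in range(pos_hits, min(hits, pos_total) + 1):
        p += comb(hits, k) * comb(total - hits, pos_total - k) / comb(total, pos_total)
    return p
```

For example, 12 motif-positive genes out of 20 co-expressed genes, against 20 out of 200 background genes, gives a tiny p-value, whereas 3 out of 20 against 30 out of 200 is entirely consistent with chance. This is why a modest quantitative excess over a noisy background is still detectable.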
Okay, so I mentioned this challenge: when you get a hit and you say, okay, I found a motif that seems to be enriched, that motif doesn't necessarily tell you which specific transcription factor is acting, because that motif may essentially be indicating a related transcription factor. For instance, here are a bunch of logos for ETS family transcription factors. These are different transcription factors within the same structural class, and you're going to notice they look awfully similar to each other, so the fact that you get a motif match showing enrichment for one doesn't really tell you that it's any better a candidate than that one, or that one, or that one. So you have this interpretive challenge that says, okay, I think I have an ETS transcription factor acting on my set of genes, but then you have to go back and do some biology and ask, within the biological context I'm looking at, which of the ETS transcription factors might be more relevant? That's still not a perfectly solved problem; people have not really worked out how you do that, so oftentimes there's still a lot of biological interpretation: understanding the literature, looking at which factors are expressed in your context, or whether there's genetic information that would suggest which ones might be involved. A lot of the tools emerging for clinical genome interpretation, where you have the concept of attaching phenotypes and getting a prioritized list of genes, are going to be useful for that, because you can say, okay, my candidate genes in this context are the ETS family transcription factors, my clinical terms might be related to the processes involved in my biological context of interest, and then you can get a prioritization score based on a variety of characteristics. I mentioned ToppGene on this particular slide, but you could pretty much take any tool that gives
you a way to prioritize a set of genes based on some phenotype characteristics. Okay, so changing topics. Refresh your brains, clear out the concept of transcription factor binding sites for a bit, and put yourself back into the context of those early slides about chromatin: the histones floating around, those modifications, and all of that ENCODE work and the other projects that have been compiling that information. What that has provided the bioinformatics community is the capacity to begin to predict which regions have which functions within the genome. So you have some sort of measurements: for instance, this might be a ChIP-seq profile of a histone modification, this might be a DNase hypersensitivity measurement of openness, this might be some indication of polymerase activity; it doesn't really matter, but you have this idea that these are different biological measurements that have been taken along the way. Then we can scan across the genome, and as we look at regions of equal size, we can do a comparison and ask which of those regions look very similar. You basically say, okay, this region here of a thousand base pairs looks a lot like this region over here of a thousand base pairs: they have the same type of profile, the same ChIP-seq modifications showing up, the same openness. So we can group these into sets, and once we've grouped them into sets, we can ask, okay, within those sets, what characteristics are abundant? For instance, one set might look like the end of an exon or the end of a gene; one set might look like an enhancer, because sequences from a data set of known enhancers are enriched in that collection; there might be repression sets; there might be promoter or start regions. So you're able to put labels across the genome, and that's what the Segway tool from Michael Hoffman's lab has been doing, and
there are a few others; ChromHMM is another one in the same class. These types of tools create predictive classifications, and you can then annotate across the genome and say, okay, here's a gene start, here's a gene middle, here's a gene end, here's an enhancer region, and so on. So when we talked about how you limit yourself to a certain subset of regions, one way to do that is to take the outputs of these tools and say, okay, here are the enhancer regions, for instance, near my gene of interest, and then you can start working within those enhancer regions to do your analysis. You can look in the UCSC Genome Browser, and the Segway calls are within that system, under the regulation banner, so you can see what's coming in that set. There's an increasing number of these tools, including some really cool ones using deep learning methods that are getting better and more refined. The way these have traditionally been done is with what are called unsupervised methods, which basically means we don't have a set of known truth that we're training towards; we're just taking everything, clustering it together, and then figuring out what groupings we get out. But we're increasingly seeing methods being developed that use supervised approaches, and that's because we're starting to get enough known cases of each of these classes to have an abundance of training data. This slide is looking at transcription start sites and some of the profiles around them, and you can see some distinct characteristics that the methods are picking up and learning. Segway is available online through the Hoffman lab; they created a new version this past year, so there's a nice refresh for those of you who haven't checked in on it for a while. Okay, so now I'm going to take you
through a couple more tools and methods as we go along. The next one I want to mention is related to some things you did earlier in the workshop: you looked at GO enrichment analysis in some of your earlier segments. One of the tools out there that's quite nice for this is called GREAT, and what GREAT does is this: you give it a set of regulatory regions that have come out of an experiment you've done, for instance a ChIP-seq experiment where you're looking at a transcription factor. You give it a set of regions and say, okay, I see binding of my protein to this region and this region, and what it does is ask, what's the nearest gene? (Nearest doesn't always matter biologically, but from a statistics perspective it works here.) What's the nearest gene to this region I'm interested in, and what are the GO terms attached to that gene? Then you can ask which GO terms are enriched for the regulatory regions you pulled up. So it's a way of going from regulatory regions to GO enrichment characteristics, and oftentimes it's very successful: if you have a ChIP-seq experiment in a cell and you do a GREAT analysis, you'll often come away quite happy, because you'll see an actual relationship to the biology you thought you were studying. And for those of you writing up papers on this, it's really easy and nice just to throw in a GREAT analysis; it's usually a quick win for showing some reliability of your ChIP-seq calls. Okay, so what does it take as an input? A BED file. I don't know if you've talked about BED files along the way? BED files? Okay, so I'm not going to try to introduce you to BED files here, but I'll tell you that, as bioinformaticians, now that you're coming out of the workshop and you're going to be called on by all your friends, you really should go figure out what BED files are. Basically, all they are is a very simple annotation that says
from here to here in the genome, there's some characteristic. You can read about them on the UCSC Genome Browser site, and there are lots of tools out there for working with BED files that make it very easy, fast ways to compare, combine, and intersect sets of genome annotations in terms of regions of the genome. So, as a useful take-home exercise, go learn about BED files. GREAT takes BED as input, and most of the tools you use for processing cis-regulatory regions or calling ChIP-seq peaks have an option to output BED files, so if you're doing any of these types of things, you'll have a BED output option. The output it gives is the enrichment measures, just like most of these GO enrichment analysis tools; the online tool gives you some graphs and visual representations, so if you use the online service you can get a little bit more. Another tool now: I told you I was going to be jumping around, so I'm going to jump around one more time here. Another thing people are often interested in is which transcription factors are interacting. I told you at the beginning of the talk that the overly simplified view of the universe was this one transcription factor landing from outer space, and then I showed you on the next slide that we are really talking about multiple transcription factors interacting. So once you've got one transcription factor in your study, oftentimes the question is: what other transcription factors are working with it? Then comes the question of whether I see evidence in the genome of the motif for my factor of interest having some sort of relationship to a motif for another transcription factor of interest. Oftentimes what you look for is things that are close by, so physically close, or, if you want to get more specific, maybe even some defined spacing between them that indicates a relationship. There have been a few tools over the years that look at this; the one that I recommend
for you is called SpaMo, for spacing of motifs. It's part of the same MEME suite of tools; MEME, Tomtom, and SpaMo are all part of this toolset that Tim Bailey has created in Australia. He was originally at the San Diego Supercomputer Center, so a lot of these tools are available through the San Diego Supercomputer websites. SpaMo basically does what I just described: it takes a motif A and a motif B and asks, do we find them together more often than we would expect by chance? It takes a background set and a foreground set, asks how close together they are, and gives you an output saying these are the pairs that show the most relationship to one another. Okay, now, I told you that I stuck a few extra slides in along the way; they're not going to be in your books, so the next few slides won't be in your physical deck. Apologies; they are in the online set, but hopefully this provides a little bit extra. I told you earlier that predicting transcription factor binding sites is hard, that you have all of these additional considerations, and that you really have to think about where the sites fall; that's really a statement that context matters. If you're going to find a pattern within sequences, you need to understand where it is within the genome. In particular, this comes into play for me because increasingly we're doing clinical genome sequencing, trying to predict where transcription factor binding sites might be disrupted and contributing to the phenotype of patients, and therefore you need some capacity to do this. It's slightly outside the scope of pathway analysis, but it helps you understand things a little in terms of where these binding sites are. So I'm going to show you data in a slightly different way for one slide, just to give you a bit of a reminder. This is data for two different transcription factors; let's focus on HNF4A over here on the left side. What I've done here is I've taken
a set of ChIP-seq data, so this is HNF4A ChIP-seq data. I've taken an HNF4A motif model, down here, and I've scanned each of these ChIP-seq regions for the best match to that motif, and I've plotted the score of that motif versus its position in the ChIP-seq peak, relative to the maximum of the peak. What you're seeing here is a nice enrichment of the best motif in very close proximity to the peak, and that's good. But you're also seeing that oftentimes your best motif falls well away from the ChIP-seq peak, and that's telling you that for those cases there's a better binding site somewhere else than the one falling under your peak position, so you really would like to better understand what's functionally driving the protein to be located within this region. It's helpful to look at this as a review of some of the concepts we've talked about. It's very dense down here, but the relative ratio of best sites to, sort of, false sites is pretty consistent as you go down. Another thing to take away is that ChIP-seq data still has a lot of noise, because down here there are essentially no binding sites for the motif, for the transcription factor, in the regions that you pulled down. These types of plots are useful if you're doing ChIP-seq, useful in general, because oftentimes we'll run a ChIP-seq experiment through and there will be no enrichment of the motif at the peak maximum positions, which is usually an indication that something's really wrong; so it's a good quality control. Okay, so context matters. Now, still sticking with ChIP-seq data: one of the things that we've done is to ask whether we can begin to identify where alterations in DNA sequence impact the binding of the proteins. What we did was take ChIP-seq data sets where we had the whole genome sequence available to us, so that means we know the genotype at
every position across the genome. Within the data we can then find the positions where we have a heterozygous mix of two alleles, allele A and allele B, so two different nucleotides at a position, and we can find the cases where, in the ChIP-seq data, the protein is preferentially binding to one of the alleles, so that most of the data being pulled down in the ChIP-seq experiment is from allele A, for instance, and allele B is not bound at all by that transcription factor. That gives us cases where we have experimental evidence saying this protein is binding to this allele and not that one, in exactly the same cell, and it's a way to improve our capacity. The good news, again talking about where these matrix models are useful: if you look at the cases where there's allele-specific binding, so there's a bias, and compare against a control set from the same experiments of cases with balanced binding to the two alleles, and you look at the scores generated by the weight matrices on the two alleles, what you see is that the bound allele has a stronger score in the allele-specific binding cases, whereas in the control cases the scores are pretty evenly distributed. So it's telling you that your matrices are doing a reasonable job of reflecting binding energies at some level. What it's also telling you, though, is that in 80% of the cases where we see allele-specific binding, there's really no impact on the transcription factor binding site itself, because the variant falls outside of the motif. In all those allele-specific binding events we pulled down, only 20% of the time was the variant actually physically in the binding site. Now, that missing 80% may be a mixture of things: it could be some other transcription factor that's altered, and our analysis suggests that's about 10% of the time, but most of the time the variant is falling outside of the region of a
binding site, and that either means that something else upstream is happening, so that region of DNA might not be open on certain alleles, or that we don't really understand enough about what's going on in these regions; and the evidence we're seeing increasingly suggests that we don't fully understand what's going on in these regions. Our motif analysis really pinpoints that close-in position where the protein is sticking into the interior of the DNA helix, that close interaction, but an effective interaction with the DNA is not just that sticking in; it's also the context, and part of the context is the shape and bending of the DNA itself. There's a tool called DeepSEA, developed by a group at Princeton, and DeepSEA uses a deep learning method to predict transcription factor binding sites and a score for variant impacts within binding sites. DeepSEA looks at about a thousand base pair window; whereas these motif models are looking at about a 12 to 15 base pair window on average, DeepSEA looks across a thousand base pairs and tries to predict how all those thousand base pairs together contribute to the right environment for a transcription factor binding site. It's limited to the cases where we have a load of data to train on, so it's not available for all profiles or all transcription factors, but it's a very interesting approach, because it's really showing us that if we could understand everything about the local vicinity around a transcription factor binding site, we could improve our predictions. And one of the things that's really hot in the field right now, and in the literature a lot, is tools that incorporate work that Remo Rohs's lab at USC has done, looking at the three-dimensional structure of DNA itself. What we know is that, while we think about Crick and Watson and their amazing work, DNA actually
can be twisted a little more tightly or relaxed a little bit, it can be folded a little bit; there are all these different characteristics that can be adjusted in the three-dimensional space of the nucleus. Remo's work has really gone in and developed models to predict, based on the sequence, what the three-dimensional properties of the DNA will be, and that allows you to incorporate into these models not just the sequence level but additional conformational properties of the DNA. It's subtle, so it's not a dramatic improvement over the other methods, but it's adding this extra little characteristic, and it's probably some of what things like DeepSEA are starting to pick up and describe in their more complex models. You can take any given sequence, go to the website that provides the TFBSshape tool, scan it, and it'll give you these measures of the three-dimensional properties of the motif you're analyzing and working with. So that's just to give you a flavor that there's more coming, that some of these methodologies are going to be refined as we add in these additional characteristics. Nearing the end here, let me give you a few statements of things that are coming a little ahead. Now that we're armed with the tools of genomics and complete inventories of the transcription factors, a lot of the work ahead is to understand the roles of these different transcription factors over time, how things change over development or the progression of disease; to understand genetic variation in transcription factor binding sites; and to better integrate these data sources in ways that are convenient for people to work with, so that this isn't restricted to the hardcore bioinformatics groups. And, not so much in what I talked about today, but embedded in the DeepSEA work, is this idea
of transitioning from these matrix models to something more complex that allows us to capture this additional information. So, now that you're at the end, and we're nearing our coffee break, a few things we should cover. What we've talked about is this futility conjecture, which basically says that you can scan a sequence with a predictive model, but the individual instances of binding sites it predicts are probably not reliable; that in order to predict these things, you need to add additional information, and that information can be of many different kinds, either sets of genes that are co-regulated or specific regions with evidence of function, so it's finding ways to bring that additional biological information into the process that lets you work more successfully; that looking for enrichment of transcription factor binding sites across sets of genes is a much more powerful approach than looking at any one gene in isolation; and that in order to do pattern discovery well (that one's not so much covered today), the discovery will have to incorporate additional information. So that's it for what I was trying to convey to you today. I hope you got a little bit out of it; I see a few sleepy eyes around the room. Any questions for me before we move on to the next step? [Question from the back.] That's a great question; for the microphone, I'll just repeat it. The question is whether these characteristics are conserved across species in terms of the binding properties, and the answer is yes. If you look across even vast distances, for instance from yeast to humans, what you'll see is that transcription factors in the same structural class will bind to very similar sequences, so you are able to leverage the information from one species oftentimes very effectively in another. Obviously there have been some changes over evolutionary time; you introduce certain new families at some of the later
stages of evolution, and you get massive expansions of some families along the way. The piece that is the most challenging to work with is the zinc finger transcription factors, and, I see some nods in the room saying zinc fingers are the bane of what we do. In humans, about a third to a half of all the sequence-specific DNA binding proteins are zinc finger transcription factors, and the reason these are so hard is that a single amino acid change at the tips of these zinc fingers will change the binding specificity of the protein. So really, most of the time, the field sorts these into zinc fingers and not zinc fingers for the purpose of analysis. Outside the zinc fingers there's a lot of consistency; inside the zinc fingers there's not, so you really need these profiling experiments, which take these protein binding microarray type methodologies and generate large numbers of profiles. We're starting to see some methods that do a better job of predicting the target specificity of zinc finger proteins, so over the next little while we'll do a better job of figuring these out, and some of the work going on in the CRISPR space is helping with this, because it has an overlapping relationship with our ability to understand that. Very good question. Any other questions for the morning? [Question about the limitation on motif size for these kinds of approaches.] It depends on what you're doing and how you define it; you always have to balance the different limitations you're accepting. So, just for the microphone, the question is really: how wide is the motif that you're looking for, and how short a motif can you work with? Obviously it depends on how strong the pattern is within the sequence: you can have a really long pattern that has essentially no information, which is about the same as a really short pattern that's very well defined. So it has a lot to do with that information
content properties that you see there. Most of these tools, when you're asking where the pattern is, basically take these types of plots, and they have some threshold at which you run out of high information content positions: they start towards the middle of the motif and scan outwards until high information content positions no longer appear, and they'll say, okay, that's a good place to cut it off. That doesn't mean you have zero information content beyond that point, because sometimes outside those positions you have very subtle information content, but usually they'll cut it off somewhere, and there's no magic: different tools will have slightly different measures. In terms of how long a sequence they analyze, which is a related question to what you're asking, again it depends on the strength of the pattern you're working with and how noisy you think your data set is. For instance, if you're doing an enrichment analysis where you have a set of co-regulated genes and you want to know how big a region you can scan with your matrix model, our general approach has been that once you get beyond one or two thousand base pairs of sequence, most of these methods start getting a little less effective. But that's one to two thousand base pairs of analyzed sequence, so if you have some sort of filtering you're doing, for instance on open regions, or on conserved regions, or some other way to refine yourself, you can go to longer stretches, because you're only actually going to be working with a subset of that region. But it's not ten thousand base pairs. And if you happen to work in yeast, you should be very happy, because yeast is really the ideal environment for these types of analyses: most of your regulatory regions are packed in close to the start, and most of the start regions are, you know, several hundred base
pairs long, so in yeast you can do lots of cool things that you can't extend to other species.
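The information-content trimming described in that answer can be sketched as follows. This is a toy illustration under assumptions (uniform background, a made-up bits threshold), not the exact procedure of any particular motif discovery tool.

```python
import math

def column_bits(freqs):
    """Information content in bits of one motif column of base
    frequencies, assuming a uniform 0.25 background (max is 2 bits)."""
    return 2.0 + sum(p * math.log2(p) for p in freqs if p > 0)

def trim_motif(matrix, min_bits=0.3):
    """matrix: list of columns, each a list of 4 base frequencies.
    Start at the most informative column and expand outwards while
    neighbouring columns stay at or above min_bits; return the
    (start, end) half-open column range to keep."""
    bits = [column_bits(col) for col in matrix]
    centre = bits.index(max(bits))
    start = end = centre
    while start > 0 and bits[start - 1] >= min_bits:
        start -= 1
    while end < len(bits) - 1 and bits[end + 1] >= min_bits:
        end += 1
    return start, end + 1
```

On a matrix whose flanking columns are near-uniform (about 0 bits) and whose core columns are well defined (1 to 2 bits), the trim keeps only the informative core, which is exactly the "run out of high information content positions" cutoff described above.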