So, yeah, the next session is by Wouter Meuleman from MIT and the Broad Institute, and he's going to tell you how you can take massive collections of ChromHMM and Segway tracks and make sense of them. Hopefully.

Thanks, Anshul. Okay, so we've just seen these two pretty cool tools that are very useful, and if you haven't used them yet, I really encourage you to look at them. They've been extremely useful for me to take our large sets of data for ENCODE or for Roadmap Epigenomics and make sense of them a little bit better.

So, this is typically what we start off with, right, in the best-case scenario. Across a large number of cell types, in this case 127 different epigenomes, and across the whole genome, we have a number of chromatin marks or epigenomic marks that we measure, and you end up with this sort of monstrous data cube that's not necessarily extremely useful. What these two methods, ChromHMM and Segway, allow you to do is essentially to summarize it along that one dimension, right? So, you end up with a 2D matrix like this instead. In this case, this is the output of ChromHMM, so for a given genomic region, across all these different cell types, you see how they behave in terms of epigenomic marks. So, like Jason already showed you, you see a lot of green stuff here. That means it's probably transcribed. You see some red stuff for promoters. You see some yellow for enhancers, et cetera. So, any given color you see here corresponds to a state in the chromatin state model, right? And this is conceptually identical between ChromHMM and Segway.

The problem, or this is where it becomes a little hairy, is that this is only a very small region of the genome. If you know which region you're interested in, this is great.
You can pull this up in the UCSC browser, look at your region and see how that region varies across different cell types. But this is less than 0.2% of the genome, right? So, if I give this whole matrix to you for the whole genome, good luck. Where do you start? Where are the interesting regions? Where are the regions you might want to follow up on in downstream analyses? At the same time, this is only a hundred and something cell types or samples, right? It's very likely that in the very near future, this is gonna run into the thousands or even more. So, the problem is that this matrix is not as small as this. It's ever-expanding. So, we need a better method for finding regions that are of interest, or regions that are surprising, if you will.

So, if you look at this matrix, and this is just a very small portion of it, you can think of it as kind of like a multiple alignment. In this case, this is an alignment of a number of binding sites, a number of c-Myc binding sites, which consist of A, C, G, and T. And this is just the same kind of alignment of a number of sequences, where you don't have A, C, G, and T, but you have yellow and green and red and whatever. Now, we never show a c-Myc binding site like this, right? There are methods we can use to summarize this, and they're called logos. So, we always show it like this. This tells us that the C is very well conserved here, and indeed, there's a C everywhere, whereas, for example, here at the last position, it's only G, C, and T, it's never an A, so we show this position as a little less tall, right? It contains less information there. And the nice thing about these logos is that you can take into account the background frequency of each of these residues, each of these nucleotides. So, let's say you have a very AT-rich genome, then you can take that into account.
Now, for nucleotide-based motifs, this is not necessarily super interesting because it's gonna be between 20 and 30%, usually, for each of the nucleotides. But for these chromatin states, it's very important, because if we go back, you see that this is the genome-wide coverage of each of the chromatin states. You see that some states are pretty rare, 0.7% is in the active TSS state, whereas others are covering nearly 70% of the genome, right? So, the nice thing about these logos is we can try and take this into account. If I see a region with a lot of white, it's less interesting to me than if I see a region with a lot of red, because red is a much rarer state than white.

All right, so let's see how we can take that into account. So this is what we're gonna do. We're gonna take this kind of view and we're gonna turn it into a logo like we do for an alignment of nucleotide sequences, except we're not gonna use A, C, G, and T, we're gonna use colors. So, this is the same view as I showed you before, that 2D matrix, except squashed, because I wanted to show this zoomed-in version here as well. So, it's over there. And now we use this very simple principle, a very simple information theory transformation, to turn this into essentially a logo, which looks something like this. And this is what we call epilogos. So, imagine looking at this versus looking at this 2D matrix, right? You can now immediately see on the y-axis the level of surprise, or how much information is contained at any given position. And you can see that these kinds of regions are not very interesting, but there's a lot of interesting stuff going on here, because this particular combination of chromatin states at that particular position is something you would rarely expect to see by random chance alone. All right, so now you can look at this for the whole genome for as long as you want.
And here, in this case, it's like a silly movie where every color is a chromatin state and the height is the level of surprise. And you can make it even more silly by tying the speed of this movie to the amount of information: the less information, the faster it goes. So, you end up in a boring region, like there's one coming up here, and you're gonna go a little faster. This is a super interesting region, that's why we're taking it slow, okay? But look, it's just a little silly gimmick to keep your attention here after this long day of tutorials.

Now, one of the reasons I'm excited about presenting this today at this meeting is because many of you are users of ENCODE data. So, hopefully, many of you are going to be users of epilogos as well as we keep developing this. And for this, I would love to get your input. So, if you have a moment today, either during the talk or during this deep learning tutorial, take a moment to go here. It's a very short questionnaire, just to gauge which of the epilogos features that I'm gonna go through in the coming couple of minutes are most interesting to you, so we can steer our development efforts towards those.

Okay, so briefly, this is roughly how it works. Like I said, it's a very simple transformation. In essence, it's identical to what's being used for generating DNA sequence logos or amino acid sequence logos, but you can make it more complex as you go along. So, this is the 2D matrix I showed you before. This is the output, and it's basically a relative entropy transformation. What that roughly means is that for any given position, let's choose this position, you simply tally up, for each of the states (in this case, we're using a 15-state model), what percentage of samples is in each state.
So, in this case, everything is red, so 100% is in state one. That's P here, okay. And you can see this as kind of like an observed-over-expected ratio, roughly, right? So, we're gonna compare what we observe to what we expect, and what do we expect? Well, this is just a very simple way of showing the average genome-wide frequencies of each of the states. So, like I said, state number one occurs in only 0.7% of the genome, and now all of a sudden we see it in 100% of the samples. So, that's a pretty large difference. So, this is indeed a very surprising position, like you see here. It's higher than most of the other stuff around it.

So, this is the very basic version of epilogos, right? It doesn't go much further than this. It just assumes that everything is the same across all samples and across the whole genome and things like that. We're also developing more complex models that take into account similarities and expected co-occurrences between chromatin states, as well as similarities between the different samples you're using. For example, if you're building this based on 20 T-cell samples and one brain cell sample, then you want to weigh those samples a little differently and not put too much weight on the T-cell samples. Anyway, so that's stuff we're developing. This is the general concept of it.

For this basic model, there is a prototype website out there that you can go to, and it allows you to generate these epilogos for different sets of samples, okay? So, the one I showed you before is based on all 127 Roadmap epigenomes. But here on this website you can select any arbitrary subset of epigenomes. So, this table you see first is a number of pre-computed scenarios.
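
A minimal sketch of the basic relative entropy transformation described above, for a single genomic position. The function name, the three-state toy background, and the use of log base 2 are assumptions made for illustration; only the 0.7% and roughly 70% coverage figures come from the talk, and the real 15-state model has more states than shown here.

```python
import math
from collections import Counter

# Hypothetical genome-wide state frequencies (the Q in the talk).
# 0.7% for the active TSS state and ~70% for the most common state
# are quoted in the talk; the remaining value is filler so Q sums to 1.
background = {"TssA": 0.007, "Tx": 0.313, "Quies": 0.68}

def epilogos_column(states, background):
    """Per-state bar heights for one position.

    states: one chromatin state label per sample at this position.
    Each observed state s contributes p_s * log2(p_s / q_s), where p_s
    is the fraction of samples in state s and q_s its background
    frequency. States that are not observed contribute nothing.
    """
    n = len(states)
    heights = {}
    for state, count in Counter(states).items():
        p = count / n
        heights[state] = p * math.log2(p / background[state])
    return heights

# All 127 samples in the rare active TSS state: a very surprising column.
all_red = epilogos_column(["TssA"] * 127, background)
# All 127 samples in the most common state: a much less surprising column.
all_white = epilogos_column(["Quies"] * 127, background)
print(sum(all_red.values()), sum(all_white.values()))
```

Stacking the per-state heights gives the column drawn in the logo, so a position where every sample sits in a rare state towers over one where every sample sits in the common state.
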
And if you hover over each of the cells, you can see which epigenomes are included there. And if you don't like any of these pre-computed sets, then by all means scroll down and choose any combination of epigenomes. Now mind you, this takes a little bit of time to compute, about 10 minutes or something. So, if you are interested in, let's say, only blood and T-cell samples, you can just click the green area here and it'll automatically select everything. It'll do whatever you want, but if you want things to be a little bit faster, then just choose any one of these single categories, or some combination of them.

Okay, now what happens if you do that? So let's say we have selected all the blood and T-cell samples from Roadmap. Then the result page looks a little like this. Here, again, are the ChromHMM chromatin state calls, like Jason also showed you in the UCSC browser. In this case, this is restricted to only the blood and T-cell samples in Roadmap. So that's what you're familiar with for a small portion of the genome. This is actually the WashU Epigenome Browser that's embedded in this website, so it works exactly the same as the WashU Epigenome Browser. You can fill in different coordinates or gene names or a SNP of interest. You can use this to zoom in and out and slide across the genome. Now, like you already saw, this is the epilogos version, the epilogos transformation of this set of chromatin states here. So you see that, based on the genome-wide frequencies of the chromatin states, there are some regions that do not seem to be very interesting or surprising, and some regions that seem to contain more information. Below here, there are two lists.
These make it a little easier for you to navigate, to jump to places in the genome that might be of interest, right? So you're looking at this current window here. What this list basically allows you to do is find some interesting stuff that's over there or over there, right? Something outside of your current view range. It'll just jump there. This is recomputed every time. So right now, we're at the current window centered here. You can jump to something to the left, to the right, and things like that. Now here, on the right, you get the genome-wide top-scoring regions. So if you're just interested in finding, out of all these samples, what are the top 100 most interesting regions in terms of chromatin state variability, then you can look at this. And you can filter this in different ways if you're interested in specific states only.

Okay. So this is the very basic version of the prototype website. Now we're working on a number of applications. And this is one of the reasons why I pointed you to that feedback URL, because we'd love to get your input on what you think is most useful. So we already looked at the interactive visualization using the WashU browser.

One kind of cute application, I think, and it's a little silly: as we're growing the number of epigenomes, the number of samples, just like we have a reference genome, right, we might also want to generate at some point a reference epigenome. In general, what does the human epigenome look like? Maybe. In any case, you can think of scenarios in which it would be useful to not show 100 tracks in a genome browser, but just get a general sense of what the epigenomic data looks like in that region. And just like from a motif we can derive a very simple consensus sequence,
we can do the same from a little piece of epilogos here by just reading off the most informative state as we go along. It's kind of silly, but maybe useful.

Something that I think is much more interesting is comparative epigenomics. Let me take you through that for a couple of minutes. So let's say you have these chromatin state calls for all these Roadmap epigenomes, more than 100 different cell types. But now you want to find the regions in the genome that are the most different between stem cells and non-stem cells, embryonic stem cells and other cell types, right? And this is something that epilogos is perfectly suited for. So here, these red rows indicate the embryonic stem cell samples in Roadmap. And we're gonna compare these red rows, for the whole genome, to the blue rows to find regions that are very different between the two groups. So what we do is we select these two subsets, we build epilogos for both of them, and then we use a statistical test at every position to assess how different they are. And in this case, I've conveniently also moved into a region that's already very different between the two. This is an anti-proliferation gene, BTG2, and you see that indeed in embryonic stem cells, which are dividing like crazy, that's actually a repressed region here, the gene is repressed or poised, whereas here in non-embryonic stem cells, you see that the gene might actually be active.

Okay, now this is one example where you could say, well, this is a little obvious: if I look here, I can already see that the difference is there, right? But you might want to use this more for exploratory data analysis. It doesn't have to be just a group of cell types versus another group of cell types. You could do this for any arbitrary subset. So let's do it, again, red versus blue, okay?
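
The two-group comparison described above (build epilogos for each subset, then score every position) can be sketched as follows. The talk does not say which statistical test is used, so this sketch swaps in a plain sum of absolute differences between the two groups' per-state heights as a stand-in score. All function names, the toy matrix, and the background frequencies are hypothetical.

```python
import math
from collections import Counter

def state_heights(states, background):
    """Per-state relative entropy contributions for one position."""
    n = len(states)
    return {s: (c / n) * math.log2((c / n) / background[s])
            for s, c in Counter(states).items()}

def difference_score(group_a, group_b, background):
    """Stand-in for the statistical test (an assumption, not the
    talk's method): sum of absolute per-state height differences."""
    ha = state_heights(group_a, background)
    hb = state_heights(group_b, background)
    return sum(abs(ha.get(s, 0.0) - hb.get(s, 0.0))
               for s in set(ha) | set(hb))

def rank_positions(matrix, rows_a, rows_b, background):
    """matrix[i][j] is the state of sample i at position j.
    Returns (score, position) pairs, most different position first."""
    scores = []
    for j in range(len(matrix[0])):
        a = [matrix[i][j] for i in rows_a]
        b = [matrix[i][j] for i in rows_b]
        scores.append((difference_score(a, b, background), j))
    return sorted(scores, reverse=True)

# Toy data: rows 0-1 are the "red" group, rows 2-3 the "blue" group.
background = {"Tx": 0.2, "ReprPC": 0.1, "Quies": 0.7}
matrix = [
    ["Quies", "ReprPC"],
    ["Quies", "ReprPC"],
    ["Quies", "Tx"],
    ["Quies", "Tx"],
]
ranked = rank_positions(matrix, rows_a=[0, 1], rows_b=[2, 3],
                        background=background)
print(ranked)
```

In the toy matrix the second position is repressed in one group and transcribed in the other, so it ranks first, while the position where both groups agree scores zero.
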
Now let's see if we can figure out what kind of comparison this is. So if we run this analysis, we build epilogos for the red rows, we build epilogos for the blue rows, and we start comparing them at every position. You can imagine that we get a plot that looks like this. Across the genome, for every chromosome, you can indicate what the magnitude of the difference is. This could be minus log10 p-value. This could be some kind of test statistic or whatever, right? But let's just say that the higher the level, the more significant the difference between the two groups is. So based on this, what would you say is red versus blue, any idea? We're doing a comparison between two groups of samples. Male versus female, I hear. Well, that seems reasonable, because the vast majority of differences are on chromosome X. And if you look closely, you can see that there's this one peak here. It's by far the highest-scoring difference. And it turns out that that region looks like this. Now it might not be super clear to you that this is an interesting region just by looking at your chromatin state plot. But if you do this epilogos transformation, then you see very clearly that there's a huge difference between the two groups. And indeed, this is the XIST locus. So this shows you that it's not just for finding differences between groups of samples or groups of epigenomes; the readout is also very visual. It's immediately clear to you not just that there is a difference, but also why there is a difference: in terms of which chromatin states does it differ?

Okay, so that's one application that I think might be useful. Another one is spatial pattern analysis. And you can think of this as a little bit like de novo motif finding. So let's say you do a ChIP-seq experiment and you have a number of binding sites for your protein.
And then you go in and you select these regions of the genome and you do de novo motif finding, so basically a de novo discovery of what the binding motif of that protein might be. You can do a very similar thing in chromatin state space. Let's say you have a number of regions of interest in your cell type of interest. Let's say you've done some kind of enhancer assay and you found that there are these thousand regions of the genome that really seem to correspond to strong enhancer activity, or whatever kind of experiment you're doing, TAD boundaries or I don't know. What you can do then is take the chromatin state calls in those thousand regions and use Gibbs sampling or EM kinds of approaches to find common shared patterns in there, just like you would do for de novo motif searching. And this allows you to find all these mini epilogos. Mini, just like you would have a sequence motif: you have a very small epilogos motif there.

And the cool thing about this is that it doesn't just allow you to find these patterns. Again, it's a very visual readout of what the underlying epigenomes look like at those positions, but it also allows you to share them and, more importantly, scan unrelated cell types. So let's think about this. Let's say you do this crazy enhancer assay that costs a lot of money and you find a thousand regions that show very clear enhancer activity in one cell type. But then you want to find other enhancers in other cell types and you don't want to redo this assay. Assuming that there is some kind of epigenomic signature underlying these enhancer regions, you can take the epilogos you found, the spatial patterns you found, and just scan other epigenomes. So you scan other rows in your ChromHMM matrix and find very specific instances of these patterns.
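
A minimal sketch of the scanning step described above, assuming the regions of interest have already been aligned (the Gibbs sampling or EM discovery step mentioned in the talk is not shown). The pattern is represented as a position-specific log-odds matrix over chromatin states, in the same spirit as a sequence motif matrix. The function names, pseudocount, and toy data are all hypothetical.

```python
import math
from collections import Counter

def build_pattern(aligned_regions, background, pseudocount=0.5):
    """Position-specific log-odds scores from pre-aligned regions.

    aligned_regions: equal-length lists of state labels, one per region.
    Returns one dict per offset mapping state -> log2(p / q).
    """
    states = list(background)
    total = len(aligned_regions) + pseudocount * len(states)
    pattern = []
    for k in range(len(aligned_regions[0])):
        counts = Counter(region[k] for region in aligned_regions)
        pattern.append({
            s: math.log2(((counts[s] + pseudocount) / total) / background[s])
            for s in states
        })
    return pattern

def scan(track, pattern):
    """Score every window of one epigenome's state track."""
    width = len(pattern)
    return [sum(pattern[k][track[i + k]] for k in range(width))
            for i in range(len(track) - width + 1)]

# Toy example: a three-bin pattern learned in one cell type...
background = {"Enh": 0.05, "Tx": 0.25, "Quies": 0.7}
regions = [
    ["Quies", "Enh", "Enh"],
    ["Quies", "Enh", "Enh"],
    ["Tx", "Enh", "Enh"],
]
pattern = build_pattern(regions, background)

# ...then scanned across the state calls of another, unrelated sample.
other_track = ["Quies", "Quies", "Quies", "Enh", "Enh", "Tx", "Tx"]
scores = scan(other_track, pattern)
best_start = max(range(len(scores)), key=scores.__getitem__)
print(best_start, scores)
```

The highest-scoring window is the candidate instance of the pattern in the new sample; in practice a score threshold would be needed to decide which windows count as hits.
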
This one I'm going to skip over, but you can think of it in a similar way as you would an evolutionary sequence analysis. You could think of similar ways to find patterns of chromatin state changes across differentiation or during evolution.

This one I want to spend a couple of minutes on as well. This is about using your own data along with the data we already have in the system. And this aligns well with your DNAnexus experience earlier today. One of the questions we sometimes get is that people go to the epilogos website, they see all these Roadmap epigenomes, but they go like, okay, so now how can I get my own data in there? Or, I have a bunch of T-cell samples that I would like to compare to the Roadmap T-cell samples. How would that work? And the tricky part of that is that even if we would provide the software for doing these epilogos transformations, which is something we are working towards, it's still not the same as actually comparing it to the Roadmap data, because in order to be able to do that you need to process things identically, as you learned earlier today, in terms of read mapping, read filtering, QC, like we did for Roadmap.

So here's the idea. The idea is to give people the option of using their own data and putting it in our DNAnexus pipeline that really mimics the processing we did for Roadmap. So they throw 10 gigabytes of their data in there. We never have actual access to their data. This all goes through DNAnexus, but the output is about a megabyte of chromatin state calls for the whole genome for their specific sample. Based on the Roadmap models, it'll tell you for every 200 base pair region what state it's in.
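
To make the "one state per 200 base pair region" output concrete, here is a small sketch that expands segment-style state calls into per-bin calls. It assumes the calls arrive as BED-like lines with chromosome, start, end, and state label, and that segment boundaries fall on multiples of 200; the function name and sample lines are hypothetical, and the talk does not specify the pipeline's actual output format.

```python
def segments_to_bins(bed_lines, bin_size=200):
    """Expand BED-like state segments into one state call per bin.

    Returns a dict mapping (chrom, bin_start) -> state label.
    Assumes segment starts and ends are multiples of bin_size.
    """
    bins = {}
    for line in bed_lines:
        # Skip blank lines and browser/track header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, state = line.split()[:4]
        for pos in range(int(start), int(end), bin_size):
            bins[(chrom, pos)] = state
    return bins

# Illustrative input: two segments covering 800 bp in total.
example = [
    "chr1\t0\t600\t15_Quies",
    "chr1\t600\t800\t1_TssA",
]
calls = segments_to_bins(example)
print(len(calls), calls[("chr1", 600)])
```

Segment files stay small because long runs of the same state collapse into a single line, which fits the talk's point that 10 gigabytes of reads reduce to roughly a megabyte of state calls.
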
And then of course the dream is to feed this automatically back into the epilogos system so that, just like you would select any other arbitrary subset of Roadmap epigenomes, you could select your own sample as well. And this would allow, for the first time, an actual integration of your own data or other third-party data with Roadmap samples. Oh, yeah, there's something here: DNAnexus was kind enough to already sponsor this kind of processing. So hopefully, if we move forward with actually rolling this out, the analysis you would do through this, so reducing your 10 gigabytes of data to a megabyte of chromatin state calls, would be free to you.

That's an overview of a couple of potential uses of epilogos. I wouldn't say it's up to you, but you definitely have a large amount of influence on this. Go check out the website if you want to, and please also, if you like, fill out the short feedback questionnaire. Any kind of feedback you have is super helpful and appreciated. Thanks to, yes, thanks to a lot of people. Most importantly, these people, who were awesome. These are my students, who started working on building this web application prototype, and also Ting Wang and his people from the WashU browser for help with integrating it with epilogos, and you for being here and of course filling out the questionnaire. Thanks.

Yeah, so if you're gonna follow up the analysis with something like de novo motif calling, you could imagine computationally trying to do both things at the same time. Have you thought about what the advantages of either approach might be? Does that make sense? So basically what's happening here is you're first gonna do a phase of unsupervised learning where you learn some chromatin states, and then you follow that up with something like de novo motif discovery.
But you could imagine some kind of algorithm which is taking both levels of analysis into account simultaneously, and have you thought about whether that is appropriate for, kind of...

Yeah, totally, absolutely. So I guess you're saying to build a chromatin state model and have it guided by finding patterns across multiple samples, across multiple regions. Yeah, so I would say that's a little bit outside of the scope of this particular project. This really starts at chromatin state segmentations, right? It just builds on top of that. I'm sure it would be interesting to think of ways of training chromatin state models to find specific patterns you might be interested in, but I think that's a little bit outside of the scope of this. Could be interesting.

I actually have a technical question about your relative entropy equation. So in the denominator, you have Q_i, which is the abundance of those state calls throughout the genome. But when you have different genomes, those percentages are not the same from genome to genome, so what do you do?

So for the basic model that I showed, it's just an average across all cell types.

Okay.

But like I said, we have additional models in which Q is not a vector but a 4D array, in which we take into account all combinations of samples with samples, and also all chromatin states with chromatin states. So then you take into account very specific occurrence frequencies for every combination of these. That's what I meant when I mentioned that if you build an epilogos for 20 T-cell samples and one brain cell sample, they don't necessarily have the same background frequencies. You want to take that into account.

Yeah.

But we have that as well.

Yeah. Okay, cool. Let's thank Wouter again for his awesome talk.