Yeah, so I'm Luca Pinello. I'm a postdoc in the Yuan Lab. Today we will see how we can learn chromatin states from ChIP-seq data, and I hope that by the end of the talk you will be an expert on ChromHMM.

So this is the outline of the talk. Before learning chromatin states, the question is: why do we want to learn chromatin states? I will talk just briefly about combinatorial patterns of histone modifications. Then I will present the main idea of how you can segment the genome into chromatin states. Then we will see how you can run ChromHMM step by step, and finally I will show you other methods that are available to call chromatin states.

I want to say that on the ENCODE website there is a Word document that I suggest you download. With ChromHMM they provide a toy example, so you can run this example on your own machine. In this document you have all the instructions to run the toy example, and on a normal computer it takes just 10 minutes.

I don't have to give an introduction, because we have already been talking about epigenetics and chromatin structure and how epigenetics influences gene expression. One key idea, just briefly, is that we have different histone modifications that are associated with different functional regions. For example, some marks are usually associated with enhancers, and some marks are usually associated with promoters or with repressed regions.

This is also something I don't need to explain, because we saw already in the previous tutorial how we can get tracks for histone modifications through ChIP-seq. We saw how you can align this data and get tracks genome-wide. And usually, having these tracks, we see that there are some enriched regions.
And usually what people do is call peaks, right? This approach is really helpful if you don't have too many tracks. When you have many different histone modifications, you can still call peaks, but the problem is that it's really hard to combine all this information. Another point to keep in mind is that the histone modifications are not independent. They are kind of redundant, so they can share some information. So the challenge is: if I give you this set of tracks, what can we learn from it? The idea is that if we can somehow summarize this information in a more succinct way, maybe it's easier to explore.

This is a problem that many people have tried to solve. ChromHMM, a tool developed by Jason Ernst and Manolis Kellis, is one of the most popular, so that's why I wanted to introduce this tool today. Now, if you read the definition of ChromHMM, it can be a little bit scary. It says it's a Java program for the learning and analysis of chromatin states using a multivariate hidden Markov model that explicitly models the observed combination of marks. It seems scary, but in reality it's super simple.

In practice, when we talk about chromatin states, essentially what we are trying to do is find combinations of histone modifications and assign each one a name that is meaningful and corresponds to a functional region. For example, in this table from the original paper, the rows are chromatin states, and these are the names: active promoter, weak promoter, inactive/poised promoter, strong enhancer and so on. And the columns here are the histone modifications that they used to call these chromatin states.
So now you see that if you go row by row, different combinations of histone modifications correspond to different functional regions. One thing I want to say now, and this is important to keep in mind: whatever model you run using histone modifications will not give you these labels. There is no model that does that. You run your model and you get a table like this one, and then you have to assign labels to the states yourself. The way to do this is, for example, to use other annotations and look at where each state is enriched. For example, if you see that one state is really enriched at the TSS, it is probably a promoter state. But again, the goal is to segment the genome into biologically meaningful units.

If we take a look at ChromHMM, the input is a set of ChIP-seq tracks for different histone modifications. The output is a segmentation of the genome. That means that to each region of the genome you can assign a label or a color. In this picture, I'm showing you the segmentations for different cell types. Each one of these rows is a different cell type, so you have H1, K562 and so on, and the colors represent the chromatin states. What you can see is that you summarize the main information of all these histone modifications in just one track. This is really helpful when you compare multiple cell lines, because then you can see that, for example, in this region the cell lines are more similar, but here you have a lot of things going on, right? If you look more carefully here, you see that to obtain these tracks you actually use all the histone modifications. Each row here corresponds to the original input that you use to call these states.
And here I'm showing you only four of the cell types that you have here. You see that, for example, this repressive mark here is present in these two cell types but not in the others. So you have a lot of comparisons to do if you want to look track by track. But in this representation, it's much easier to compare the chromatin states of these different cell types. Is this clear, or is there any question up to now? Okay. So let's move to the second part.

I think I have shown you that it's really useful to segment the genome into chromatin states. Now many people say, okay, this is really nice, but I want to run ChromHMM on my own data, or on some data that I downloaded from ENCODE. From ENCODE we can get so much data for free, but we don't always have the ChromHMM segmentation. In the document that you can download from the page for the meeting, I say that in order to run ChromHMM you essentially need just Java. Many of you, if you have a Mac, have Java already installed. You also need to download the ChromHMM software. Many people during these two days asked me how to install ChromHMM. Since it's a Java program, you don't need to install anything. You just decompress the zip file somewhere, and please remember where you decompress it. If you go to that folder, you will have everything you need to run the program. We will go step by step through how to do that.

But many times you want to run ChromHMM on your own data. What do you need for that? You essentially need aligned ChIP-seq data sets for different histone modifications. We saw some pipelines already. But essentially, let's say you have some raw reads.
You align these raw reads to get an aligned file. Usually it's a BAM file. The only thing to remember is that you need another tool called BEDTools, because ChromHMM doesn't take BAM files as input. Instead, it takes BED files as input. With BEDTools, you can easily convert any BAM file to BED.

In terms of the workflow, these are the steps that you usually perform to get the ChromHMM segmentation. The first thing is that you need ChIP-seq data for different histone modifications. One question is: which histone modifications do I need? There are different options, but I suggest that you look at the Roadmap Epigenomics project: they selected five histone modifications. You can use these five histone modifications if you want to design a new experiment. The advantage is that they have already trained a ChromHMM model, so you don't have to retrain your model and it will be really easy to compare with the Roadmap Epigenomics data. The second step, after you have the data, is to align the reads to a reference genome. Please be sure that you remember which reference genome you are using, because you need it when you run ChromHMM. At this stage you probably have a BAM file, so the third step is to convert it to BED format. Then steps four, five, and six happen inside ChromHMM: it will bin and binarize the tracks (we will see this in a minute), it will train a model, and it will infer the states. And finally, I will give you some idea of how you can interpret the output.

So now I will go step by step. Aligning is not the main point of the talk, but just the basic idea: you have a FASTQ file that you usually get from your Illumina machine, you have your favorite aligner, which can be Bowtie, BWA or STAR, and you get a BAM file. Or, much easier.
Since we have this nice ENCODE portal, you can just go to the portal, download a bunch of data, and play with it.

The next step is really important, and many people get stuck here. They say, oh, I need to run ChromHMM, but it doesn't take my BAM file. To convert a BAM file to a BED file, which essentially contains the same information as the BAM file, that is, the aligned reads, you need to run this command, once for each of your BAM files. Let's assume that you have a BAM file called cell1mark1.bam. To convert it to a BED file, you just run this command: you have bedtools bamtobed, which means that we want to convert a BAM file to a BED file, then the input file, and then we redirect the output to the output file. You have to remember where you are putting all these BED files, and please put them all in the same folder, because ChromHMM needs to have all the files in the same folder.

Now, the first step of ChromHMM is what they call the binarization of the tracks. In this step, if you have multiple tracks for these histone marks, first you need to decide the resolution. What they do is divide the genome into bins, and the default is 200 base pairs. So essentially you are dividing the genome into these bins, and where you have a strong signal, you will put a one. Well, not you: ChromHMM will put a one. At the end of this step, you get what they call binarized tracks, where you have a one if the histone mark is present and a zero otherwise. So what do you need to do for that? Let's assume that you put all your data in this folder.
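As an aside, the binarization idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ChromHMM's actual implementation: it assumes a single chromosome, no control track, a background rate equal to the average read count per bin, and an illustrative Poisson tail probability of 1e-4. The function names `poisson_tail` and `binarize` are hypothetical.

```python
import math


def poisson_tail(k, mean):
    """Return P(X > k) for X ~ Poisson(mean)."""
    cdf = sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k + 1))
    return 1.0 - cdf


def binarize(read_starts, chrom_length, bin_size=200, tail_prob=1e-4):
    """Count reads per bin and mark a bin 1 if its count is unlikely under
    a Poisson background (assumed: mean = average count per bin)."""
    n_bins = math.ceil(chrom_length / bin_size)
    counts = [0] * n_bins
    for pos in read_starts:
        counts[pos // bin_size] += 1

    mean = sum(counts) / n_bins
    # Smallest count t such that P(X >= t) < tail_prob under the background.
    threshold = 1
    while poisson_tail(threshold - 1, mean) >= tail_prob:
        threshold += 1

    return [1 if c >= threshold else 0 for c in counts]


# Toy data: 20 reads piled up in the fourth bin, plus two isolated reads.
toy_reads = list(range(600, 620)) + [10, 1050]
toy_track = binarize(toy_reads, chrom_length=2000)
print(toy_track)
```

In the toy data only the bin with the pile-up of reads is called present; the two bins with a single read stay at zero.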
So let's assume that all the BED files that you converted in the previous step are in this folder. To run this step, you open a terminal and you say java, and then this option here allocates memory: essentially, it tells Java how much memory it can use on your machine. If you have many marks, you want to increase this number. Another thing to keep in mind is that if you are using 32-bit Java, you cannot go above this number. So if you want to run with many marks, please double check that you have 64-bit Java, otherwise it will fail. Then we call the main file, ChromHMM; when you decompress the zip file, you will see this file inside the folder. ChromHMM has a set of commands that you can use. Here you need to call BinarizeBed, which performs the operation that I just described. This -b 200 means that we want to segment the genome into 200-base-pair bins, but you can change this number if you want. Then there is this CHROMSIZES hg18: CHROMSIZES is a folder that you already have in the ChromHMM package, and the files in it contain the size of each chromosome. The hg18 should match the genome that you used in the alignment, otherwise the result will be totally useless. So please double check that it matches your aligned files, especially when you download from the ENCODE portal.

Then you need to write a file that they call the cell mark file. It can have any name. Inside this file you have one row for each track, that is, for each BED file. The first entry is a label for the cell type, for example cell1. Then you need to put a label for the mark.
That may be, for example, H3K4 monomethylation. You can put any name, but you should use the name of the mark. Then you put the BED file that you converted in the previous step. And if you have an input control for your ChIP-seq, you can put it here. As you can see, the control is usually shared by all the tracks for the same cell type. Another thing to keep in mind is that you can put data for multiple cell types in the same file. This means that you will learn one model using all the data together. This is something you should do if you have multiple cell types: you should not learn a separate model for each one, otherwise it will be difficult to compare them. Finally, the last parameter is the output folder, where ChromHMM will store the binarized tracks. So this step will do essentially this, just to be sure we are on the same page. Any question on this first step? Yes.

So the question is: do I have to select a threshold when we binarize this data? No, you don't have to decide. Internally they select a threshold for you, and it's based on a Poisson distribution. You don't have to specify any threshold; they do this calculation for you and you automatically get the binarized tracks. Yes?

Yeah, you can actually run on Windows. If you have Java, you can run it. The only thing that is tricky on Windows is BEDTools: you need a compiler to install BEDTools. But there is a nice project called Cygwin that you can download, and you can compile BEDTools inside Cygwin. So you can do all the conversion inside Cygwin and then use the normal terminal.

So, I have a question.
Do we need to use only uniquely aligned reads, or if reads align to multiple places, will that interfere with ChromHMM's calculations?

Usually people first filter the reads and keep the unique ones. So yes, it's better if you filter out the reads that map to multiple locations. That's a good point.

Okay, so up to here we have these binarized tracks, but we still don't have the chromatin states. To get the chromatin states we need to train a model, and for that we need just one more step, so we are almost there. In the previous step we created this folder, SAMPLEDATA_HG18; it is the one we specified as the last parameter, so all the binarized tracks are there. This will be the input for the next step. What we are doing now is taking these binarized tracks and running the LearnModel command of ChromHMM, and we get the model plus the segmentation. Running it is similar to what we saw before: we have java, blah, blah, blah, ChromHMM, but now the command is different. It is LearnModel, then the input folder where we stored the binarized tracks, and then the output folder where we want to store the final output of ChromHMM. This 10 is the number of chromatin states that we want to learn. This number is something you have to decide depending on how many marks you are inputting. There is no rule; you have to play with it a little bit. For example, if you see too many states that are similar, maybe this number is too big. And finally, you need to specify the reference genome again. If you run the example, after 10 minutes you will see the output of ChromHMM.
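For reference, the two ChromHMM steps described in the talk can be assembled as command lines from Python. This is a hedged sketch rather than the slides' exact commands: the folder names (`my_bed_files`, `my_binarized`, `my_output`), the file name `cellmarkfile.txt`, and the memory setting `1600M` are placeholders chosen for illustration, and the helper function names are hypothetical. The argument order follows what the speaker describes for BinarizeBed and LearnModel.

```python
import shlex


def binarize_bed_cmd(chrom_sizes, bed_dir, cell_mark_file, out_dir,
                     bin_size=200, memory="1600M", jar="ChromHMM.jar"):
    """Build the BinarizeBed command: bin size, chromosome sizes file,
    folder with BED files, cell mark file, output folder."""
    return ["java", f"-mx{memory}", "-jar", jar, "BinarizeBed",
            "-b", str(bin_size), chrom_sizes, bed_dir, cell_mark_file, out_dir]


def learn_model_cmd(binarized_dir, out_dir, n_states, assembly,
                    memory="1600M", jar="ChromHMM.jar"):
    """Build the LearnModel command: binarized input folder, output folder,
    number of states, genome assembly."""
    return ["java", f"-mx{memory}", "-jar", jar, "LearnModel",
            binarized_dir, out_dir, str(n_states), assembly]


# Placeholder paths; replace them with your own folders and files.
step1 = binarize_bed_cmd("CHROMSIZES/hg18.txt", "my_bed_files",
                         "cellmarkfile.txt", "my_binarized")
step2 = learn_model_cmd("my_binarized", "my_output", 10, "hg18")

print(shlex.join(step1))
print(shlex.join(step2))
# To actually execute a step from the ChromHMM folder, one could use
# subprocess.run(step1, check=True), which requires Java and ChromHMM.jar.
```

Building the commands as lists keeps the genome assembly and the folder names in one place, which makes it harder to binarize against one assembly and learn the model against another.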
The nice thing is that they create a web page report that contains links to all the output files. It's usually called webpage_N, where N is the number of states that you chose in the previous command. You get these three outputs. The first is the model that we learned from the histone modifications; I will show it to you in a minute. The second is the enrichment for functional categories, which is important because you can use it to give meaningful labels to your states. And finally, there are the BED files to visualize the segmentation. Okay, let's take a look at these.

This is essentially the model that we learned. It is the table that I showed you before, but before we had the labels. In the output of ChromHMM we get just this; we don't have labels. But we'll see in the next slide how we can use annotations. Each row again is a chromatin state, and the columns are the histone marks that you used. The transition parameters tell you, if I am in state one, for example, what the probability is of staying in state one or of going to state two, and so on. This is what they call transition parameters, and this is another important piece of information.

Now you have another table where, again, the rows are the states and the columns are different annotations like CpG islands, exons, gene bodies, and the TSS. This will help you a little bit in labeling the states. And this is a meta-profile around the TSS of all the genes for each state. You see that, in this data, states one and two are closer to genes, that is, to the TSS or the gene body. So these are probably states related to promoters or to what they call elongation.
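To make the labeling step more concrete, here is a minimal Python sketch of a fold-enrichment calculation of the kind shown in the enrichment table. It assumes a common definition (the fraction of a state's bins that overlap an annotation, divided by the fraction of all bins that overlap it) and is not taken from ChromHMM's code; the function name `fold_enrichment` and the toy data are hypothetical.

```python
def fold_enrichment(state_per_bin, annotated_bins, state):
    """Fold enrichment of `state` for an annotation, both given per bin.

    state_per_bin: list with the state assigned to each genomic bin.
    annotated_bins: set of bin indices that overlap the annotation (e.g. TSS).
    """
    total_bins = len(state_per_bin)
    bins_in_state = [i for i, s in enumerate(state_per_bin) if s == state]
    if not bins_in_state or not annotated_bins:
        return 0.0

    overlap = sum(1 for i in bins_in_state if i in annotated_bins)
    fraction_of_state = overlap / len(bins_in_state)
    fraction_of_genome = len(annotated_bins) / total_bins
    return fraction_of_state / fraction_of_genome


# Toy segmentation of 10 bins into 3 states; bins 0 and 1 overlap a TSS.
toy_states = [1, 1, 2, 2, 2, 3, 3, 3, 3, 3]
toy_tss_bins = {0, 1}

for s in (1, 2, 3):
    print(s, fold_enrichment(toy_states, toy_tss_bins, s))
```

In this toy example state 1 is strongly enriched at the TSS and the other two states are not, so state 1 would be the natural candidate for a promoter label.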
In terms of the segmentation, if you look in the output folder of ChromHMM, you have some BED files. There are many BED files; the ones you need to visualize the output are the ones that end with _dense, so something like the cell type name that you specified followed by _dense. This is in my Word document, so you don't have to remember it. To visualize these tracks, you can upload them to the genome browser. We had a really nice tutorial yesterday on how to visualize BED files in the genome browser. Or, if you want to visualize them on your own machine, there is a really nice piece of software from the Broad Institute called IGV that you can just download. To visualize the file, you open the software, select the right genome, and then just drag and drop the BED file onto the window. For each cell type, you will see something like this. Again, the colors correspond to the states. Here you have a track with the genes, and you can also load other annotations. Do you have any questions up to here? Yes.

So how long does it take to run ChromHMM?

This depends on how many marks you have. For one cell type, if you have around five marks, on a normal machine, not a super fast one, it usually finishes in one or two hours. One thing to keep in mind is that, if you read the documentation, they just released a new option called -p. It means that you can run using all the cores on your machine, so it will be much faster. Yes?

So in your instructions, you were talking about using this with histone modification data. But this has also been used with TF data, and it should work with any data like DNase, or [unclear], or anything, right?
Yeah. In general, it's like a fancy clustering. So whatever you put in, you get combinations, states that are defined based on histone modifications plus transcription factors or other chromatin data.

So I'm not sure if this is crazy, but is it possible to run this on DNAnexus, with everything automatic? You just pick a few files, you give them to ChromHMM, and five hours later it gives you the HTML.

I think so. It can be automated. You just need one file, so what they can do internally is create a pipeline that essentially converts from BAM to BED with BEDTools, runs the two steps of ChromHMM, and then just outputs the web page. So yeah, it's totally possible. Yes?

So to create your binarized files, you said you don't want to use just one cell type. Is there a minimum number that you should include?

No, actually, if you're interested in just one cell type, you can do that. What I'm saying is that if you want to compare chromatin states across multiple cell types, you want to learn them all together.

So you wouldn't want to look at a previous ChromHMM model and compare it to your cell type?

If you want to use a previous model, for example the model trained on the Roadmap Epigenomics data, which is a really robust model, then you want to use the same histone marks, the same alignment step, and all the steps that they followed. Then you can just segment your new cell type. That's the way. But if you have, I don't know, three or four cell types and you want to learn chromatin states on all of them, maybe also using some transcription factors, it's important that you learn the model on all of them together. Otherwise you will have a different model for each cell type, and it may be misleading to compare the segmentations.
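The advice to learn one model across cell types comes down to listing every cell type in the same cell mark file. The Python sketch below writes such a file as a tab-delimited table with the columns described in the talk (cell type label, mark label, BED file, and the input control). The cell names, mark names, and file names are hypothetical placeholders, and the helper `write_cell_mark_table` is not part of ChromHMM.

```python
import csv
import io


def write_cell_mark_table(rows, handle):
    """Write one tab-delimited row per track: cell, mark, BED file,
    and the control BED file when one is given."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for cell, mark, bed_file, control_file in rows:
        row = [cell, mark, bed_file]
        if control_file:
            row.append(control_file)
        writer.writerow(row)


# Two hypothetical cell types, two marks each; the control is shared
# by all tracks of the same cell type.
rows = [
    ("cell1", "H3K4me1", "cell1_H3K4me1.bed", "cell1_input.bed"),
    ("cell1", "H3K27me3", "cell1_H3K27me3.bed", "cell1_input.bed"),
    ("cell2", "H3K4me1", "cell2_H3K4me1.bed", "cell2_input.bed"),
    ("cell2", "H3K27me3", "cell2_H3K27me3.bed", "cell2_input.bed"),
]

buffer = io.StringIO()
write_cell_mark_table(rows, buffer)
table_text = buffer.getvalue()
print(table_text)
```

Using the same mark labels for every cell type is what lets a single model describe all of them, so the resulting segmentations share one set of states.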
One last question, a very quick one. Can you go back and comment a little bit on the segmentation? You have, like, repressed, insulators. What are those different possible outputs?

These colors are from the paper in Nature on chromatin states in nine cell lines. They defined them based on different annotations. If you run with different marks, you will probably get different labels. So it's not something general that I can explain to you now; it depends on what you put in the model.

I just want to finish this talk with these references. ChromHMM is not the only model that does this. There is Segway, which is another model; if you run it, you get single-base-pair resolution. The only thing is that it's a little bit slower than ChromHMM. Then there is a new method, Spectacle, that is similar to ChromHMM but much faster because it uses spectral learning. And in the lab, in collaboration with Manolis, we are developing an extension of ChromHMM that essentially tries to solve the problem of resolution, so you can learn chromatin states simultaneously at different resolutions. You have the nucleosome resolution, but you also have the domain resolution, because sometimes you have different patterns at different scales. I just want to thank Manolis and Jason for this nice tool, and I will stop here.

Cool. Thanks, Luca.