Okay, let's settle in everyone, we'll be starting in about two minutes. This is the advanced analysis tool session, so hopefully you'll have a lot of fun with this one. There are going to be four interactive workshops. The first will be presented by Jason Ernst on his ChromHMM tool, which is very useful for integrating various types of histone modification data sets to infer chromatin states in genomes. And with that, I'll let Jason take it from here. Thank you, organizers, for giving me this opportunity to speak with you. This tutorial is going to be divided into three parts. First I'm going to give you some general background on chromatin states and ChromHMM. Then I'm going to talk about how you can access existing ChromHMM annotations of the human genome. And finally, how one could go about running ChromHMM on your own data. As we've heard throughout the workshop, ENCODE has generated many histone modification data sets, and there are multiple different types of histone modifications in terms of the histone protein, the amino acid residue, and the chemical modification. Each modification can give you some indication of what type of genomic elements are active or repressed in certain cell types. But in general, there's more than one mark being mapped in any given cell type. And we reasoned that by integrating multiple different tracks and reasoning about their combinatorial and spatial patterns, we could take that information and give a systematic annotation to the genome: both discover the patterns that we're observing and then assign each location in the genome to being an instance of some pattern. We term these chromatin states. The underlying model for this was based on a multivariate hidden Markov model. So what we did was preprocess the genome into 200 base pair, non-overlapping intervals.
And we made a binary presence or absence call if we had enough reads supporting some modification being present, based on a Poisson distribution background model. And then we make this assumption that there are various biological entities underlying the genome that might be reoccurring, whether it's enhancers, gene starts, or gene bodies, that we don't observe directly; we just observe these histone modifications. And we do this all unsupervised. So what we do is discover the major patterns associated with these types of biological entities. Formally, we have the states of the hidden Markov model. These are hidden states that are associated with different emission probabilities, and we model the emissions with a product of independent Bernoulli random variables. So depending on what state we're in, we would have a different probability of observing each modification being present. And then there are also transition probabilities between the states. So, for example, we could determine that we're in a different state here than here, even though we didn't observe any marks at either place, just based on the spatial information. The application we had back in 2011 was a collaboration with Brad Bernstein's ENCODE production group, where we had mapped nine marks across nine different human cell types, ENCODE cell lines. This is looking at all the data at one single location. Different colors correspond to different cell types, and each individual line is a different track. And what we did was conceptually apply the same modeling approach as if we were looking at one cell type, but we concatenated the cell types, treating them as if they were different chromosomes. So we learned one set of state definitions across all the cell types, but then we would have cell-type-specific state assignments. So what you're seeing here: the rows correspond to different states, and the columns correspond to different input modifications.
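To make that binarization step concrete, here is a small Python sketch of the idea: call a mark present in a 200 base pair bin when its read count clears a Poisson tail threshold. The p-value cutoff, function names, and numbers here are my own illustration of the concept, not ChromHMM's actual implementation.

```python
from math import exp, factorial

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed via the complement of the CDF."""
    return 1.0 - sum(exp(-lam) * lam**i / factorial(i) for i in range(k))

def binarize_threshold(lam, p=1e-4):
    """Smallest read count whose Poisson tail probability falls below p."""
    k = 0
    while poisson_sf(k, lam) >= p:
        k += 1
    return k

def binarize(counts, lam, p=1e-4):
    """Call each 200 bp bin present (1) or absent (0) against the threshold."""
    t = binarize_threshold(lam, p)
    return [1 if c >= t else 0 for c in counts]
```

With a background rate of one read per bin, for example, only bins with clearly elevated counts get a 1, which is the presence/absence matrix the model actually sees.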
And the values here correspond to the probability that we would observe that modification if we're in that state, so blue means higher probability. Once we learn one of these models, we can go across the genome, assign each location to being an instance of a state, and then compute various enrichments. So, for example, we might see some states that are heavily enriched close to gene bodies or promoters. And then, based on the modifications and the enrichments, we characterize these states into various classes: promoters, whether more active or inactive or poised, different classes of enhancers, insulator regions, weaker transcription, heterochromatic regions. One thing I want to point out is that sometimes we might see two states which have relatively low signal, but the transition structure, which I'm not showing you directly, can be very different between them. These are cases where you can have two different classes of broader domains with very different properties. So even if you see two states with low probabilities, that doesn't mean they're not distinct from each other, and they can both represent large parts of the genome. This is another view of what we've done, taking the same gene in four different cell types. Each row here corresponds to a heat map of the original input data, so darker intensity means a higher presence of that mark. And then we can summarize this data into a single chromatin state call, or color coding. In this case, at the time, we had nine cell types, and somebody could go across some location in the genome and quickly get a sense of what type of chromatin state it's in. And you can start seeing variability here. For example, this gene promoter is in a poised state here, it's in a repressed state here, it's in a sort of empty, inactive state here, and more of an active state in these five cell types. And these tracks were made available on the UCSC genome browser.
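As a toy illustration of the emission model just described, here is how a state's probability of an observed binary mark vector works out as a product of independent Bernoullis. The two example states and their parameter values are made up for illustration.

```python
def emission_prob(state_params, observed):
    """P(observed mark vector | state) as a product of independent Bernoullis.
    state_params[m]: probability mark m is present in this state;
    observed[m]: 1 if mark m was called present in this 200 bp bin."""
    p = 1.0
    for q, x in zip(state_params, observed):
        p *= q if x == 1 else (1.0 - q)
    return p

# Hypothetical two-state example over three marks:
promoter_like = [0.9, 0.8, 0.1]
quiescent     = [0.05, 0.05, 0.05]
obs = [1, 1, 0]
# The promoter-like state explains this bin far better than the quiescent one.
```

In the full model, these per-state emission probabilities are combined with the transition probabilities when assigning each genomic position to a state.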
And then we used these to start interpreting disease-associated variants, and we found enrichment in certain states, some of the cell-type-specific enhancers, for GWAS variants. It's been used in a recent paper in the New England Journal of Medicine to interpret the FTO obesity locus, and there are many other applications in the literature of using these chromatin states to interpret disease-associated genetic variation and epigenetic variation. So now, how can you go about accessing some of these chromatin state annotations? I'm going to focus on how to access the largest collection of chromatin state annotations in terms of the number of cell types. We had some previous models based on six or nine cell types, and they're available in the UCSC genome browser as well. But I want to walk you through this chromatin state model, and a couple of related ones, that were part of a paper based on the Roadmap Epigenomics consortium effort. It was based on 111 reference epigenomes produced by the Roadmap Epigenomics consortium, as well as 16 produced by ENCODE phase two that were reprocessed through the same pipelines as the Roadmap data to give a uniform set of chromatin state calls. And this is showing you here, in this color coding, different high-level groupings of these cell types. So we had coverage of very diverse human cell and tissue types. And we had a model here learned on five core histone modifications which were mapped across all 127 of these reference epigenomes. This is showing you a browser view at one location. And the states were classified into, again, different types of promoters, enhancers, gene regions, repetitive and heterochromatic regions. Then we had another model, which was based on six marks, including H3K27 acetylation. This one had 18 states, and it was defined on 98 cell and tissue types. One thing you might notice here is that this matrix is incomplete.
And if we wanted to keep defining these chromatin state models on a set of cell types for which we have the same marks in every cell type, then as we keep adding marks, we would get fewer and fewer cell types. So instead, we used an alternative strategy where we first imputed the epigenomic data. I won't have time to go into the details of this method, but it was published in 2015. The idea was that we could leverage the fact that, for any given cell type we were interested in, we had other marks mapped in it, and we had that mark mapped in other cell types. So we could use both types of information, how that mark behaved in other cell types and how other marks behaved in the cell type you're interested in, to figure out computationally what one of these ChIP-seq or related experiments should look like without actually doing the experiment. And then once we had the imputed data, we learned a chromatin state model uniformly across the imputed data for the 12 marks for which we had the most data to perform the imputation on, and we defined 25 different states. That model is described in this paper, and this is a view of the chromatin state annotations across all 127 cell and tissue types. So just to summarize: we now have two models based directly on the observed data, one based on five marks and one based on six marks, including H3K27 acetylation. And we also have one based on 127 cell and tissue types and 12 marks, but based on imputed data. And this is a summary of the chromatin state annotations. So we have a color coding of related states and then a candidate annotation. When we give one of these annotations, that's a semi-automated part where the human comes in and describes these states based on the enrichments and such. So it shouldn't be taken too literally, but it gives you a sense of the differences between these states. So now, how can we go about bringing these up in the UCSC genome browser?
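Just to convey the two sources of information being combined in the imputation, here is a deliberately simplified sketch. The real method (ChromImpute) combines features like these with ensembles of regression trees, not a fixed-weight average; the function name and weights here are arbitrary illustrations.

```python
def impute(same_mark_other_cells, other_marks_same_cell, weights=(0.5, 0.5)):
    """Toy blend of the two information sources described above:
    same_mark_other_cells: this mark's signal at this position in other cell types;
    other_marks_same_cell: other marks' signal at this position in this cell type.
    (ChromImpute itself learns this combination with regression tree ensembles.)"""
    a = sum(same_mark_other_cells) / len(same_mark_other_cells)
    b = sum(other_marks_same_cell) / len(other_marks_same_cell)
    wa, wb = weights
    return wa * a + wb * b
```

The point is only that neither source alone is enough: if a mark behaves unusually in your cell type, the within-cell-type features help, and vice versa.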
I won't do a live demo because the internet's a little shaky, but feel free to try on your own computer. So you need to be in the UCSC genome browser, in the browser view, and you want to be in hg19. Then you want to click on the button Track Hubs in order to access this menu. Once you go to the track hubs, you'll see a list of possible places you can connect to, and you want to connect to the Roadmap Epigenomics Integrative Analysis Hub. Just to note, there's another track hub which sounds similar and has some overlapping data, called the Epigenomics Data Complete Collection, but don't click that one; click this one, the Roadmap Epigenomics Integrative Analysis Hub. And you click Connect. These slides should be available on the website if you need to go back to a step as I go through this. Once you click Connect, it should take you back to this gateway screen, and if you click Go, it'll take you back to the browser. You should now see this menu here, Roadmap Epigenomics Integrative Analysis Hub. And then you'll have this first one on that menu bar, which stands for consolidated by assay. You want to set that to show, and then you want to click on it. Then you want to click on the first row there, ChromHMM, make sure it's set to show, dense, checked, and then click on that ChromHMM option. And this is where you have the choice of what to view. The primary model is the one based on the five marks, directly on the observed data. The auxiliary model is based on the six marks on 98 of the cell types, and you'll see there are some missing boxes: those are the ones that didn't have H3K27 acetylation. And the imputed one is based on the 12 marks using the imputed data. Then the data type here needs to match, or it's okay to have both checked; by default, you can check both of them. And in this case, I click this plus at the very top to highlight all the imputed tracks.
If you were just interested in a couple of cell types, you wouldn't necessarily have to click everything. And then I've set it to dense. So with this, it's going to bring up all the imputed tracks. You hit Submit after you've checked that, and then you have on your browser a view of the chromatin state annotations across these 127 reference epigenomes. These slides are on the website if you need to go through one of these steps again, and I'll be available at the session this evening if you got hung up on some step. So now I'm going to talk about how to actually run the ChromHMM software, if you had new data that you wanted chromatin state annotations for, or you wanted to process some existing data in a different way. In this case, you would want to go to the ChromHMM website and download the software. It's about a 30 megabyte file, so depending on what the connection speeds are like right now, it might take a few minutes to download if you're interested in running it right now. And just to point out some other things about this website: there's a manual here which has all the details about a lot of the commands and options that I won't have time to talk about. We also have some links to existing chromatin state annotations, the ones I mentioned through the Roadmap portal, as well as direct links to the UCSC browser for some of the older models based on the ENCODE cell types, and some chromatin state annotations produced by Ross Hardison's group in mouse as part of the mouse ENCODE effort. You can also subscribe to the mailing list to get announcements of new versions. So, if you've downloaded the software, you'll get a file, ChromHMM.zip, and what you want to do is unzip it. This is assuming you have Java already installed on your system; other than Java, there are no dependencies for ChromHMM.
After you unzip it, you need to open a command line, and then you want to change into the ChromHMM directory where the ChromHMM.jar file is sitting. Then you want to enter this command, and I'll show this command again, but if you're following along on your computer, now's a good time to enter it. It's all one line. This is going to run ChromHMM on the sample data. I'll walk people through what all these options mean, but it takes a few minutes to run. So if you're following along on your own computer, you want to type in this command, and it's also on the slides if you haven't copied it yet. So I hit enter, and it's going to start learning a model, giving you progress updates, and as it's running, it's writing the latest model found to the output directory. So if you were eager to start seeing the model, and potentially just wanted a quick version, after a few iterations it can already give you a good sense of the model, and then it makes incremental improvements later on. Now, to discuss the input to ChromHMM. The actual modeling in ChromHMM is based on binarized data, so the modeling part sees ones and zeros at some resolution; the default is 200 base pair resolution. There are multiple ways of producing that, but the recommended way is to give ChromHMM a set of aligned reads. This can be either in BAM format or in BED format, and there's a different command, either BinarizeBed or BinarizeBam, depending on the format of your aligned reads, but otherwise the commands are the same. I'll walk you through this BinarizeBed command, and it's the same if it was a BAM file. So you have here the java to say you're using Java, then you specify the amount of memory Java should have access to, then -jar and ChromHMM.jar. Depending on the size of your project, you might need to increase the amount of memory.
Then there's the command to ChromHMM. There are several high-level commands, such as BinarizeBed, BinarizeBam, and LearnModel, and if we have time I'll talk about some of the other high-level commands; they're all described in the manual. Then there's a file which has the lengths of the chromosomes, so this needs to point to the chromosome length file. There's a directory, CHROMSIZES, which has a number of files already preloaded. If you're going to be working with a different assembly that ChromHMM doesn't support by default, then you need to download it, but most standard assemblies are already in that directory, so you just need to specify that file. The path separator might be a forward slash on your system. Then there's the directory where the BED files are sitting, or BAM files if you had BAM files. Then there's a file here which gives the overall design to ChromHMM. The first column, assuming you're doing the learning in a concatenated form where you had multiple different cell types, would specify the different cell types. The next column specifies the different marks, and then the BED or BAM file corresponding to that cell and mark, and it can also take zipped files. Then you can also specify a control file; this is optional, and you can also use control data as a feature as well. If you specify a control here, it'll binarize the data taking into account the background level of reads.
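The design file just described is a plain tab-delimited table. As a sketch, this snippet writes one with hypothetical cell type, mark, and file names; the optional fourth column is the control file.

```python
# Rows: cell type, mark, aligned-reads file, and (optionally) a control file.
# All file and cell names here are hypothetical placeholders.
rows = [
    ("cellA", "H3K4me3",  "cellA_H3K4me3.bed.gz",  "cellA_control.bed.gz"),
    ("cellA", "H3K27me3", "cellA_H3K27me3.bed.gz", "cellA_control.bed.gz"),
    ("cellB", "H3K4me3",  "cellB_H3K4me3.bed.gz",  "cellB_control.bed.gz"),
]

# Write the tab-delimited design file that the binarization step reads.
with open("cellmarkfiletable.txt", "w") as out:
    for row in rows:
        out.write("\t".join(row) + "\n")
```

Listing two cell types this way is what sets up the concatenated learning described earlier: one set of state definitions, with per-cell-type assignments.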
The last required option is the output directory where the binarized data should be written. And this is the output of ChromHMM: if you were running LearnModel on the sample data, you would get a report like this. I'll talk about this output in a second, but just going back to the presentation of how you get to it, if you didn't copy that command in before, here it is again. To walk you through these options: this part is similar to before, and now LearnModel is the first high-level command. This is one non-default option: I set it to -p 0, which allows ChromHMM to access as many processors as are available on a multi-processor machine. So this is one way to speed up ChromHMM learning: run it on a multi-processor machine and give it a number of cores. If you don't specify anything, it'll just use one core, with a slightly different learning method that doesn't use the parallelization; you can also specify exactly how many cores you want it to use. Then there's the directory which has the binarized input that would have been produced by BinarizeBed or BinarizeBam, then the directory where you want the output files to go. This is where you'll have the segmentations ChromHMM produces, as well as the model files and some automated enrichments it computes. Then you specify the number of states. In practice, oftentimes what we'll do is learn models with different numbers of states on a cluster, then compare what types of states are coming out, and at the level of biological interpretation choose one model to analyze in more depth. And then we specify the genome assembly; the sample data was from an older assembly, hg18, but you can specify other assemblies here as well. Now, if you had run ChromHMM on your computer, you would have had something like this automatically opening in your browser, if you had a browser enabled; otherwise, it'll just write this to the directory.
And what it gives you is an output where the first part is a summary of what the options were, so you have a record of what you ran ChromHMM with: these were the required options, and this was the full command. Then, if you look right under model parameters, you see the emission parameters of the model. This is showing you, in darker shading, that if we're in that state, we have a higher probability of observing that mark. In addition to having this in a regular image format, you can also get an SVG format or download it as a tab-delimited text file. You can also view the transition probabilities: this is telling you, if you're in some state, what's the probability you would be in some state at the next location in the genome. That's how ChromHMM is able to capture a lot of spatial information and can often differentiate between two states that might have similar low emissions; some subtle differences can make a significant impact with the transition information to capture broader domains. Then you'll see the model parameter file. This isn't designed to necessarily be human readable, but if you want to run ChromHMM again and produce the segmentation without relearning the model, this file is useful for some of the other commands. Then you have the segmentation files, produced in multiple different formats. The standard one, which is the easiest to use for additional computational processing, gives the chromosome, the start, and the end coordinate in a BED file, along with the annotations: the number here corresponds to which state it is, and the E indicates the states were originally ordered by emission. There are options to reorder the states, and then they can get a different prefix. Because we ran this on two different cell types, we have two different segmentation files, one for each cell type. Then we have a set of browser files, where one can take these files and load them into browsers such as the UCSC Genome
Browser or IGV. There are two different types of formats: one is a dense file, which allows you to view it as a single track with different color codings for the different states; the other is an expanded format where you have one state per line and each line's height is fixed. Then you'll start seeing some automated enrichments available that ChromHMM computed based on a set of coordinates that were preloaded. You have the option, before you run ChromHMM, to add additional coordinates you're interested in comparing this segmentation to, and it'll show those too. Or, at a later time, there's another command, OverlapEnrichment, where you can run that with the segmentation and get additional enrichments computed. What this is showing you is the relative enrichment for different categories in different states, and if you actually click on this file, you can see numerical fold enrichments for each one of these categories. It also provides positional enrichment plots; for example, it's showing you that certain states are more enriched over transcription start sites than others. Again, all these files can be downloaded, also in text or SVG format. And the enrichments, again, are computed per cell type, so we'll have different enrichments for each cell type. In the remaining few minutes, I just want to mention briefly some additional commands that are available in ChromHMM. These are top-level commands, analogous to BinarizeBed, BinarizeBam, or LearnModel. The first one we have is CompareModels. If you've learned a set of models with different numbers of states, you can ask, for a fixed reference model, usually the model with the most states, to what extent are there states in the models with fewer states for which there's a correlation between some state of that model and each of the states in the larger model. So, for example, what this is saying is there's some state here in this model of 51 states for which all these
other models are heavily correlated in terms of the emission parameters, but once you go below 34 states, it doesn't correlate well. That could suggest that you're potentially missing whatever this state represents, so you might want to go and inspect what that state is, and if you feel that that state is biologically important to your analysis, then you've effectively established a lower bound on the number of states you should choose. So this is a quick way to get a sense of some of the trade-offs between different numbers of states. You can also regenerate these browser files and use a different color scheme, and you can add labels to the states, give them state names, and then that can be incorporated into the browser file. Then there's also OverlapEnrichment, which I briefly mentioned before, which allows you to rerun the enrichments that automatically get computed, but now you can also add additional files that you might not have had preloaded in the input coordinate directory. You can do a similar thing with NeighborhoodEnrichment: after you have the model segmentation produced, you can rerun that later with a set of anchor positions and get positional plots. And you can also reorder the states of the model: ChromHMM has a default ordering of the states, but you can decide that you would prefer them ordered in a different way, and you can reorder the model. So, just to summarize: I presented to you a method for annotating genomes based on integrating multiple different epigenetic marks. We've applied ChromHMM to more than a hundred different cell and tissue types, and those annotations are available on the UCSC Genome Browser; they're also available directly on the Roadmap portal, which I didn't have time to show you. The ChromHMM software is available for you to run on your own data, and that's the URL. I'd like to acknowledge that a lot of this work was done while I was a postdoc with Manolis Kellis, and I continued some
of this work after I'd moved to UCLA. A lot of the ENCODE work was done in collaboration with Brad Bernstein and his production group, and there were a whole lot of people within the Roadmap Epigenomics consortium involved in producing and processing these data. So thank you for your attention. Thank you very much for your talk. I have a question about ChromImpute. It would make sense to me that germline cells might have correlations of features; it also makes sense to me that cancer cells might be a lot more whacked out and show fewer correlations. Is that something you looked at in making ChromImpute, or can you comment on that? So the question is, to what extent can we impute data better in some types of cells than in others? Yes. Yeah, so the Roadmap data was largely normal cell types, and then we did have a few ENCODE cancer cell lines; we didn't notice a large difference for those, but we didn't do primary cancer cell types, so I can't comment exactly on how it would perform on those. But in general, the assumption behind it is that the correlation structure between histone marks is relatively well preserved across cell types. So if you have some sample for which that assumption is no longer true, and marks don't correlate how they're usually correlated, and you haven't seen that mark at that position in any other cell type, then it would be very difficult to impute with the assumptions it's making. Yeah, two things, I guess. The first is, I'm wondering to what extent you've observed the quality of the ChIP-seq data influencing the inferred hidden states, and then secondarily, if you look at biologically meaningful subsets of the data, how stable are the hidden state definitions?
Yeah, I mean, it's sort of garbage in, garbage out, so it's really going to depend on the quality of the data. One thing with the chromatin states is that you're making the annotation based on multiple different tracks, so if one of the tracks is of mediocre quality but you had enough other tracks, it can still give you a reasonable set of state assignments even if not every track is ideal. And then, in terms of robustness to subsets of the data, we have done things where we've taken a model, taken the two different replicates associated with it, and then seen to what extent the state assignments agree between the replicates, and we get relatively good agreement. You'll have some states that are more similar to each other, and you'll have some confusion among those states. I was wondering about normalization across tracks: if some tracks have more reads than others, how does that affect the state calling? I don't know if you've already answered this question. Right, so if you had arbitrarily many reads in one experiment and you just used the default Poisson binarization, at some point it would start calling present for anything with even a very weak fold enrichment, in the limit. What we did in the Roadmap consortium effort is we subsampled all the experiments that were above 30 or 50 million reads, depending on the mark, to have a little more balance. So if you want to subsample the reads, you can do that outside of ChromHMM and then just give it the subsampled reads. And then... Yeah, but I mean... You recommend starting with a similar number for all...
Yeah, I don't think it's very sensitive to small differences, but if you have gross differences, that is something you should be aware of. The other thing is that ChromHMM has an option to add a fold enrichment cutoff, which can also handle the problem if you had extremely deep sequencing: you can just add a fold enrichment cutoff, and that's another way to handle it without subsampling. Okay, let's thank Jason again and move on to the next session. So the next workshop will be presented by Max Libbrecht, who's from the University of Washington, Seattle, and he'll be talking about a similar approach called Segway, which can also essentially provide state maps like ChromHMM's, and other related tools in the Segway family, right? All right, thanks, Anshul, and thanks for having me here to present. I'm going to present today a tutorial on a suite of tools that the Noble lab developed, called Genomedata, Segway, and SegTools. In this tutorial, we're going to assume that we've just performed a bunch of genomics assays on some new cell type, and we've gone through the process of mapping and making signal tracks from those assays, so that we have a bunch of signal data that might be in bedGraph format, where each assay is represented as a real-valued track over the genome. I'm going to show you a pipeline for how you might make sense of these data sets: first, a tool called Genomedata, which is used for storing and compressing genomics data sets; Segway, which is a tool similar to ChromHMM for annotating the genome based on genomics data; and then SegTools, which is a suite of tools for making plots and analyzing any type of genome annotation. And I should say that all of these tools are independent from one another. You can use Genomedata to store your data even if you're not using Segway, and you can use SegTools to analyze any type of genome annotation, not necessarily just Segway or even ChromHMM annotations.
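To make the starting point concrete, a bedGraph line gives an interval and a value, and a signal track is just those values laid out over fixed-width bins. Here's a toy Python sketch of that expansion; the bin size and example lines are my own illustration, not a requirement of any of these tools.

```python
def read_bedgraph(lines, bin_size=100):
    """Toy sketch: expand bedGraph records (chrom, start, end, value) into a
    flat list of per-bin real values, the kind of track these tools work with.
    Assumes the intervals are contiguous and aligned to bin boundaries."""
    values = []
    for line in lines:
        chrom, start, end, value = line.rstrip("\n").split("\t")
        n = (int(end) - int(start)) // bin_size
        values.extend([float(value)] * n)
    return values

# Made-up example: 300 bp of signal as two bedGraph records.
example = ["chr1\t0\t200\t1.5", "chr1\t200\t300\t2.0"]
```

A real pipeline would of course keep track of chromosomes and gaps, but this is the shape of the data the rest of the tutorial assumes.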
These tools were developed on Linux. I don't know if they work on Mac OS; we haven't done much testing there, but they probably do, because the platforms are pretty similar. So we'll assume we have a Linux machine available. If you don't personally have a Linux computer yourself, you can always get one from, for example, Amazon EC2. Then, to install all these tools, you're just going to run these commands on your command line. I'm not going to go through all the commands one by one, but these slides are available, so you can go through them; mostly it's just installing the tools, which are all in Linux package manager packages. The documentation for all these tools is available at these links, for Genomedata, Segway, and SegTools, and again, these slides should be available, so you shouldn't have to write down these links. Okay, so let's start with Genomedata. Genomedata is a tool for storing and compressing genomics data sets. A Genomedata archive is just a big binary file that lives on your file system, and it contains inside it a bunch of genomics tracks represented as real values over the genome. So you might have one track that's GM12878 H3K4 trimethylation, a histone modification, as a track over the genome. And one Genomedata archive can store all of your genomics data sets. The key feature that makes Genomedata really useful is that it supports random access. For any position you want in the genome, you can just ask the Genomedata archive to give you the data associated with that position, and the tool doesn't need to read through the whole data set or load the whole data set into memory in order to answer the query. It also compresses the data. And the way it does that is through a binary format called HDF5 that's built for high-performance floating point storage.
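Genomedata gets its random access through HDF5; as a rough stdlib-only illustration of the principle, here's how seeking into a fixed-width binary file lets you read a slice of a track without reading or loading the whole thing. This is not Genomedata's actual format, just a sketch of why fixed-width binary storage makes positional queries cheap.

```python
import struct

def write_track(path, values):
    """Store a track as fixed-width little-endian doubles, one per genomic bin."""
    with open(path, "wb") as f:
        for v in values:
            f.write(struct.pack("<d", v))

def read_range(path, start, end):
    """Random access: seek straight to the requested bins instead of
    scanning the file from the beginning (the idea behind Genomedata/HDF5)."""
    n = end - start
    with open(path, "rb") as f:
        f.seek(start * 8)  # 8 bytes per double
        data = f.read(n * 8)
    return list(struct.unpack("<%dd" % n, data))
```

HDF5 adds chunking and compression on top of this idea, but the cost of a query still scales with the size of the slice you ask for, not the size of the archive.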
To load your data into Genomedata, the first thing you're going to do is use the genomedata-load-assembly command. You tell it what you want your Genomedata archive to be called, and you give it the genome sequence for your genome assembly; you can get this sequence for the human genome from this link at UCSC. Genomedata also has commands where, if you don't want to download the whole human genome sequence, you can just give it the lengths of the chromosomes. That will generate an empty Genomedata archive for you, and to add your data to it, you do the following process for each track. HDF5 is a little funny in how you have to load the data: there are three steps. You have to open the box, put stuff in the box, and then close the box. So there are three commands. First, genomedata-open-data: you give it the name of your archive and the name you want your track to be called. In this case we're putting in data for H3K4 trimethylation in GM12878, but we can call the track whatever we want; this is just the name we're giving it. Then we run genomedata-load-data and input the bedGraph data that we produced (this is all one line, by the way, it just didn't fit on the screen). And then we run genomedata-close-data on our archive. Once we've done that, we have all our data in the archive. Now, how do we use it for downstream analysis, whether that's running Segway or something else? The first thing we might want to do is query the data on the command line, which you can do with the genomedata-query command. You give it your archive, the track you want, and the coordinates you want, and it will output the data in wig format: just the values at those positions. So that's grabbing some data from your archive. Genomedata also has a Python interface, and it works the same way. To use it, you're going to import the genomedata Python module.
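Conceptually, a query like the one just described expands the stored bedGraph intervals into one value per base over the requested coordinates. Here is a toy sketch of that idea with made-up data and a made-up function name (this is not the genomedata package itself):

```python
# bedGraph assigns one value to each half-open interval [start, end);
# a query returns one value per base over the requested coordinates.
# Records are assumed sorted and non-overlapping.
bedgraph = [
    ("chr1", 0, 3, 2.0),  # positions 0, 1, 2 have value 2.0
    ("chr1", 3, 5, 4.5),  # positions 3, 4 have value 4.5
]

def expand_query(records, chrom, start, end):
    """Return one value per base over [start, end) on chrom."""
    out = []
    for c, s, e, v in records:
        if c != chrom:
            continue
        out.extend(v for _ in range(max(s, start), min(e, end)))
    return out

print(expand_query(bedgraph, "chr1", 1, 4))  # values at positions 1, 2, 3
```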
This is after we're inside Python; you get there just by typing python on your command line. You're going to make this Genome object using this command. It's important to note that this command doesn't load the entire archive's data into memory; it's just opening a connection, the same way you'd open a connection to a file. Then, for whatever coordinates you want (again, we'll take these coordinates on chromosome 1 for this track), we can just ask the Genome object for those coordinates, and it's going to go fetch them. Okay, and a couple of other commands are handy for Genomedata; these are useful if you forget which assembly you defined your archive on or which tracks you have. We can run genomedata-info tracknames on our archive, and it will give us the list of track names. Or we can run genomedata-info contigs data.genomedata, and that will give us the coordinates our archive is defined on. Okay, so I'm going to move on to the second tool, called Segway. Like ChromHMM, Segway is a semi-automated genome annotation algorithm. I'm not going to go into a lot of detail on this because Jason just presented ChromHMM, but just as a recap: a semi-automated genome annotation algorithm takes as input a collection of genomics data sets and produces a segmentation and labeling of the genome such that positions with the same label have similar patterns in the signal data. We call these tools semi-automated because what the algorithm gives you are just integer labels; because it's unsupervised, it doesn't know about things like promoters and enhancers. It's up to a human to interpret that maybe label one actually represents enhancers, maybe label two represents exons, and so on.
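The core idea of that recap, assigning the same integer label to positions with similar signal patterns, can be caricatured in a few lines. This is only an illustration of the labeling concept (grouping identical rounded signal vectors), not Segway's actual probabilistic model; the data and `label_bins` helper are invented:

```python
# Two tracks of signal over four genomic bins. An unsupervised annotator
# gives matching bins the same integer label; here the "pattern" is just
# the rounded signal vector, a crude stand-in for a learned model.
signal = [(5.1, 0.2), (0.1, 3.9), (5.0, 0.3), (0.2, 4.1)]

def label_bins(signal):
    labels, patterns = [], {}
    for values in signal:
        key = tuple(round(v) for v in values)  # crude pattern signature
        if key not in patterns:
            patterns[key] = len(patterns) + 1  # next unused integer label
        labels.append(patterns[key])
    return labels

print(label_bins(signal))  # bins 1 and 3 share a label, as do bins 2 and 4
```

The labels are bare integers; deciding that label 1 means "enhancer" is the manual, human step.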
The best-known tools that do semi-automated genome annotation are HMMSeg, ChromHMM, and Segway. ChromHMM and Segway are pretty similar; the main differences are that ChromHMM uses binarized data while Segway uses the real floating-point data, and that Segway can be run down to one base pair resolution. Running Segway looks a lot like running ChromHMM: it has two steps, a train step and an identify step. The train step goes from the data to a model, and the identify step produces an annotation from that model. So you're going to run these two steps. For the train step, we give as input a Genomedata archive and the output directory we want the results to go to; for the identify step, we give it the train directory and also the identify directory that we want the annotation to be output to. As output we get a file in BED4 format: along the genome, it has a chromosome, a start, an end, and an integer label for each position. Segway is designed to use a compute cluster, and it supports either Grid Engine or Platform LSF; if you have a compute cluster, it's probably running one of these two engines. Segway will automatically determine which type of cluster you're running and automatically use it. But if you want to run Segway without a cluster, you can just set this environment variable in your shell, and that will tell Segway to run on your local computer, which is handy for testing. Now I'm going to go through a bunch of options you can give the Segway program to alter its behavior. The first thing we'll want to do is tell Segway which tracks the annotation should be defined on. By default, Segway will run on everything in your Genomedata archive, which isn't usually what you want, so you can use the option --track=<track name> to tell it which tracks to use.
You can specify this option multiple times for multiple tracks, or you can put a long list of tracks into a file and reference that file with the --tracks-from option; in this case, tracks.txt is a file containing these two track names. You can also change what coordinates you want Segway to run on. By default it runs on the whole genome as defined by the Genomedata archive, but sometimes, especially for testing, you might want to run on just a subset of the genome. You can do that with the --include-coords option, giving it a BED file, which again is just in the format chromosome, start, stop. Then, one thing we've learned in ENCODE is that there are some parts of the genome that just have weird, artifactual behavior when you run ChIP-seq on them. So Anshul has developed blacklist coordinates; they cover on the order of 1% of the genome, and they're just positions with weird, artifactual behavior. It's usually a good idea, any time you're training any kind of model, to leave these coordinates out of your analysis, and you can do that in Segway with the --exclude-coords option; you don't have to remove them from your BED file yourself. You can get those blacklist files from this URL. It also used to be that we would train Segway on a fixed 1% of the genome. That's because there's a lot more data in the genome than you really need for every iteration, and it's inefficient to perform training on the whole genome at each iteration. A new feature we've added is that, instead of training on a fixed 1% of the genome, we use what are called 1% minibatches: Segway trains on a different 1% of the genome at each iteration. That means it still has access to all the data in the genome while keeping iterations fast. Okay, now some parameters you might want to change for your annotation.
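As an aside on the blacklist step above: the effect of excluding blacklist coordinates amounts to interval subtraction over the training regions. Here is an illustrative sketch (not Segway's internal code; the function name and coordinates are invented), with half-open intervals and a sorted, non-overlapping blacklist:

```python
# Subtract sorted, non-overlapping blacklist intervals from one training
# region. All intervals are half-open (start, end).
def subtract_blacklist(region, blacklist):
    start, end = region
    kept, cur = [], start
    for b_start, b_end in blacklist:
        if b_end <= cur or b_start >= end:
            continue  # no overlap with the remaining region
        if b_start > cur:
            kept.append((cur, b_start))  # keep the stretch before the blacklist
        cur = max(cur, b_end)
    if cur < end:
        kept.append((cur, end))
    return kept

# Train on (0, 1000) while skipping two artifact-prone intervals:
print(subtract_blacklist((0, 1000), [(100, 200), (900, 950)]))
```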
The first thing you can change is the number of labels you want Segway to assign. If you give it just two labels, you're going to get an annotation with only two labels; more labels gives a more complex annotation. We've generally run Segway with on the order of 10 to 20 labels; anything between maybe 4 and 50 is a good number. Segway uses the expectation-maximization algorithm, which is not guaranteed to find the global optimum, so you can specify how many times you want Segway to start from a new random initialization: it will start from multiple initializations and pick the best one, and this parameter determines how many it tries. We usually use 10. You can also specify the maximum number of training iterations: the smaller the number, the faster training will go, but you might get a worse model out. Okay, you can also control the average lengths of the segments that you'll get out of Segway. Depending on what you're interested in looking at in your data, you might want segments at different scales. If you're looking at things like promoters and enhancers, you're probably looking at segments on the order of 1,000 base pairs, 1 kb. If you really want to dig down into the structure of each promoter, maybe finding where the transcription factors are binding and where the nucleosome-free regions are, you might be interested in much smaller segments, all the way down to maybe 10 or 20 bases. You can control the segment lengths in Segway in three ways. One is to change the resolution that Segway runs at. Segway supports anything down to one base pair resolution; we've used resolutions up to 10,000 bases when looking not at things like promoters and enhancers but at large domains on the scale of a megabase. The higher this number is, the faster your training is going to go.
But of course, a coarser resolution downsamples the data, so you're losing some information by doing that. Second, you can put a prior on having long segments; that prior is just a number, and higher values will on average give you longer segments. Third, you can change the weight that the model puts on the transition part of the model relative to the emission part. Putting more weight on the transition part tends to give you longer segments, so again, you can increase this to increase the segment length; usually you want this to be about the number of tracks. If you run your annotation and find your segments are too short for your liking, you can play with these parameters to increase or decrease your segment lengths. Okay, so that was how to run Segway. Now I'm going to present Segtools, which is a suite of commands for analyzing annotations. Like I said, this can be used for any type of genome annotation: Segway or ChromHMM annotations, or, if you've produced your own annotation of the genome, you can use Segtools as a really easy way to make plots and analyze it. Segtools has a bunch of different commands; I'm just going to go through three of them. segtools-signal-distribution is a command that measures the relationships between annotation labels and signal tracks. You give it an annotation, which will again be in that BED4 format that Segway uses, and a collection of genomics data sets in Genomedata format, and it will produce a plot like this, with labels on the horizontal axis, tracks on the vertical axis, and the color indicating the strength of association between a given track and a given label. In this case (I just made these numbers up), this label might be associated with H3K27 acetylation and not associated with H3K27 trimethylation.
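A toy version of the label-versus-track summary just described might compute one association statistic per (label, track) pair; here that statistic is simply the mean signal over the bases in each label's segments. The segments, signal values, and `label_track_means` helper are made up for illustration, and this is a conceptual sketch rather than Segtools' actual computation:

```python
# BED4-style segments: (chromosome, start, end, integer label),
# plus one signal value per base for each track.
segments = [("chr1", 0, 3, 1), ("chr1", 3, 5, 2)]
signal = {"H3K27ac": [5.0, 6.0, 7.0, 0.0, 1.0]}

def label_track_means(segments, signal):
    """Mean signal per (label, track) pair -- one cell of the heatmap each."""
    means = {}
    for track, values in signal.items():
        totals, counts = {}, {}
        for chrom, start, end, label in segments:
            for v in values[start:end]:
                totals[label] = totals.get(label, 0.0) + v
                counts[label] = counts.get(label, 0) + 1
        for label, total in totals.items():
            means[(label, track)] = total / counts[label]
    return means

print(label_track_means(segments, signal))
```

With these made-up numbers, label 1 would plot as strongly associated with H3K27ac and label 2 as weakly associated.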
Another command, segtools-length-distribution, measures the segment lengths of an annotation. You give it an annotation, and it produces two types of plots. One shows, for each label, what fraction of the genome that label covers, both as a fraction of bases and as a fraction of segments; it looks like this. The other shows the distribution of lengths of segments with each label. These are violin plots, which are kind of like a histogram of the segment lengths: you can see this label looks like it has small segments, on the order of a couple hundred bases, whereas this label has segments closer to a thousand bases. And finally, segtools-aggregation measures the association between two types of genome annotations. One is a labeled annotation that you might get out of Segway or ChromHMM. The other can be one of three types: a region annotation, which might be annotated enhancers, for example; a point annotation, which might be motif positions, transcription factor binding sites, anything like that; or a gene annotation. The first two it will accept in BED format; the gene annotation is in GFF format. Here I'm showing you an example where I run segtools-aggregation in gene mode and give it this GFF file, and here's the plot it gives me. It has different compartments for different positions relative to a gene, starting with the upstream region, then the first exon, the first intron, the middle exons and introns, the last exon and intron, and then the downstream region. The vertical axis in these plots indicates the strength of enrichment or depletion. So, for example, this label one is enriched in the upstream and first-exon regions, so it looks like it's some sort of regulatory element.
Whereas this one, label seven, looks like it's depleted all around genes, meaning it's probably some sort of repressive label. So segtools-aggregation will give you associations between different types of annotations. Okay, so those were the three tools. I'm happy to take any questions. Thanks for your attention. Can we run some customized data on our local machines? So the question is whether you can run these tools on your own custom data, for example by calling some Python modules from other machines to generate plots, or whether it's all bundled into one package where you have to send the data into our pipeline. No, these are just packages you can install on your computer, and all you have to do is run the command line and it'll run. And like I said, for Segtools, for example, it doesn't have to be a Segway annotation for it to be accepted by Segtools. To run Segway, do you have a minimum number of samples or a minimum number of marks? The question is whether there's a minimum number of samples or marks to run Segway. Segway, by default, runs on just one sample. It also supports a concatenated mode like ChromHMM does, but you can run it on just one sample, and in the code there's no minimum on the number of tracks; you can give it just one track. But obviously the annotation is going to be more interesting the more tracks you give it. Okay, so with ChromHMM you can run multiple tracks together, right? Say again? With ChromHMM, you basically run multiple marks together to get the chromatin state. Does Segway only run one mark at a time? No, one sample, meaning one cell type. You can either do one cell type or you can input multiple cell types; that's true for both Segway and ChromHMM. And with both Segway and ChromHMM, you can choose how many different marks you want to give it. You may have mentioned this.
Which was the step at which you actually did the manual annotation of the regions, of the elements? Right, so the question is where the manual annotation happens. Segway will give you that file in BED4 format with an integer label. There's a command in Segtools that I didn't tell you about, called segtools-relabel, where you give it a mapping from integer labels to what we call mnemonics, which are just your names for those labels. So you might want to rename label one to "promoter," and you can run your annotation through segtools-relabel to change the labels. Also, all the Segtools commands support a mnemonic file, so you don't need to rename the labels in the file itself; they will do the renaming for you before making the plot. But of course you have to do that interpretation manually and produce the mnemonic file yourself. Do you have a thought process for when this approach makes the most sense versus when it makes sense to binarize the data and use an HMM? Like, what kinds of data should be treated in each way? I generally think that you're usually losing some information by binarizing the data, but maybe Jason will tell you differently. But you probably also run into a set of issues around normalizing ChIP-seq if you're going to treat it as real-valued, right? Right, there are some challenges there too. So the question is what real value we use to represent the data. We use the fold enrichment transformed with an inverse hyperbolic sine transform; inverse hyperbolic sine is similar to a log transform. Okay, let's thank Max again.
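As a footnote on that last answer: the inverse hyperbolic sine compresses large fold-enrichment values much like a log transform does, but unlike log it is defined at zero, so zero-signal positions need no pseudocount. A quick numerical sketch (the example values are made up):

```python
import math

# asinh(0) is exactly 0, while log(0) is undefined;
# for large x, asinh(x) is approximately ln(2x), so it behaves like a log.
fold_enrichment = [0.0, 1.0, 10.0, 100.0]
transformed = [math.asinh(x) for x in fold_enrichment]
print(transformed)
```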