I'm going to first give an overview of chromatin states and ChromHMM, then give more specific information on how to access existing chromatin state annotations, which for many purposes would be sufficient, and then also cover how you could run ChromHMM on new data that you might have generated and be interested in working with. So ChromHMM leverages largely genome-wide maps of histone modifications, but it can also take other types of data such as open chromatin, and this data is typically mapped genome-wide based on ChIP-seq experiments. There are dozens of different histone modifications; five of these have become the core histone modifications of the Roadmap Epigenomics project, and other marks have been mapped in smaller numbers of cell types. What we're interested in doing with ChromHMM is taking potentially many different maps of genome-wide signal and converting them into chromatin state annotations that leverage the combinatorial nature of the data and the spatial information to give a single annotation to each location in the genome. These annotations could correspond to candidate enhancer regions, transcribed regions, or promoters and gene starts. The underlying assumption is that there are these different classes of elements that we don't get to observe directly. What we do observe, after our pre-processing, is a binary presence or absence call for each of the input tracks: whether we had enough reads supporting its presence at a specific location. Then we make a modeling assumption that at each location we're in one of a finite number of hidden states, and associated with each of these hidden states are different probabilities of observing the different modifications.
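To make the pre-processing step concrete, here is a simplified sketch of the binarization idea: divide the genome into 200 bp bins, count reads per bin, and call a mark "present" when the count is unlikely under a background model. ChromHMM's actual binarization uses a Poisson threshold and can incorporate control data; this toy version assumes a uniform Poisson background, and the counts are made-up illustration data.

```python
# Toy binarization: presence/absence calls per 200 bp bin.
# The uniform-background assumption is a simplification of what
# ChromHMM's BinarizeBed actually does.
from math import exp, factorial

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    return 1.0 - sum(exp(-lam) * lam**i / factorial(i) for i in range(k))

def binarize(bin_counts, p_threshold=1e-4):
    # Estimate a background rate per bin from the data itself.
    lam = sum(bin_counts) / len(bin_counts)
    return [1 if poisson_sf(c, lam) < p_threshold else 0 for c in bin_counts]

counts = [1, 0, 2, 1, 30, 28, 1, 0]  # reads per 200 bp bin (toy data)
print(binarize(counts))
```

The two high-count bins stand out against the background and become presence calls; everything else is absence.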
So for example, if we're in state one, we would have a high probability of observing some marks and lower probabilities of the other marks. Formally this is based on a multivariate hidden Markov model, which also incorporates spatial information, so we could differentiate, for example, between two locations with identical observations based on the neighborhood information. And all of these probabilities are learned de novo from the data itself. To give an example: in the context of the ENCODE 2 project, in collaboration with Brad Bernstein's ENCODE production group, a key set of histone modifications was mapped across nine diverse cell types. So this is looking at all the raw data at one location. What we did was apply the modeling approach across these cell types, treating additional cell types as if they were additional chromosomes. So we learned one common set of state definitions across the cell types, but we could have cell-type-specific assignments. In what you see here, the rows correspond to different states, the columns correspond to different marks, and the values correspond to the probability of observing that mark in that state. Once we learn these models, we can assign each location in the genome to one of these states in each cell type. Then we can overlap this with other types of information, such as locations of transcription factor binding, evolutionarily conserved elements, and gene annotations, and start giving candidate annotations to these signatures. So we could identify active promoter regions, which we often color in red; enhancer regions in orange; weaker enhancer regions in yellow; insulator regions; and transcribed, repressed, and heterochromatic regions. This is another view: here we've taken the same gene in four different cell types, and this is the raw input data.
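The emission side of the model described above can be illustrated with a small sketch: each hidden state carries a probability of observing each mark, and the likelihood of a binary observation vector given a state is a product of Bernoulli terms. The two states, their names, and all the probabilities below are invented for illustration, not taken from a real learned model.

```python
# Toy emission model: P(mark present | state), one row per hidden state.
# States and probabilities are hypothetical.
emission = {
    "promoter-like": {"H3K4me3": 0.9, "H3K27ac": 0.8, "H3K36me3": 0.05},
    "transcribed-like": {"H3K4me3": 0.1, "H3K27ac": 0.2, "H3K36me3": 0.9},
}

def obs_likelihood(state, observed):
    """P(observation vector | state) for binary presence/absence calls."""
    p = 1.0
    for mark, prob in emission[state].items():
        p *= prob if observed[mark] else (1.0 - prob)
    return p

# A bin with promoter-associated marks present strongly favors the
# promoter-like state over the transcribed-like state.
obs = {"H3K4me3": 1, "H3K27ac": 1, "H3K36me3": 0}
for state in emission:
    print(state, round(obs_likelihood(state, obs), 4))
```

The full model additionally uses transition probabilities between states, which is how the neighborhood information enters; this sketch shows only the emission part.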
And what we've done is summarize all that data into one chromatin state annotation, color coded, for each cell type. So somebody could go into the genome browser and quickly scan across some location to get a sense of which cell types it's active in, across the entire genome, including the vast intergenic regions. So how do we actually go about accessing these existing chromatin state annotations? I'm going to focus now on the set that was produced as part of the recent Roadmap Epigenomics publication, though I'll note that 16 of these 127 cell types were based on data generated by the ENCODE project. One issue is that we need a common set of marks across the cell types if we don't want to confound downstream analyses. So as part of the Roadmap effort, we took the five marks that were common across all the cell types and learned a 15-state chromatin state model, giving chromatin state annotations across all 127 cell types. But that ignored all the other modifications. So we learned another model after adding H3K27 acetylation, which is another key mark of active elements, but that model was only defined across 98 of the cell types. We also have an alternative strategy based on first imputing data sets using a new method, ChromImpute, which I developed and won't have time to go into the details of, but if you're interested, it was published earlier this year. What it allows you to do is make predictions of what one of these experiments should look like, leveraging all the other information we have, including how that mark looks in other cell types and the other marks that you have in the cell type you're interested in. We learned a chromatin state model based on the 12 marks for which we had enough data to be most confident in the imputation, and that gives a chromatin state model defining 25 states based on the imputed data for these 12 marks, providing a richer annotation.
This is a screenshot of one location across the 127 cell types for these 25 states. And this is a screenshot of the labels of these states, available on the UCSC genome browser, summarizing the three different models: the inputs they're defined on, and the candidate annotations we've associated with these states. All of this data can be accessed in raw form from the Roadmap and ENCODE integrative analysis portal; this is the URL. If you go to this portal, there'll be a tab, chromatin state annotations, and you can click on one of these three tabs to access the corresponding model. Then you can click here, and click to visualize these chromatin states, or follow the links to download the raw data files if you're a computational person and want to write your own scripts. Alternatively, this can be accessed through the UCSC genome browser. It's important to note that right now it's for hg19, which until very recently was the default, but is no longer the default assembly on the browser. To actually access it, you have to go through a couple of steps. It's available through the track hubs, so you click this button, then you click here for the Roadmap integrative analysis and click Connect. I just want to highlight that this is different than the Roadmap Epigenomics Data Complete Collection; that one is if you want to access the raw unprocessed data, while the more processed integrative analysis is through this separate hub. Once you click Connect, you'll get a series of menus now available in your browser, and this consolidated menu here is where you would want to focus to access the ChromHMM annotations. Once you click on that, ChromHMM will be the first thing on the menu.
Then you would want to make sure both of these boxes are checked so you have access to either the directly observed data or the imputed data. And there are options here to decide if you want to work off of the primary core five-mark model; the auxiliary model, which also includes H3K27 acetylation but is not defined across all the cell types; or the imputed model, defined based on the imputed data for 12 marks. Once you've accepted that, it'll load into the UCSC genome browser, and you have access to all the chromatin state annotations in the browser. There's also an alternative way of accessing them through the epigenome browser at Washington University; I'll just give the URL and not walk you through how to load it, but there's information at that site. So now to the ChromHMM software. This is the URL if you're interested in downloading ChromHMM; this link at the top of the website is the software download, and another important link on this site is the software manual, which has all the details about the various commands. If you want to try running ChromHMM and would like to launch it while I'm talking, these are the steps you would need to do. You can download it from the website; this is the full URL to the software. Java needs to be already installed on your computer. You would unzip ChromHMM, then open a command line (on Windows, for example, you type cmd from the start menu). Then you would change into the ChromHMM directory, so you cd to the path where the ChromHMM jar file is sitting, and then you would enter this command. I won't read it out loud, but you can see it would launch ChromHMM on the sample data, and I'll take you through those steps in a minute. It takes about five minutes to run.
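The launch steps above can be sketched as a small Python wrapper that builds the command line. The jar name and the sample-data arguments below (SAMPLEDATA_HG18, OUTPUTSAMPLE, 10 states, hg18) follow the pattern of the sample run in the ChromHMM manual, but you should check them against the command shown on the slide and in your own download.

```python
# Sketch: build the java command line for ChromHMM's sample run.
# Argument values are illustrative; verify against the ChromHMM manual.
import shlex

def sample_run_command(jar="ChromHMM.jar", memory_mb=1600):
    """Return the sample LearnModel invocation as an argument list."""
    cmd = (
        f"java -mx{memory_mb}M -jar {jar} "
        "LearnModel SAMPLEDATA_HG18 OUTPUTSAMPLE 10 hg18"
    )
    return shlex.split(cmd)

# To actually launch it from the ChromHMM directory you would pass
# this list to subprocess.run().
print(sample_run_command())
```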
So if you're running it right now, by the time I start talking about the output, hopefully it'll be there. What's the input to ChromHMM? The actual model learning is based on binarized data, at 200 base pair resolution by default. The recommended way of obtaining this is directly from the locations of aligned reads, and there are two commands for that: if your reads are in BED format, you can use the command BinarizeBed, and in the most recent version of ChromHMM I've added another command, BinarizeBam, which does the same thing but can directly read BAM files, so there's no need to first convert your BAM files to BED files. Stepping through the BinarizeBed command: the first part is information for Java, telling it how much memory to allocate, and then -jar ChromHMM.jar tells Java which file to use. The next argument tells ChromHMM which of its several commands to use; here it's BinarizeBed. Then there's the argument specifying the chromosome length file, since ChromHMM needs to know the length of each chromosome. Files for most standard assemblies are already included with the download, but if you're using a non-standard assembly, you'll need to provide that yourself. Then you specify the directory that contains the BED files, and then a design file, in which the first column specifies the cell or tissue type of each file, the second which mark it is, and the third the name of the file containing the reads corresponding to that; optionally you can specify a corresponding control dataset. The output directory is the last thing. Now, the actual learn model command has the same beginning, and then the LearnModel ChromHMM command.
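The design file and command described above can be sketched as follows. The column layout (cell, mark, reads file, optional control file) is the one just described; the file names, memory setting, and directory names are hypothetical placeholders.

```python
# Sketch: prepare a BinarizeBed design file and command line.
# File and directory names are hypothetical.
import shlex

design_rows = [
    ("K562", "H3K4me3", "K562_H3K4me3.bed", "K562_control.bed"),
    ("K562", "H3K27ac", "K562_H3K27ac.bed", "K562_control.bed"),
]

def write_design_file(rows, path="cellmarkfiletable.txt"):
    # One tab-separated line per input track: cell, mark, reads, control.
    with open(path, "w") as fh:
        for row in rows:
            fh.write("\t".join(row) + "\n")

def binarize_bed_command(chromlen, beddir, designfile, outdir):
    # Same argument order as walked through above.
    return shlex.split(
        f"java -mx1600M -jar ChromHMM.jar BinarizeBed "
        f"{chromlen} {beddir} {designfile} {outdir}"
    )

cmd = binarize_bed_command("CHROMSIZES/hg19.txt", "beds",
                           "cellmarkfiletable.txt", "binarized")
print(cmd)
```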
There is one non-default parameter that you might want to consider, which I mentioned before: -p 0. What that allows ChromHMM to do is use more than one processor, so it can run more efficiently if you have a multi-processor machine; if it's not specified, by default it will only take one processor. Then you specify the directory which has the binarized input, which would be the output from the BinarizeBed command; then the directory where you want the output report from ChromHMM to go; then the number of states you want ChromHMM to learn. Oftentimes in practice you want to learn models across a range of numbers of states and then use the level of biological interpretability to guide which one to analyze in depth. Then you can specify the genome assembly you're working with, and ChromHMM will use that for doing automated enrichments. You'll get a ChromHMM report when it's done, and if you've been running it on your own laptop, hopefully something like this will have opened in your browser soon. To take you through what's contained in it: at the top you have a record of the options and inputs you used. Then you have the emission parameter matrix, and for each of these figures you also have the option to download it in SVG format or as a tab-delimited text file, so you can get the numbers behind it. Then you have the transition parameters, which are the other part of the HMM that helps define the model. Then there's a model file, which is not designed to be human readable but is for ChromHMM to do additional work with that model if you want to reuse it later. And then there's a link to the segmentation files. These are four-column BED files saying, for each coordinate range, what state it's in, along with file formats designed to be friendly to the UCSC genome browser, IGV, or other genome browsers.
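Reading a segmentation file of the kind just described is straightforward; here is a minimal sketch assuming the four-column BED layout (chromosome, start, end, state label). The example lines are made up for illustration.

```python
# Sketch: parse a ChromHMM segmentation file (four-column BED).
def read_segmentation(lines):
    """Yield (chrom, start, end, state) tuples from BED lines."""
    for line in lines:
        chrom, start, end, state = line.rstrip("\n").split("\t")
        yield chrom, int(start), int(end), state

example = [
    "chr1\t0\t10000\tE9",
    "chr1\t10000\t10600\tE1",
]
segments = list(read_segmentation(example))
print(segments[1])  # ('chr1', 10000, 10600, 'E1')
```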
So one can load these into the browser. The report also shows you automated enrichments for certain preloaded annotation files, but there are options to provide your own annotations, either initially or, once you have the state annotations, with other commands to compute additional enrichments. You can also see positional enrichments, meaning the enrichment relative to some position such as transcription start sites. That was all for one of the cell types; as you continue scrolling, you'll see the same information for the second cell type and for however many additional cell types you have. I'll also point out that we've used these chromatin state annotations previously to show how they can be used to interpret disease-associated variation, as in other recent work such as that of Melina Claussnitzer and Manolis Kellis, where they actually used the ChromImpute data and ChromHMM on top of it to dissect the FTO locus, as well as work interpreting epigenetic variation associated with disease. And there are many other examples in the literature of other researchers either using the existing annotations or applying ChromHMM to their own data. So that concludes this part, and I'd like to acknowledge Manolis Kellis and all of the ENCODE and Roadmap Epigenomics consortia, which provided this data. I'll just point out, though I won't go through them, that many other ChromHMM commands are described in the manual, and they're included in the slides if you download them. So that's available.
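The overlap enrichments mentioned above boil down to a fold-enrichment idea: how much more often a state overlaps an annotation than expected if the state were placed uniformly at random across the genome. Here is a toy sketch of that calculation; all the numbers are illustrative, not from a real ChromHMM run, and this simplification ignores details of how ChromHMM counts partial overlaps.

```python
# Toy fold enrichment: observed overlap vs. uniform-placement expectation,
# with everything counted in 200 bp bins. Numbers are made up.
def fold_enrichment(state_bins, annot_bins, overlap_bins, total_bins):
    # Expected overlapping bins if state and annotation were independent.
    expected = (state_bins / total_bins) * (annot_bins / total_bins) * total_bins
    return overlap_bins / expected

# e.g. a promoter-like state covering 1% of bins, TSS-proximal bins
# covering 2%, with 400 bins of overlap:
enr = fold_enrichment(state_bins=1_000, annot_bins=2_000,
                      overlap_bins=400, total_bins=100_000)
print(round(enr, 1))  # 20.0
```

A strong enrichment like this, relative to annotations such as transcription start sites, is what supports the candidate biological labels given to the states.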