Well, thank you very much. Let me discuss these projects today. I'm the current programmer on these two things. This is sort of the primordial ENCODE encyclopedia, and Factorbook is a product of the lab from some papers written back in 2012; I've been working on updating the code base. So let's start with the encyclopedia. Louder? OK, I'll talk that way then. ENCODE has been tremendously successful at getting all of these human experimental data sets up. We have hundreds and hundreds of data sets across several different assays. But the first three letters of ENCODE stand for encyclopedia, and we don't quite have that yet. This is becoming an increasing concern, and we've been discussing it at some of the ENCODE consortium meetings and on some of the phone calls. So we'd like to integrate these data together, actually figure out the functional elements, and build and visualize the encyclopedia. In my mind, I see the visualization someday as Google Maps for the genome: a very far zoomed-out view with long-range interaction information, Hi-C 3D contact information, and then you can zoom all the way down to a single base pair and look up SNP data and so on. I don't know how to do that yet, but we can start with something a little simpler. For functional annotation, there's a very nice analogy made in a paper co-authored by Snyder and Gerstein: you can think of building functional annotations as doing signal analysis. You start with a raw signal, you smooth it, you normalize it, you threshold it, and then you can start segmenting it and building up pieces with higher and higher amounts of information and data. Right now on the ENCODE Project site, there are already a few different annotations available, at different levels of complexity. You could just look at transcription start sites.
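The smooth-normalize-threshold-segment idea from that analogy could be sketched roughly like this. This is purely illustrative: all function names, the window size, and the threshold are made up for this example and are not the actual ENCODE pipeline code.

```python
# Toy "annotation as signal analysis": smooth a raw signal, normalize it to
# [0, 1], then segment the above-threshold runs into candidate elements.
# All parameters here are illustrative, not from the real ENCODE pipelines.

def smooth(signal, window=3):
    """Moving-average smoothing with a centered window."""
    half = window // 2
    return [
        sum(signal[max(0, i - half):i + half + 1]) /
        len(signal[max(0, i - half):i + half + 1])
        for i in range(len(signal))
    ]

def normalize(signal):
    """Scale the signal to the [0, 1] range."""
    lo, hi = min(signal), max(signal)
    span = (hi - lo) or 1.0
    return [(x - lo) / span for x in signal]

def segment(signal, threshold=0.5):
    """Return (start, end) intervals where the signal exceeds the threshold."""
    intervals, start = [], None
    for i, x in enumerate(signal):
        if x >= threshold and start is None:
            start = i
        elif x < threshold and start is not None:
            intervals.append((start, i))
            start = None
    if start is not None:
        intervals.append((start, len(signal)))
    return intervals

raw = [0, 1, 8, 9, 7, 1, 0, 0, 6, 8, 9, 2, 0]
segments = segment(normalize(smooth(raw)))
```

Each stage feeds the next, so the final segments carry progressively more processing, which is exactly the layering the analogy describes.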
You could just look at the uniformly processed peaks that you can all now produce from the last session. You can look at Hi-C contact information. Luca will talk about using hidden Markov models and ChromHMM to actually do the segmentation, and you can do dynamic Bayesian nets. Today, though, I'm going to talk about making an annotation for candidate enhancers and promoters. This is going to be based on DNase and histone marks, both of which give information about chromatin structure. DNase peaks may include regulatory elements besides enhancers and promoters, but a DNase peak that has an enrichment of, for instance, H3K27ac has a pretty good likelihood of actually being an enhancer. So we're going to build up an annotation on DNase and histone marks and hopefully display it in the browser. Based both on data availability and on many of the factors hinted at this morning, we're using four different histone marks: H3K4me3 to annotate promoters; H3K9ac for promoters and enhancers; and H3K27ac for enhancers, as well as H3K4me1. To start, we had the Stam lab take all the DNase-seq experiments available from ENCODE and Roadmap and merge them together. This gives us a unified view of chromatin accessibility across all the cell types we have. This includes all the Stam and Crawford DNase data sets and all the biological and technical replicates. But we didn't just do a bedtools merge on all these data sets; we did something a little different. We take the DNase data sets and look at regions where peaks overlap. So as illustrated here, we have n different cell types, and some of these cell types have no peaks, while others have more or less overlapping DNase peaks. The diagram is perhaps a bit oversimplified. You can merge these into a pile that looks like this.
Instead of taking the entire pile as one region, what we call a DNase hypersensitive region, we're going to select something we call the master peak: the peak with the strongest signal. That peak becomes the representative peak for that region, and we discard the rest of the peaks in that area. The weak peaks shall not inherit the region. These master peaks give us a few advantages. We get a set of unique, non-overlapping peaks; we get a representative peak for each area; and we annotate that peak with all the cell types involved in that region. These peaks can span all the data sets, and when you merge them all together, they cover about 20% of the genome. From this set of master peaks, we can then divvy them up into transcription start site (TSS) proximal and TSS distal peaks. TSS-proximal peaks are within 2 kb of a GENCODE TSS annotation. So now we can actually start annotating candidate promoters. For us, a candidate promoter is two things: one, a TSS-proximal master peak; and two, annotated with a TF binding site. For now, we use all the TF ChIP-seq peaks, either gappedPeak or narrowPeak, from ENCODE. We did a humongous bedtools merge and built an annotation track. These peaks are also TSS proximal. For candidate enhancers, we do something similar with one extra step. For enhancers, we use the distal peaks, as I alluded to before, and we annotate them with TF peaks. But we also look at H3K27ac enrichment. Given the master peak, we go into the H3K27ac signal file, look at a 1 kb region around it, and compute the percentile of that signal over background, the background being randomly chosen segments of that signal track, ignoring locations with DNase peaks and ENCODE blacklist regions.
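The master-peak selection and the proximal/distal split just described could be sketched like this. The peak tuples, data structures, and toy coordinates are invented for illustration; only the strongest-peak-wins rule, the cell-type annotation, and the 2 kb TSS cutoff come from the talk.

```python
# Sketch of "master peak" selection: within each pile of overlapping DNase
# peaks, keep only the peak with the strongest signal, annotated with every
# cell type contributing to the region, then split master peaks into
# TSS-proximal (within 2 kb of a TSS) and TSS-distal sets.

def pick_master_peaks(peaks):
    """peaks: list of (chrom, start, end, signal, cell_type), assumed sorted
    by (chrom, start). Returns one master peak per overlapping pile,
    annotated with all cell types present in the pile."""
    masters = []
    pile, pile_end, pile_chrom = [], None, None

    def flush():
        if pile:
            best = max(pile, key=lambda p: p[3])          # strongest signal wins
            cell_types = sorted({p[4] for p in pile})     # annotate with all cell types
            masters.append((best[0], best[1], best[2], best[3], cell_types))

    for p in peaks:
        chrom, start, end = p[0], p[1], p[2]
        if pile and chrom == pile_chrom and start < pile_end:
            pile.append(p)                                # still overlapping the pile
            pile_end = max(pile_end, end)
        else:
            flush()
            pile, pile_chrom, pile_end = [p], chrom, end
    flush()
    return masters

def split_by_tss(masters, tss_positions, cutoff=2000):
    """Split master peaks into TSS-proximal and TSS-distal sets."""
    proximal, distal = [], []
    for m in masters:
        mid = (m[1] + m[2]) // 2
        near = any(abs(mid - t) <= cutoff for t in tss_positions.get(m[0], []))
        (proximal if near else distal).append(m)
    return proximal, distal

peaks = [
    ("chr1", 100, 300, 5.0, "K562"),
    ("chr1", 250, 500, 9.0, "GM12878"),
    ("chr1", 10000, 10200, 4.0, "K562"),
]
masters = pick_master_peaks(peaks)
proximal, distal = split_by_tss(masters, {"chr1": [400]})
```

Here the first two peaks overlap, so the GM12878 peak (signal 9.0) becomes the master for that pile but carries both cell types, which is the key difference from a plain bedtools merge.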
So if this H3K27ac signal is in the 95th percentile, we call that a candidate enhancer. As a bit of justification, we can do just what I described on a single cell line. For instance, in GM12878 we can look at the enrichment of TSS-proximal DNase peaks in H3K27ac peaks. And we find that, yes, the actual DNase peaks have very strong enrichment. The same happens for the distal peaks. Now, how well do the master peaks fare? You can take the H3K27ac peaks for all the different cell types we're using and intersect those with the master peaks for each particular cell type. For instance, we may have pancreas here: we take the gappedPeak file for H3K27ac and intersect it with the master peaks that have the pancreas cell type in them. And we see that more than three quarters of the H3K27ac peaks also intersect one of our DNase master peaks. Swapping that around, we can look at the master peaks for each cell type and see what percentage of those peaks are covered by H3K27ac, and that percentage is a good bit lower; there are other things going on with those peaks, as we alluded to before. So that's how we actually build these tracks: a bunch of Python and shell scripts, AWK, and bedtools. As Mike and Yuri mentioned, you can get the tracks from the ENCODE Project site under annotations. You can go to the genome browser right now if you want, or to the WashU browser, and look through them for your favorite region. You can also directly download all the tracks we produced. These are bigBed files, and Yuri has committed them into the ENCODE Project portal, so they have unique accession numbers. That means I cannot change these out from under your feet; you can always go back and get the exact file you were using. These are the links typed out.
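The percentile test for candidate enhancers could be sketched as follows. The signal arrays and background sampling here are toy stand-ins for the real bigWig tracks, and the exclusion of DNase peaks and blacklisted regions from the background is assumed to happen upstream.

```python
# Hedged sketch of the enhancer call: take the H3K27ac signal in a 1 kb
# window around a distal master peak and ask whether it falls at or above
# the 95th percentile of randomly chosen 1 kb background segments from the
# same signal track.

import random

def window_signal(track, center, width=1000):
    """Mean signal over a width-bp window centered on `center`."""
    half = width // 2
    lo, hi = max(0, center - half), min(len(track), center + half)
    window = track[lo:hi]
    return sum(window) / len(window)

def percentile_rank(value, background):
    """Fraction of background values that `value` exceeds."""
    below = sum(1 for b in background if b < value)
    return below / len(background)

def is_candidate_enhancer(track, peak_center, background_centers, cutoff=0.95):
    value = window_signal(track, peak_center)
    background = [window_signal(track, c) for c in background_centers]
    return percentile_rank(value, background) >= cutoff

# Toy example: strong signal around position 5000, flat background elsewhere.
random.seed(0)
track = [0.1] * 10000
for i in range(4500, 5500):
    track[i] = 10.0
background_centers = random.sample(range(500, 4000), 100)
called = is_candidate_enhancer(track, 5000, background_centers)
```

A peak sitting on the strong-signal region clears the 95th-percentile cutoff, while one placed in the flat background does not.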
As mentioned yesterday, I prefer using the track hub because it's super easy to generate, and you can look in there and see how to colorize a track or fiddle with it a little. We also produce tracks for WashU called hammock tracks. These tracks are designed specifically for annotation: you can embed JSON data into the track file and give the user a little more information. There's more information about the hammock track here, and I'll have a little demonstration shortly. So this is what it looks like in the UCSC Genome Browser. If you click on one of the annotations, for instance on a DNase peak, you can get all the cell types involved in that region, just as promised before. As a slight note, when using these tracks you may want to enable these other tracks in UCSC; they can be quite helpful for figuring out what is actually going on in the region, if you haven't already done this. Now for WashU, this is what the annotation tracks look like, and you can click on each peak. As in the UCSC browser, you can get the list of cell types; you just don't have to open a separate window. The next steps here are that I have to open source it and finish cleaning up the code a bit more; I need to generate this kind of primordial encyclopedia for mouse annotations; and then we need to figure out what other data to add. Changing gears slightly, back to Factorbook. How many people here have actually used Factorbook? One, three, five. OK. Unlike the last 10 or 15 minutes of my discussion, this is not a track-based approach to looking at ENCODE data. Here we're looking at data summarized per TF, which isn't easy to show in the genome browser. This is all about TF ChIP-seq data from ENCODE, and we're trying to pull together a bunch of useful analyses that people can look up.
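To give a feel for why track hubs are "super easy to generate": each track in a hub is just a small plain-text trackDb stanza. Here's a minimal generator; the track name, labels, and URL are placeholders I made up, not the actual ENCODE annotation files.

```python
# Render one UCSC trackDb.txt stanza for a bigBed track. The field names
# (track, type, bigDataUrl, shortLabel, longLabel, color, visibility) are
# standard trackDb settings; the values below are hypothetical.

def trackdb_stanza(name, url, short_label, long_label, color="255,0,0"):
    """Build a minimal trackDb stanza as a string."""
    return (
        f"track {name}\n"
        f"type bigBed\n"
        f"bigDataUrl {url}\n"
        f"shortLabel {short_label}\n"
        f"longLabel {long_label}\n"
        f"color {color}\n"
        f"visibility dense\n"
    )

stanza = trackdb_stanza(
    "candidatePromoters",
    "https://example.org/annotations/candidate_promoters.bigBed",
    "Cand. promoters",
    "Candidate promoters from DNase master peaks + H3K4me3",
)
```

Concatenating a stanza like this per track, plus tiny `hub.txt` and `genomes.txt` files, is all a hub needs, which also makes it easy to tweak colors or visibility per track.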
There are two different sites now for Factorbook. One is the original, which is a bit aged now, and this is the one I've reimplemented. On the main page, you get a matrix of cell types versus TFs; the number at each intersection is the number of ChIP-seq experiments from ENCODE that we have. Right now you can click on the TF; at some point you'll be able to click on the cell type as well. Then you get a page with a bunch of information. One part is about the function of the TF, which we have distilled from a number of public websites. When possible, we have the protein family, consensus sequence, and multilevel consensus sequence, and if there's a PDB image, we try to pull that in for you. In addition, we have these average histone profiles. We take a 2 kb window around every peak summit and compute the average histone signal over it. We split these apart, much like in the encyclopedia, into proximal and distal sites. So you can get, for instance, proximal H3K4me2 here, or distal H3K4me3, and those are color coded individually. This is a little interactive JavaScript chart you can use to explore the data a bit. We also provide profiles of average nucleosome positioning, derived from MNase-seq data for GM12878 and K562, again split into proximal and distal. Proximal is within 1 kb of a TSS; distal is everything else. We also run MEME-ChIP on the top 500 peaks and pull out the top five logos, with a little information for each. For instance, this is the top motif: out of the top 500 ChIP-seq peaks, 484 had this sequence; this is the E-value from MEME-ChIP; and this is the consensus sequence. And lastly, for Factorbook, we have a bunch of these heat maps. Our interest, as maybe alluded to a little before, is in comparing this TF to histone marks in the same cell type.
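The average histone profile computation just described, a 2 kb window centered on each summit with the signal averaged position-by-position, could be sketched like this. The flat signal list is a stand-in for a real bigWig track, and the toy summits are invented for the example.

```python
# Sketch of an average histone profile: center a 2 kb window on each
# ChIP-seq peak summit and average the histone signal position-by-position
# across all summits.

def average_profile(track, summits, width=2000):
    """Position-wise mean signal in a width-bp window around each summit."""
    half = width // 2
    profile = [0.0] * width
    used = 0
    for s in summits:
        if s - half < 0 or s + half > len(track):
            continue  # skip summits too close to the track edge
        for i in range(width):
            profile[i] += track[s - half + i]
        used += 1
    return [x / used for x in profile] if used else profile

# Toy track with a 200 bp signal bump at each of two summits.
track = [0.0] * 10000
for s in (3000, 7000):
    for offset in range(-100, 100):
        track[s + offset] = 1.0
profile = average_profile(track, [3000, 7000])
```

The resulting 2000-point vector is what the interactive chart would plot, one curve per histone mark, split into proximal and distal summit sets.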
And other TFs in the same cell type. So here, say we're in A549 looking at CTCF. We can compare CTCF against the histone control and all the other histone marks we have in A549. Each row here is a ChIP-seq peak, sorted from highest to lowest. For histones, this is a 10 kb window; we show the enrichment as a heat map on a normalized scale over that 10 kb window for each histone mark. Likewise, for TFs, we show binding strength across all the other TFs available in this cell type, in a narrower 2 kb window. This chart is perhaps a touch old; the beta site has a slightly different orientation of it, but it's the same idea. We take the ChIP-seq peaks, limit those to 10,000 peaks sorted by ChIP-seq strength, and then compare against all the other histone marks and TFs. The first release of Factorbook had about 550 ChIP-seq data sets. The new release has over 650, in more cell types, and we're also now introducing it for mouse. This still needs a little more work on my part, but it will be released very soon. So the next steps for Factorbook: it will also be open sourced, and it will stay in sync with the ENCODE Project; as new TF ChIP-seq data sets are released from ENCODE, we will immediately import them into Factorbook. And as a new development, we're also giving it a full REST API. So, for instance, all the profile information you can download as JSON and use yourself. You'll be able to download all the motifs and use them in your own programming, in your own projects, very similar to how you'd parse JSON from the ENCODE Project. And lastly, I'd really plead: if you have any new ideas for Factorbook that you'd like to see, just email me. I really want to make this a more useful tool that hopefully the community can use for a wide variety of things. And that's all I have.
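The heat-map construction, peaks sorted strongest to weakest, one row per peak, binned signal over a fixed window, normalized per row, could be sketched like this. The tracks, peaks, and bin counts are toy values; only the sorting, the 10,000-peak cap, and the normalized-window idea come from the talk.

```python
# Sketch of a Factorbook-style heat map: sort one TF's ChIP-seq peaks by
# strength (capped at 10,000), and for each peak build a row of binned mean
# signal over a fixed window from another mark's track, normalized to
# [0, 1] so rows are visually comparable.

def heatmap_rows(track, peaks, window=10000, bins=50):
    """peaks: list of (summit, strength). Returns rows sorted by strength,
    each row a list of `bins` normalized mean-signal values."""
    peaks = sorted(peaks, key=lambda p: p[1], reverse=True)[:10000]
    half, bin_size = window // 2, window // bins
    rows = []
    for summit, _ in peaks:
        row = []
        for b in range(bins):
            lo = summit - half + b * bin_size
            hi = lo + bin_size
            seg = track[max(0, lo):max(0, hi)]
            row.append(sum(seg) / len(seg) if seg else 0.0)
        top = max(row) or 1.0
        rows.append([v / top for v in row])  # normalize each row to [0, 1]
    return rows

# Toy track with signal around 19000-21000; two peaks of differing strength.
track = [0.0] * 40000
for i in range(19000, 21000):
    track[i] = 5.0
rows = heatmap_rows(track, [(20000, 8.0), (5000, 3.0)])
```

The strongest peak ends up in the first row with a bright center, while the weak peak, which sits in a region with no signal, produces an all-dark row, which is exactly the gradient the real heat maps show.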
And in particular, I'd like to thank Sonya, who really built the primordial encyclopedia; I took over for her. And that's all I have for now.

So we have time for some questions.

Do you calculate the profiles on the fly, or do you have to precalculate them?

It's all pre-stored.

Okay, so every time, okay. Quick question from me: the things that are already on Factorbook, are those in sync with the current ENCODE data?

On beta, for the most part, yes.

Okay, cool, thanks.

For www, no.

Okay. So when you define proximal from the TSS, is the TSS specific for a major transcript, or how do you decide that?

Oh, that's a good question. I'm not sure; I'd have to look that up for you.

Okay. Cool, so thanks, Michael, again. I just want to make sure you know that all the slides in this session can be downloaded from the conference website, 2015.org, and all the handouts are there. So just in case you didn't follow all the procedures in the pre-sessions or this session, you can always go to the website and download the PowerPoint and PDF files. If the PowerPoint is not there yet, it will be available after the conference.