All right, let's get started with the next session. As you know, we've moved the tables out a little bit, so you should have a little more room to move around. Almost all of these tables now have power or power strips. If yours doesn't, poke around and look for one nearby, or share your power if someone needs it. Sometimes those strips are under the table, so look under your table. A couple of people have had some issues with the Wi-Fi, but pretty much everyone should be on. If you have any trouble, talk to me, talk to Chris, or find someone else from the DCC, and we can get you on. The first speaker in this session is Zhiping Weng from the University of Massachusetts Medical Center. She's the head of the ENCODE Data Analysis Center, and she's going to talk about the ENCODE Encyclopedia.

So let me get started. ENCODE, as you may realize, stands for the Encyclopedia of DNA Elements, and I'm going to talk about the E in ENCODE. You have heard from Mike Pazin about the goals, so I'm just going to give you a little bit of the structure. In ENCODE we have excellent data-producing groups, and then the Data Coordination Center, which organizes this meeting. The Data Analysis Center takes the data, and one of our goals is to build the encyclopedia, which is more like a synthesized version of the data. We try to stay very close to the data, and we try to be very intuitive and easy to interpret. You can access the ENCODE Encyclopedia from this tab on the ENCODE portal, and if you click "About" you will see the introduction. We organize the encyclopedia into three levels: the ground level, the middle level, and the top level. Taking a closer look, the ground level stays really close to the data, the middle level is a somewhat higher level of integration, and the top level is even more integration. Today I'm going to go very quickly through these components of the encyclopedia.
But most of my talk will focus on a few things from each level. For example, I will spend a few more slides on transcription factor ChIP-seq peaks and a tool that we have built to visualize them, and then on how we build promoter-like regions and enhancer-like regions in a cell-type-specific manner. Lastly, I will spend more time talking about how to link enhancer-like regions with target genes. So just a quick run through the components in each level. As I said earlier, the ground level is typically derived directly from experimental data. We try very hard, working with the DCC, the Data Coordination Center, to build uniform processing pipelines. Then we run the raw data through these pipelines and detect annotations, such as the peaks of transcription factor binding, the peaks of histone mark enrichment, and gene expression quantification. So one component is gene expression. You have copies of all these slides, and in these slides we show you which ENCODE groups produced the data and which ENCODE groups work on data visualization or analysis. In the interest of time I'm not going to name everybody, but you can see which groups are contributing which components. So for example, for gene expression, you can see all these cell types and then, for a particular gene, what the expression pattern looks like. For transcription factor binding, we have many ChIP-seq data sets. I will spend a few slides talking about a tool my lab has built for visualizing aggregate peaks, which are enriched genomic regions that are bound by transcription factors. This tool is called Factorbook. You can access it through factorbook.org, and it is also linked directly from the ENCODE portal. The goal is to visualize and summarize the data centered on individual TFs. Such summarized data may not be easy to access, and may not be easy to see on the genome browser.
And then we also try to link them with histone profiles, sequence motifs, and heat maps of these peaks across the whole genome. So it's a TF-centric view, and we hope to build the same process for histone ChIP-seq and DNase data sets as well. If you look at Factorbook, you can see at a glance how many TF ChIP-seq data sets the ENCODE consortium has produced: hundreds of data sets. Then you can browse factor by factor. When you go into the page for each factor, you can see the 3D structure, distilled function information, and links to external resources. Then, if you take all the ChIP-seq peaks across the genome, we show aggregate plots like this: what does each kind of histone mark look like in the vicinity of these TF ChIP-seq peaks? We also have data for histone profiles in two cell types, and you can see them as well. We segregate the ChIP-seq peaks into two groups, transcription start site (TSS) proximal and TSS-distal regions, and you can see they have pretty distinct profiles for histone occupancy. You can also look at the enrichment of sequence motifs in these ChIP-seq peaks. We show the top five motifs. This is motif one. We filter our data, so you can see all five motifs that are enriched, in this case in BACH1 ChIP-seq peaks. These two we think are more likely to be biologically meaningful, and those may be false positives, but we still keep them in case they are cofactor motifs. And this is the heat map I mentioned earlier. Each column is a ChIP-seq peak, and the entire heat map is all the ChIP-seq peaks across the entire genome. So this is a pivot TF, in this case BACH1, and you can look at all these ChIP-seq peaks for BACH1 and see whether or not there are histone marks around these peaks across the genome that are correlated with the ChIP-seq signal of BACH1 or with other TF binding.
So that was just a very quick summary of one method for visualizing these ChIP-seq peaks. Going back to the ground-level annotations: we also have a lot of histone mark ChIP-seq data, and we detect peaks using these data sets. You have heard about DNase-seq data; we also detect peaks, and also DNase footprints, from these data. Job Dekker's group produces Hi-C data, which is chromatin structure data. You can analyze them and derive topologically associating domains, or TADs, and compartments. Shown here are TAD boundaries in green and the compartments in red and blue. We also have ChIA-PET data, and from ChIA-PET data you can derive links between promoters and distal regulatory elements, visualized as loops here. We also have a lot of RNA-binding protein occupancy data produced by eCLIP-seq, and from eCLIP-seq you can see these peaks and also binding sites. So that summarizes all the ground-level annotations. You can go in, look at each one, and download them. They are all available from the ENCODE portal.

So let me move on to the middle-level annotations. Middle-level annotations integrate multiple types of experimental data and ground-level annotations. I'm going to focus on two types of middle-level annotations. One is predicting enhancer-like regions using biochemical data. Among the many types of biochemical data we have in ENCODE, there are DNase-seq and a whole battery of histone mark ChIP-seq data. From analysis of the individual data types, we found that two of them stand out, DNase-seq and H3K27ac ChIP-seq, as excellent marks for enhancer activity. So we analyzed different methods of combining these two types of data to see how we can best predict enhancer-like regions. We want the method to be unsupervised, because we want it to be applicable to a large number of cell types.
So the rationale is that we use the very rich matrix of histone mark and DNase-seq data built for mouse, because we have a number of histone modification ChIP-seq data sets from Bing Ren's lab, and we have RNA-seq data, DNA methylation, and DNase-seq across a whole matrix of mouse development. We have run these data through the uniform processing pipelines. Here is the whole matrix of data: the columns indicate developmental time points and the rows indicate cell types. As you can see, for E11.5 we have a number of cell types for which both H3K27ac and DNase are available in the same cell type. So we decided to combine the data available in each cell type with experimental validation of enhancers generated by Len Pennacchio's group and integrated into the VISTA database. That way we can compare these data with functional data to see what kind of model will be best at predicting enhancer-like regions. A little bit about VISTA: it is based on mouse transgenic assays. You take pieces of DNA and inject them into a mouse embryo with a reporter, and you see in which cell type the reporter is expressed, which indicates enhancer activity in that specific cell type. Over 2,000 regions have been tested, and for several tissue types, such as limb and brain subregions, there are over 200 active enhancers. So we decided to combine these with ENCODE ChIP-seq data and compare to see which model works best. These are different regions. You can see that forebrain enhancers light up in forebrain, midbrain enhancers light up in midbrain, and so on and so forth. In model testing, we considered a number of options. If you want to detect enhancer-like regions, as I said earlier, DNase and H3K27ac peaks work really well. But how do we center our predictions? Do we center on DNase peaks, or do we center on H3K27ac peaks? This question actually came up earlier, I think during Chris's talk.
Someone in the audience asked, have you used histone marks? So we considered which one works better. And how do we rank these peaks? Do we rank them by p-value? Do we rank them by ChIP-seq or DNase-seq raw signal? Do we incorporate additional signals, such as DNA methylation or other histone marks? We used the VISTA enhancer data set to evaluate all these options. We can come up with a model, predict the peaks, and compare them with VISTA regions to see whether they are true positives, false positives, or false negatives. Obviously, because the VISTA enhancer regions are relatively few compared with all possible regions in a genome that could be enhancers, we don't really have a good handle on true negatives. But the other three evaluations we can make pretty well. Because we don't have a good handle on true negatives, we use precision and recall. Precision asks: among all the predictions you make, what percentage are actually true? Recall is just sensitivity: among all the VISTA enhancers, how many can your model capture? The PR curve is particularly relevant in genome-wide predictions, where the fraction of positives is very small, much less than 50%. So we get a curve, and as you can see here, there are five different ways of scoring these peaks. The blue curve, on top, behaves the best, and that is the average rank of the DNase signal and the H3K27ac signal. So this is a very, very simple model, easy to interpret and easy to apply to multiple cell types. In the end, we decided to center our predictions of enhancer-like regions on DNase peaks. They work better than centering on H3K27ac peaks. We also found, using our training and testing sets, that incorporating additional data didn't significantly improve performance on our test set, and we really want to stick with a simple, highly interpretable model.
So we decided to leave this additional information out for the time being. So how does the model work? You take DNase peaks in a particular cell type and measure the signal, and then you also take the H3K27ac signal in the same cell type over a window. It has been mentioned earlier that very often you will observe the DNase signal in the trough of the H3K27ac signal, so the combination of these two is particularly powerful. We get the rank for each, we average the ranks, and the average rank is basically our prediction of how likely a region is to be enhancer-like. We annotate the region with the DNase peak in the middle, and the extent of the entire region according to the H3K27ac peak. So you can visualize them. Here is an example of such an enhancer-like region in neural tube. You can see that this region has been tested using reporter mouse transgenic assays, it lights up in neural tube, and the prediction is spot on. You can see this very beautiful signal in both H3K27ac and DNase, which indicates that it's a very strong prediction. Just very quickly: we applied a very similar approach to predict promoter-like regions, and in this case we are trying to predict gene expression. We used the same approach to evaluate different models, we looked at all kinds of histone marks and DNase, and we want to apply the method to as many cell types as possible. So here we are trying to predict the expression of a gene. We rank the expression, and then we can rank genes by each possible histone mark or by DNase and see which one works the best. In this example, each dot is a gene. H3K4me3 is very good at predicting expression, and you get an R-squared of 0.63. DNase by itself is not as good as H3K4me3, with an R-squared of only 0.37, and the H3K27ac signal by itself is also not as good as H3K4me3.
But, very similar to what we found before, if you combine the H3K4me3 signal with a healthy dose of DNase signal, you can improve the correlation to an R-squared of 0.65, and you also get a lot fewer ties at high expression levels. So that's what we ended up using: a simple model combining H3K4me3 and DNase for predicting promoter-like regions. Here's an example in a lymphoblastoid cell type. You can see the promoter-like regions, with DNase signal in the middle and flanking H3K4me3 signal, agreeing very strongly with the transcription start sites of GENCODE genes. We have built a visualization tool for viewing these enhancer-like and promoter-like regions in a cell-type-specific manner, and here is the link. It is a proof of concept, so if you have time during the break, please give it a try. We're very interested in knowing whether or not you find this tool useful, and eventually we would like to implement it at the DCC and integrate it into the portal. So here is the visualization. You first choose the genome you are interested in, and you can query the tool by gene, by SNP, or by the genomic coordinates of the region you are interested in. In the matrix down below you can pick the cell type that you work on, and at the end you can visualize the region in either the UCSC or the WashU genome browser, where it shows up like this. Here is an example region, and you can see the regions we predict to be enhancer-like. Alongside, we display the DNase-seq signal in orange and the H3K27ac signal in blue. Here is one cell type, which is thymus, the next cluster of data is another cell type, then yet another cell type, and you can turn on any other tracks you want to compare with the predictions here.
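The average-rank scoring described above for enhancer-like regions can be sketched roughly as follows. This is only a minimal illustration, not the Data Analysis Center's actual pipeline: the function names, the tie handling, and the signal values are all assumptions made for the example.

```python
def rank_descending(values):
    """Return 1-based ranks (1 = strongest signal); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def average_rank_scores(dnase_signal, h3k27ac_signal):
    """Average the DNase rank and the H3K27ac rank for each DNase peak.

    A lower score means the region looks more enhancer-like.
    """
    dnase_ranks = rank_descending(dnase_signal)
    h3k27ac_ranks = rank_descending(h3k27ac_signal)
    return [(d + k) / 2 for d, k in zip(dnase_ranks, h3k27ac_ranks)]


# Hypothetical signals for four DNase peaks in one cell type.
peaks = ["peak_a", "peak_b", "peak_c", "peak_d"]
dnase = [12.0, 3.5, 8.1, 8.1]
h3k27ac = [20.0, 15.0, 2.0, 9.0]

scores = average_rank_scores(dnase, h3k27ac)
ranked_peaks = [name for _, name in sorted(zip(scores, peaks))]
print(ranked_peaks)  # ['peak_a', 'peak_d', 'peak_b', 'peak_c']
```

Because only ranks are used, the score needs no training and does not depend on the absolute scale of either assay, which fits the stated goal of an unsupervised method that can be applied across many cell types.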
For those cell types where we do not have both DNase and H3K27ac data, we have also made similar predictions of enhancer-like regions using just the DNase-seq data or just the H3K27ac data. Each of them individually is a very good predictor of enhancer activity. You can visualize these with the same tool, and eventually we want to integrate it into the DCC and show it with this kind of matrix view, so it's much easier to pick and choose the data sets. So those were the examples of middle-level annotations.

For the top-level annotations, we aim to integrate a broad range of experimental data as well as the lower-level annotations. One component of the top-level annotations is the chromatin states. Here is an output of ChromHMM, and this image is actually from Epilogos. Each row is the ChromHMM output in a particular cell type, across all these cell types. You can see that the promoter state, in red, is highly conserved across cell types, yellow is the enhancer states, and green is the transcribed gene body states. So this gives you a bird's-eye view of ChromHMM states across many cell types. You will also hear about RegulomeDB and HaploReg, and also FunSeq2. These are ENCODE methods for visualizing GWAS SNPs along with annotations.

The last bit I want to talk about is predicting the target genes of enhancers, which is a component of the top-level annotations. You have heard in earlier talks how important it is to predict the target genes of enhancers, because very often you want to know which genes are causal for which phenotypes. Again, very similar to our earlier approach for predicting enhancer-like regions, here we want to create benchmark data sets for comparing methods that are very commonly used. You have heard about the correlation-based method in the previous talk. Basically you'll have a DNase signal across a panel of cell types.
If the signals correlate between the enhancer and the gene, you may say that the enhancer is regulating that gene. That's one of the predominant methods in the field, and we want to know how well it works. We also want to know whether incorporating additional data improves the performance. And obviously, as a consortium, we want to get input from different groups and compare other methods. So very quickly, for benchmark data sets, we started out with promoter capture Hi-C from Osborne's group. Basically this is like Hi-C, but enriched for promoter-based links. We then started to integrate additional data sets using GM12878, which is a tier 1 ENCODE cell line for which we have so much data. We incorporated ChIA-PET data, which is again chromatin structure data, using RAD21 as the factor you pull down. Then we incorporated eQTLs in lymphoblastoid cells from HaploReg, and then the very high-quality Hi-C loops from the Aiden lab. You can see that if you intersect these three data sets, they don't intersect very much between pairs of them. The middle indicates how many of the enhancer-promoter links intersect with promoter capture Hi-C. So we basically take the regions in the middle to be our training set. This is ongoing work and we have additional data, but the result I'm showing you here is pretty stable. If you look at the enhancer-target gene links in this region in the middle, you can plot the distribution of the distance between the enhancer and the target gene. I think people asked earlier, do you get the nearest gene as your target gene? This is absolute distance in kb. You can see that if you make a cutoff at 100 kb, you basically lose a third of the links, but if you go up to 500 kb, you lose only 3%. So there is a very strong distance dependence. So those were the positives.
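As a small worked illustration of the distance cutoff trade-off just described, the fraction of benchmark links that a hard cutoff discards can be computed as below. The distances here are made up for the example and do not reproduce the one-third and 3% figures from the talk; the function name is likewise an assumption.

```python
def fraction_lost(distances_kb, cutoff_kb):
    """Fraction of enhancer-gene links whose absolute distance exceeds the cutoff."""
    if not distances_kb:
        raise ValueError("need at least one link")
    lost = sum(1 for d in distances_kb if abs(d) > cutoff_kb)
    return lost / len(distances_kb)


# Hypothetical enhancer-to-target-gene distances, in kb.
example_distances_kb = [5, 20, 45, 80, 95, 120, 150, 300, 450, 700]

lost_100 = fraction_lost(example_distances_kb, 100)
lost_500 = fraction_lost(example_distances_kb, 500)
print(lost_100, lost_500)  # 0.5 0.1
```

Running the same calculation over a range of cutoffs on a real benchmark would give the kind of distance distribution shown on the slide.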
In order to train a method, you also need negatives. So we basically pick the genes that are also within 500 kb but are not linked by any of those data sets: promoter capture Hi-C, high-quality Hi-C, those four data sets. Those we treat as negatives to evaluate the model. We divide our data into a training set, a validation set, and a test set, and roughly 5% of the cases are positive. This is pretty typical, because many of the candidate links are actually not real. So we evaluated a bunch of correlation-based methods. You take the signal of different types of histone marks or DNase in the enhancer region and at the promoter, and then you calculate the correlation across a panel of cell types or tissue types. This is not as simple as it sounds, because you could correlate the raw signal, or you could z-score normalize your raw signal and then correlate the z-scores; you could correlate DNase, or you could correlate H3K27ac; you could use all ENCODE cell types, or you could use Roadmap cell types; you could use Pearson or Spearman. Not to belabor the point, there are a lot of choices, and we wanted to know which one works better. Here you can see a ROC curve, true positive rate versus false positive rate. The best method will go all the way up and then all the way across. As you can see, black works the best, which is the average rank of DNase and H3K27ac, and it gives you an area under the curve of 0.76. Normalized signal works better than raw signal, and DNase works a bit better than H3K27ac. This is the same performance, just plotted as a precision-recall curve. As I mentioned, only 5% of the links are true positives, and that's why it's very important to do the PR curve calculation, because the ROC curve normalizes away the ratio between positives and negatives. So for a genome-wide test, PR is a more realistic evaluation.
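A short worked calculation may help show why the PR curve is the more realistic view when only about 5% of candidate links are positive. The sketch below takes one operating point, with the same true positive rate and false positive rate, and compares the resulting precision at 5% positives and at a balanced 50%. The operating point, the candidate count, and the function name are assumptions chosen for illustration, not numbers from the talk.

```python
def precision_at(n_total, positive_fraction, tpr, fpr):
    """Precision implied by a ROC operating point at a given class balance."""
    positives = n_total * positive_fraction
    negatives = n_total - positives
    true_positives = positives * tpr    # recall times number of positives
    false_positives = negatives * fpr   # false positive rate times negatives
    called = true_positives + false_positives
    return true_positives / called if called > 0 else 0.0


# Same point on the ROC curve (TPR 0.5, FPR 0.1), two different class balances.
skewed_precision = precision_at(10_000, 0.05, tpr=0.5, fpr=0.1)
balanced_precision = precision_at(10_000, 0.50, tpr=0.5, fpr=0.1)

print(round(skewed_precision, 3))    # 0.208
print(round(balanced_precision, 3))  # 0.833
```

The ROC coordinates are identical in both cases, but at 5% positives roughly four out of five predicted links would be wrong, which is exactly what the precision axis exposes.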
So you can see that none of these methods works great, because on a PR curve the best method would start at the top and stay there all the way across before coming down. So none of these works really great. Here are some examples. In some cases, correlation works very well. In this case, the target gene is TR-10, and here is the enhancer that correlation predicts to be regulating that promoter. If you plot the H3K27ac signal across a bunch of cell types, where each dot is a different cell type, you see that the promoter lights up only in GM12878 and the enhancer also lights up only in that cell type. So this is most likely a real link. Here is another example that is likely not real, because, as you can see, the average H3K27ac across the promoter is very high, but in GM12878 it's actually not that great. So this is likely a false positive predicted by correlation. We talked earlier about distance information. Distance is a very strong feature in predicting enhancer-gene links, but if you use a hard cutoff like 100 kb, you miss a third of the links. So can you be more intelligent? Can you use a quantitative method to incorporate distance? You can do something very simple just to see how big the impact is. You can rank target genes by their distance to an enhancer and see how well that works. Then you can compare that with your correlation, and you can average the ranks from distance and correlation to see whether it improves, just to tell how much distance is contributing. Here the black line is the performance if you just use correlation, and the green line is if you just use distance. You see that if you just use distance, you actually do better than if you just use correlation. But if you combine correlation with distance in a very simple-minded way, just averaging the ranks, you do better than both of them. So this is the ROC curve and this is the PR curve.
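The simple-minded combination just described, averaging the distance rank and the correlation rank for each candidate gene, might look roughly like this. It is a sketch under stated assumptions: the gene names, signals, and distances are hypothetical, the signals are assumed to be already normalized, and ties are broken by input order for brevity.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length signal vectors across cell types."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat signal carries no correlation information
    return cov / sqrt(var_x * var_y)


def ranks(values, descending):
    """1-based ranks; ties are broken by input order (a simplification)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def rank_candidate_genes(enhancer_signal, candidates):
    """candidates: list of (gene, distance_kb, promoter_signal) tuples.

    Returns (combined_rank, gene) pairs, best candidate first.
    """
    correlations = [pearson(enhancer_signal, sig) for _, _, sig in candidates]
    distances = [abs(d) for _, d, _ in candidates]
    corr_rank = ranks(correlations, descending=True)   # high correlation is better
    dist_rank = ranks(distances, descending=False)     # short distance is better
    combined = [(c + d) / 2 for c, d in zip(corr_rank, dist_rank)]
    return sorted(zip(combined, (gene for gene, _, _ in candidates)))


# Hypothetical enhancer signal across four cell types, and three nearby genes.
enhancer = [0.1, 0.2, 3.0, 0.1]
candidates = [
    ("GENE_A", 40, [0.2, 0.1, 2.5, 0.3]),
    ("GENE_B", 10, [1.0, 1.1, 0.9, 1.0]),
    ("GENE_C", 300, [0.4, 0.6, 1.5, 1.2]),
]
ranking = rank_candidate_genes(enhancer, candidates)
print(ranking)  # [(1.5, 'GENE_A'), (2.0, 'GENE_B'), (2.5, 'GENE_C')]
```

In this toy case the nearest gene (GENE_B) is not the top candidate, because its promoter signal does not track the enhancer, while the distant but moderately correlated GENE_C is pushed down by its distance.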
The improvement on the PR curve is even more substantial, and this is the curve we actually care about for genome-wide prediction. So, a very quick conclusion for the predominant method for predicting target genes, which is correlating DNase signal or histone mark signal across a panel of cell types. DNase slightly outperforms H3K27ac, and it's better to use z-score normalized signal than raw signal, which makes sense, because raw signal is not normalized and is probably going to be very dependent on sequencing depth and other issues. The Pearson correlation coefficient outperforms Spearman, and ranking by correlation coefficient outperforms ranking by p-value, which is really good news, because the p-value takes a lot longer to compute. And if you incorporate distance information, that drastically increases performance. Can you do even better than just ranking by distance and correlation? Sure. What we aim to do is develop a machine learning algorithm. Specifically, we use a random forest model. We want a minimal model that can be applied to the whole battery of cell types and tissues for which we have ENCODE data. Here we incorporate DNase signal, H3K27ac signal, the correlation, and some sequence-dependent features. Then, for some of the ENCODE cell types for which we have a lot more data, we can make a more comprehensive model. So just very, very quickly, here is what we get at this point. We use these features: distance, average conservation, average DNase signal, average H3K27ac signal; sequence-based features, which are just k-mers, trinucleotides and hexanucleotides; and the correlation between enhancer and target gene. We combine all of that in the random forest machinery. You can see that, again, black is correlation alone and orange is the average rank of distance and correlation. This purple line is the random forest using the minimal model with the features that I mentioned on the previous slide.
So you can see a marked improvement. This is the ROC curve, and there is an even more drastic improvement on the PR curve. This is really the region we care a lot about, and you see a very big improvement. The random forest also gives you a quantification of which features are important. The feature that contributes the most is distance. Below that, you see that a lot of promoter features contribute quite a bit, slightly more than the features based on enhancers. I mentioned that for some of the ENCODE cell types we can have a comprehensive model, and here we just want to share a few results on gene expression. How well can you do when you add other features, specifically gene expression? That really helps. When you add gene expression, you get this purple line. It's definitely better on the ROC curve, and it's again markedly better on the PR curve. If you look at the feature contributions, expression of the target gene is indeed a very important feature. So basically we have a pretty simple model, and we have generated results using GM12878. If you go to the visualizer I mentioned earlier for enhancer-like and promoter-like regions, there is a tab for visualizing enhancer-target gene links in GM12878. We are in the process of implementing these models, and after discussion with other ENCODE groups we're going to implement them across the entire panel of ENCODE cell types. We are also in the process of evaluating additional training and testing data. We recently found ChIA-PET of Pol II to work really well, so we have some new results. They don't change the conclusions here, but overall the performance is better. We can test additional features once we have a larger training set, because we don't want to overfit. Obviously, predicting the target genes is one of the key challenges, and I think a lot of people are interested in it.
So let me just conclude my talk by thanking the people who did the work. In my lab there are five really talented students who contribute to the work on the encyclopedia. Jill Moore is the ringleader for these five people. We had lots of discussions with John's group about DNase data, and we worked with Mark Gerstein's group on predicting enhancer-like regions, with Mark and Anurag. We're very grateful for all the wonderful data from the ENCODE Consortium, the VISTA enhancers, and the mouse data, and it's a fun place to work. Okay, any questions?

I was wondering if you could say some more about how, with your definition of enhancer and promoter, you're able to discriminate between enhancers and promoters, because sometimes you'll have DNase and H3K27 acetylation at promoters too. And what is the overlap between your enhancer and promoter sets?

Right, that's a really good question. If you use both DNase and H3K27ac by combining them, we do get a lot of regions that are just promoters; they are right at the TSS. Right now we don't have a functional definition of whether these are actually promoters, or promoters that may also function as enhancers. So right now we don't discriminate them in any other way, but in the visualization we color them differently. We're hoping to serve end users who will be more locus-specific; they'll go in, and if a region is really close to a TSS, we give it a different color. We also make predictions using H3K4me3. So if the prediction using H3K4me3 gets a higher signal than the prediction using H3K27ac, this is probably more likely to be a promoter region.

Just a very technical question. I'm very interested in predicting targets for enhancers. Is your method available, and what kind of data do you need as input? Because I guess you need a training set and a test set.

Right.

Do you need Hi-C?

So the first section of the last portion of the talk, I know there were lots of slides.
So it was about building this benchmark. We have this benchmark that includes ChIA-PET, Hi-C, promoter capture Hi-C, and eQTLs, so anything that could indicate enhancer-gene links. We compile those as a benchmark. We will release the benchmark as well, so that people can compare their performance. That's the yardstick for us to evaluate all these different options for the methods.

If I have a few SNPs that I'm interested in, can I run your method?

Well, sure. Hopefully we have already run the method on all the ENCODE data types, unless you have your own data type. You should be able to; we will definitely, in the ENCODE spirit, release the method. So here are the features we use for the model. And if you only have DNase or only have H3K27ac, we believe it still works pretty well. So we could make variants of the method and make them available. But right now the plan is to build this encyclopedia, so we run the method on the whole panel of ENCODE cell types, and you can just go and visualize your region with your SNP.

Thank you.

Can you compare the performance of your predictions with ChromHMM and Segway? I mean, they're doing a lot of similar things, especially for enhancers and promoters.

Yeah, that's a really good question. Are you asking about the performance of predicting the enhancer-like regions or of predicting links?

Maybe the functionally validated ones, yeah, as well as links.

I think it's more appropriate to compare the performance for predicting the enhancer-like regions themselves, because ChromHMM doesn't really predict target genes (Jason's sitting there, unless I'm wrong about it); it just identifies the regions. I forgot to include those slides, but we have done enrichment analysis, and the regions we picked out using this simple approach are very strongly enriched in the enhancer states, the regions that are assigned by ChromHMM to an enhancer state.
So active enhancers are highly enriched, but as for exactly how much better one works than the other, we don't have a really good gold standard to compare. What we're saying is that this is a very intuitive approach, really simple, and you can see the signal in front of your eyes. There is good reason to think perhaps ChromHMM could do better, because it's integrating across multiple histone marks, right? It's using more input. Certainly, as part of the encyclopedia, we also have ChromHMM predictions available, so you can visualize them in combination.

Maybe she needs a microphone. Yeah. The microphone is coming.

I guess this is just a little bit of elaboration on the question that was just asked. If I want to run this analysis on a bunch of SNPs, I understand and I take your point that if we have leads and a bunch of, like, four or five SNPs, or 10, or even 100 SNPs, then okay, I can go one by one, visualize this, and shortlist that way. But if I want to start with like 50,000 or, you know...

You were saying that's way too many.

Way too many. Then is there a way to access this, or will the code be available for us to use and shortlist further?

With the visualization tool we have built, you could batch download the data if you so choose, so we will make the download available. But I also think we should allow people to upload a file specifying a list of regions. Would that address your question?

Yes.

Okay, so we can easily implement that functionality. You could upload, say, a BED file with 3,000 regions and then download the results.

Thank you.
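For the batch upload discussed in the last answer, a region list would most likely arrive as a BED file. The following is a minimal sketch of reading such a list, assuming the standard three required BED columns (chromosome, 0-based start, end) with an optional name in the fourth column. It is not part of any ENCODE tool; the function name and example coordinates are hypothetical.

```python
def parse_bed(text):
    """Parse BED-formatted text into (chrom, start, end, name) tuples."""
    regions = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        # Skip blank lines, comments, and browser/track header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) < 3:
            raise ValueError(f"line {line_number}: expected at least 3 columns")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if start < 0 or end < start:
            raise ValueError(f"line {line_number}: invalid interval {start}-{end}")
        # Fall back to a coordinate-based name when column 4 is absent.
        name = fields[3] if len(fields) > 3 else f"{chrom}:{start}-{end}"
        regions.append((chrom, start, end, name))
    return regions


# Hypothetical regions of interest, for example windows around SNPs.
example_bed = """track name=my_snps
chr1\t1000\t1500\tsnp_window_1
chr2\t200\t300
"""
regions = parse_bed(example_bed)
print(regions)
# [('chr1', 1000, 1500, 'snp_window_1'), ('chr2', 200, 300, 'chr2:200-300')]
```

Each parsed region could then be intersected with the downloaded enhancer-like and promoter-like region sets to shortlist SNPs without visiting them one by one.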