All right, thank you for coming out this morning. My name is Jill Moore. I'm a graduate student in Zhiping Weng's lab at UMass Medical School. And today, I'm going to be talking about two different topics. The first is the encyclopedia that the ENCODE Consortium is working on. And the second is variant annotation using two different tools from ENCODE Consortium groups. So the first question is, where is this encyclopedia? So as Mike Pazin already talked about, ENCODE stands for Encyclopedia of DNA Elements. And so far, ENCODE data producers have generated thousands of experiments in humans, including DNase-seq data, transcription factor ChIP-seq, histone mark ChIP-seq, and a wide variety of other experiments. But the question remains: how do we actually integrate these experiments and assays together? And how do we integrate data from other groups, such as Roadmap, so we can actually annotate these functional regions of the genome? And then how do we build and visualize this encyclopedia from these functional regions? So this is a list of different genomic annotations that you can look at, and it certainly isn't a complete list. There are some that are quite on the simple side, in that they only require one type of experiment or one type of assay to look at, such as gene expression or the processed peaks from DNase-seq or ChIP-seq. But then you have more complicated annotations that require you to merge and combine different data sets. So for example, if you want to annotate regions of the genome, like ChromHMM does, which we'll talk about later today, or perhaps predict target regions of regulatory elements. And what we're going to be talking about today is actually annotating candidate enhancers and promoters. So right now we've been working on this, and this is our current pipeline for doing so. And this work is all done by Michael Purcaro in our lab. And our first step was to actually define what we call DNase master peaks.
And so these peaks are a set of unique, non-overlapping peaks, and they're representative of the DNase-seq peaks that you see in a region. And so these peaks are going to span all data sets, and they collectively cover about 20% of the genome. So we incorporate both ENCODE and Roadmap data together to generate these master peaks. And so this is done by John Stam's lab. And this is just a quick visual of how this is done. So you have these different cell types here, as well as tissue types from Roadmap. And you have all of these called DNase-seq peaks. So you actually merge them, so if you look visually, they're all on top of each other, and this is your linear genome. And we are going to pick out a representative peak for this entire cluster. So in this case, we would use the peak with the highest signal, which in this case is this peak that's labeled 48. So by doing this, we're able to get a representative peak for all of these clusters of DNase-seq peaks across cell and tissue types. Next, after we have this list of master peaks, we separate them into two groups. We have the TSS-proximal group. So these are all of the master peaks that are within a two-kilobase window of a TSS. And then we have all the other peaks, which are the distal peaks. Our next step is to then annotate these peaks using other data sets. So one thing that we do is intersect these master peaks with transcription factor ChIP-seq peaks from all different cell types. And the other step is actually looking for enrichment in histone mark signal. So the way we actually do this is we ask whether a peak is enriched in a particular signal compared to random regions of the genome that we select. And what this looks like, this is an example here done in GM12878 cells. This is looking at distal DNase master peaks and the enrichment in this signal compared with the background. So the background here is this light green.
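The master-peak selection and the proximal/distal split described above can be sketched roughly as follows. This is only an illustration, not the pipeline's actual code: the function names, the half-open coordinate convention, and the reading of "within a two-kilobase window" as "a TSS lies within 2,000 bp of the peak on either side" are all assumptions.

```python
from typing import List, Sequence, Tuple

# One peak on a single chromosome: (start, end, signal), half-open coordinates.
Peak = Tuple[int, int, float]


def master_peaks(peaks: List[Peak]) -> List[Peak]:
    """Cluster overlapping peaks and keep the highest-signal peak per cluster."""
    masters: List[Peak] = []
    cluster: List[Peak] = []
    cluster_end = 0
    for peak in sorted(peaks):
        start, end, _ = peak
        if cluster and start < cluster_end:
            # Overlaps the current cluster: extend it.
            cluster.append(peak)
            cluster_end = max(cluster_end, end)
        else:
            # Close the previous cluster and start a new one.
            if cluster:
                masters.append(max(cluster, key=lambda p: p[2]))
            cluster = [peak]
            cluster_end = end
    if cluster:
        masters.append(max(cluster, key=lambda p: p[2]))
    return masters


def is_proximal(peak: Peak, tss_positions: Sequence[int], window: int = 2000) -> bool:
    """True if any TSS falls within `window` bp of the peak (assumed definition)."""
    start, end, _ = peak
    return any(start - window <= tss <= end + window for tss in tss_positions)


# Toy peaks pooled from several cell types; signals are made up.
pooled = [(100, 200, 12.0), (150, 260, 48.0), (240, 300, 5.0), (1000, 1100, 7.0)]
masters = master_peaks(pooled)
proximal = [p for p in masters if is_proximal(p, tss_positions=[2100])]
distal = [p for p in masters if not is_proximal(p, tss_positions=[2100])]
```

Because clusters never overlap one another, the representative peaks chosen from them are non-overlapping as well, which matches the description of the master peaks.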
So you can see that most of these random regions that we select are not enriched in H3K27 acetylation, but these TSS-distal master peaks that we see are actually highly enriched in H3K27 acetylation signal. So by essentially creating this background distribution, we can create a cutoff and select anything that's above this cutoff as being enriched in this signal. So for this process, we focused on four different histone marks. We did H3K4me3, H3K9 acetylation, H3K27 acetylation, and H3K4me1. And each one of these has slightly different properties. So for example, H3K4me3 is known to be enriched at actively transcribed promoters, whereas H3K27 acetylation is usually enriched at active enhancers. So right now, these are the current annotations that you can use and apply to your own research. So we have two groups of regulatory elements, proximal regulatory elements and distal regulatory elements, and these are based on the proximal and distal DNase master peaks. We also have two groups that incorporate transcription factor binding. So these are the proximal and distal master peaks that are annotated with the transcription factor peaks. And then finally, we have candidate promoters and candidate enhancers, and these are the proximal and distal master peaks with the enrichment in the histone marks as well. So the next question is, how can you access these annotations? So this is the ENCODE portal that Mike Cherry showed, and here if you look under the data tab, there's a tab that's labeled annotations, and so if you go to that tab, you'll actually see a brief explanation of what I'm talking about, as well as an option to visualize any of these annotations on the UCSC Genome Browser and the WashU Epigenome Browser. And here you actually have a listing of all the tracks that I'm talking about, and you can actually download them locally if you want as bigBed files, or you can download them to your local cluster or server as well.
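The background-cutoff idea described earlier in this part of the talk can be sketched in a few lines. The talk does not say where the cutoff sits on the background distribution, so the 95th percentile used here is purely an assumption, as are the function names and the toy signal values.

```python
import math
from typing import Dict, List, Sequence


def background_cutoff(background_signals: Sequence[float], percentile: float = 95.0) -> float:
    """Nearest-rank percentile of the signal over randomly chosen regions."""
    ordered = sorted(background_signals)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


def enriched_peaks(peak_signals: Dict[str, float], cutoff: float) -> List[str]:
    """Names of master peaks whose histone-mark signal exceeds the cutoff."""
    return [name for name, signal in peak_signals.items() if signal > cutoff]


# Toy data: signal at 100 random regions and at three distal master peaks.
background = [float(i) for i in range(1, 101)]
cutoff = background_cutoff(background)  # assumed 95th-percentile threshold
candidates = enriched_peaks({"peak_a": 96.0, "peak_b": 50.0, "peak_c": 120.0}, cutoff)
```

In this toy setup a distal master peak that clears the cutoff for H3K27 acetylation would be labeled a candidate enhancer; the same test with a promoter-associated mark on proximal peaks would give candidate promoters.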
So this is just a quick visualization of what these tracks look like on the UCSC Genome Browser. So here you actually have a gene with the transcription start site, and on the top are the proximal regulatory elements in dark green, in light green we have the distal regulatory elements, and then we have candidate promoters here, and each one of these tracks is enrichment in a different histone mark. We have distal regulatory elements that are annotated with the histone marks, so we call these candidate enhancers, and you can see each line is enrichment in a different histone mark. And then here we actually have all the transcription factor data that's been intersected with the regulatory elements. So we have both proximal and distal as well. If you want more information about any of these particular regulatory elements, so this is clicking on one of the proximal elements, you can actually see which cell types and tissue types went into creating this element. So for example, for this proximal regulatory element, all these different cell types contributed the individual peaks that went into creating that master peak. These are some other useful tracks to incorporate into your analysis that are on the UCSC Genome Browser, including the different gene tracks, the GENCODE annotations, and there are also other ENCODE tracks such as the integrated regulation track, and then also the genome segmentations from tools such as ChromHMM and Segway. Michael's also incorporated this data on the WashU Epigenome Browser, and so if you prefer to use this, we also have all the tracks on here, and once again, they're separated into proximal and distal regulatory elements, candidate enhancers, and candidate promoters. If you click on one of these regions for more information, you'll once again see which cell types and tissue types went into creating this particular element.
So this slide is just for your own information if you want to use these genome browser links, and so these are just some future directions of the project. Michael's hoping to make this whole project open source, and he's currently working on generating mouse annotations as well, so in case you use mice as a model organism, you can actually use an analogous encyclopedia that's created the same way as the human one. And then finally, we want to add more data, so we want to actually refine our use of transcription factor data, since certain transcription factors bind at enhancers versus promoters. We'd like to incorporate this data into our project, and then finally incorporate other data such as RNA-seq and 3D contact data like ChIA-PET and Hi-C. And then finally, also use tools such as ChromHMM and Segway to further define and annotate our encyclopedia. So now I'm changing gears a bit. I'm going to be talking about variant annotation using RegulomeDB and HaploReg. So the motivation behind this, as I'm sure you've heard at this conference and even earlier today in this session, is that the majority of variants reported by genome-wide association studies are in non-coding regions of the genome. And often the variant that's reported by the GWAS is not necessarily the causal variant. So the motivation is that by using ENCODE data you can actually annotate non-coding regions of the genome and try to figure out how these variants are contributing to the phenotype that you're interested in. So these are two tools from ENCODE groups. The first is RegulomeDB and the second one is HaploReg. And so RegulomeDB is from Mike Cherry and Mike Snyder's labs. And what's nice about this tool is that it actually gives you a score for how likely a variant is to disrupt transcription factor binding. And these scores are based on the different lines of evidence that overlap your variant of interest. So for example, there's evidence such as: has your variant been annotated as an eQTL?
Does it overlap a transcription factor binding site? Does it overlap a DNase peak? So depending on the different combinations of these lines of evidence, you will actually have a different score for each one of these variants, and the lower your score and the higher the letter, the more likely it is to affect transcription factor binding. So if you actually go to the website, this is what it looks like, and you're able to enter your data in a variety of ways. You can enter your SNP IDs directly. You can enter a range that you're interested in, or you can also upload BED files and VCF files if you have a lot of SNPs that you want to investigate. So in this case I just chose a region on chromosome 2 from 20,000 to 30,000. And when you click submit, your results will be presented as follows. So in this particular region there were 44 SNPs, and it will rank them from highest, or most significant, score to lowest score. So if you want to find out more information about one of these SNPs, you can just click the score itself and it'll bring you to a page that has a whole bunch of data on it. So it starts at the beginning with just a small snapshot of the UCSC Genome Browser, and here you'll have ENCODE tracks, for example DNase hypersensitivity sites, transcription factor data, and some other information like conservation and other SNPs in the region. As you scroll down you'll first see data about protein binding. This is where you'll find information on which transcription factors are bound and overlap this region. So in this case we have a protein, CEBPB, which is bound in this region, and you actually have cell type information as well. You also have a tab called motifs, and so there are two types of motifs here. There are motifs that are calculated just based on sequence, which are the PWM category. So this is based just on the position weight matrix for the motif. And then you also have a category called footprinting, which refers to DNase footprints.
So for those also you have cell type information as well as the actual logo, and then you have a reference as well for where these PWMs or these footprints were originally reported. We also have a table for chromatin structure. So this is where you'll find DNase-seq and in some cases FAIRE-seq information. For all these cases you're going to have cell type information as well as the original source of the data, and then any other additional information such as possible chemicals that were added for the assay. And then you'll also have histone modification data, and in this case you actually have ChromHMM annotations, which you'll learn about a little bit later in the session. So we have the predicted regulatory state as well as what tissue type this state is predicted in. So this is all from the Roadmap data. So for example, in this case you have that this variant is in an enhancer that's predicted in the esophagus. So if you want to more systematically analyze your variants, you can also download all this data. So if you go to the downloads page you can download the entire database, so you can investigate a little bit more systematically. And also if you go to regulome.stanford.edu/GWAS you'll actually see curated sets of variants associated with different phenotypes and diseases, and you can download, for example, an entire list of annotated variants associated with aging or asthma. So if you're focused on a particular disease, you can also go about your investigation that way. We're now going to talk about HaploReg, which is a tool from Manolis Kellis's lab. And currently it's on version four, so previously it was version two and three, so this is just a recent version. And it works essentially the same way at the beginning, where if you want to investigate a certain SNP or a certain region you can query that in the box. So here I'm actually looking up the same SNP that we just looked at in RegulomeDB.
This time when you hit submit, however, you'll notice that there's actually a list of SNPs as opposed to just the one SNP that you queried. And this is because HaploReg will list all other variants that are in LD with the variant that you queried. So here the variant that we queried is in red, and by default it uses a European LD structure and it reports all variants that have an r² of over 0.8. So this way you can actually investigate not only the SNP you're interested in but possible SNPs that are also in LD that may be driving this association. So when you look on this page here, it's a summary of each of the SNPs, and so you have information, for example, about the LD in both r² and D′ with the queried variant. You have information about the reference allele, the alternative allele, and the frequencies in different populations. And then you also have an overview of the different data types that your variant intersects. So for example, if it's predicted to be a promoter region or an enhancer region by ChromHMM you have that information there, as well as whether it's overlapping a DNase hypersensitivity site or transcription factor binding site. So if you want more information on a SNP you can actually just click on it and you'll be brought to a detailed view for that SNP. So once again this is actually where all the information is if you want more detail. So you have details about the sequence, for example the reference and alternative allele. You have information about the closest gene, as annotated by both GENCODE and RefSeq. You also have this nice chart, which is new in version four, where you have every single epigenome that was looked at by the Roadmap Epigenomics project as well as a description of the tissue, and then whether or not the variant overlaps an enhancer or promoter or any of the ChromHMM annotated regions. And this is using two different models, which you'll hear about later on.
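For reference, the r² and D′ values mentioned above are standard pairwise LD statistics, and they can be computed from haplotype and allele frequencies as in the sketch below. This is not HaploReg's code: the function names, the made-up frequencies, and the placeholder SNP IDs are illustrative, and only the 0.8 threshold comes from the talk.

```python
from typing import Dict, List, Tuple


def ld_stats(p_ab: float, p_a: float, p_b: float) -> Tuple[float, float]:
    """Return (r_squared, d_prime) for two biallelic variants.

    p_ab is the frequency of the haplotype carrying allele A at the first
    variant and allele B at the second; p_a and p_b are the allele
    frequencies, each assumed to be strictly between 0 and 1.
    """
    d = p_ab - p_a * p_b
    r_squared = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    return r_squared, d_prime


def variants_in_ld(pairs: Dict[str, Tuple[float, float, float]], threshold: float = 0.8) -> List[str]:
    """IDs of variants whose r-squared with the query variant exceeds the threshold."""
    return [snp for snp, freqs in pairs.items() if ld_stats(*freqs)[0] > threshold]


# Hypothetical frequencies (p_ab, p_a, p_b) for two neighbours of a query SNP.
neighbours = {
    "snp_1": (0.30, 0.30, 0.30),  # always inherited together with the query allele
    "snp_2": (0.25, 0.50, 0.50),  # statistically independent of the query allele
}
linked = variants_in_ld(neighbours)
```

Since these frequencies differ between populations, the same pair of variants can pass the threshold in one population and fail it in another.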
So for example, in these cases it's predicted to be an enhancer in these particular cell and tissue types. So it actually gives you the entire matrix all together so you can look at it all at once. And also you have data once again about proteins that are bound. We see once again CEBPB, and then we also have information from eQTL studies as well. And the final piece that I just want to talk about is these regulatory motifs. So once again, just as we saw in RegulomeDB, it overlaps a motif for CEBPB as well. But what's nice is that you also have calculations of the log-odds score for both the reference allele and the alternative allele. So if you actually want to see how much the SNP may be changing the motif, you can actually have a quantitative assessment of that. And so this is just a page talking about the different options that you can use. It's really important that you make sure you use the correct population during your investigation. So you can see there are different HapMap and 1000 Genomes populations listed there. And you can also change the r² value that you want to use. And the final feature I want to talk about is that you can actually investigate gene sets as a whole that have been, sorry, variant sets as a whole that have been associated with different diseases. So these are curated lists where you can look at individual studies or combinations of studies for a particular phenotype. So in this case, we're going to be looking at all these SNPs that are associated with asthma in a European population. And when you run your analysis, you'll see results like we saw before, but you'll also have this nice table where you're actually looking for enrichment in enhancers that have been predicted in different cell and tissue types.
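To make the log-odds comparison between the reference and alternative allele concrete, here is a small sketch of scoring both alleles against a position weight matrix. The two-column matrix is a made-up toy rather than the real CEBPB motif, and the base-2 logarithm and uniform 0.25 background are assumptions; HaploReg's exact scoring conventions may differ.

```python
import math
from typing import Dict, List

# Toy position weight matrix: one dict of base probabilities per motif position.
ToyPwm = List[Dict[str, float]]


def log_odds(sequence: str, pwm: ToyPwm, background: float = 0.25) -> float:
    """Sum over positions of log2(P(base | motif) / P(base | background))."""
    if len(sequence) != len(pwm):
        raise ValueError("sequence and PWM must have the same length")
    return sum(math.log2(column[base] / background) for base, column in zip(sequence, pwm))


toy_pwm: ToyPwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]

ref_score = log_odds("AG", toy_pwm)  # sequence carrying the reference allele G
alt_score = log_odds("AT", toy_pwm)  # same site carrying the alternative allele T
delta = ref_score - alt_score        # positive: the alternative allele weakens the match
```

A large positive difference suggests the alternative allele disrupts the motif, while a negative one suggests it strengthens the match.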
So here, you can actually see there are these different cell types and tissue types, and you have the observed number of SNPs in your particular category overlapping these regions, as well as the expected number if you looked at all SNPs or the expected number if you looked at other SNPs in the GWAS database. And so you can see, for example, for your asthma SNPs, there's enrichment in lung cell lines, which might be clinically relevant to what you're looking at. So just to finish up, I'd like to really acknowledge Michael Purcaro, who did a lot of the work with the encyclopedia, as well as Somya, who recently graduated from our lab, as well as all of the other members of the ENCODE Consortium and the Stam lab, who did all the work with the DNase master peaks. So if there are any questions, feel free to come up to me afterwards. Also, if you want more information, I have more in-depth tutorials for both of these topics on the ENCODE website.