Variant annotation: Joe Moore from UMass will be talking about RegulomeDB and HaploReg.
All right, can everyone hear me? OK, great. So I'm going to be talking today about two tools from ENCODE labs, RegulomeDB and HaploReg. The motivation behind using these databases is a common theme that we've been hearing in the scientific presentations earlier today, and probably something you've run into in your own research: a majority of the variants that GWAS associate with different diseases and disorders are in non-coding regions of the genome. Additionally, a lot of the variants reported by GWAS aren't necessarily causal; they may just be in high linkage disequilibrium with a causal variant. So we really want to get at how these base changes are contributing to these disorders and to the dysregulation of possible genes. We can use data from ENCODE to annotate non-coding regions of the genome, and by synthesizing all this data together, we can predict the function of disease-associated non-coding variants.
So there are two tools, and these are the addresses for each one of them: RegulomeDB and HaploReg. Starting first with RegulomeDB, this is from Mike Cherry's and Mike Snyder's labs. What's great about this tool is that it takes all sorts of data from ENCODE, Roadmap, and other databases and essentially distills it down to a score for each variant. The score reflects how likely the variant is to alter transcription factor binding. If you want to follow along, you can go to RegulomeDB.org, and this is the web page here. You can enter many different types of data to explore different variants: you can enter dbSNP IDs themselves, you can upload BED files or VCF files, and you can also enter coordinates directly into the box.
So for example, if you want to look at a region, say on chromosome 2, and you want to look at a 10kb region between 20,000 and 30,000, you can just submit that to RegulomeDB, and it's going to search through to find all the variants in the region. The internet seems to be slow. All right, that's okay. We'll just use the slides instead. It's always good to have backup slides.
So if you enter this region into RegulomeDB, you get an output here with 44 different SNPs that are in this 10kb region, and it ranks each one of these SNPs by the RegulomeDB score that we see here. This score is based on the number of different analyses and lines of evidence that intersect that particular SNP. If we click on our top SNP here, we can see that it has a score of 2a, which means it is likely to affect transcription factor binding.
If we scroll back up to the top, we see that there are different categories depending on which lines of evidence overlap each of these variants. Here we have category one. These are variants that have been reported to be eQTLs in different published studies. You can then break those down further: do they overlap a transcription factor binding site from ENCODE? Do they also overlap a motif? If the motif matches the same transcription factor that's found bound in ENCODE, the variant gets a higher ranking than if it's a different motif. We also have DNase footprinting data. A footprint is a small region of DNA that's protected from DNase I, and you can look to see if there's a transcription factor motif there, so it's even stronger evidence that a transcription factor is actually bound at this site. And then you can also look for general regions of DNase hypersensitivity in different cell lines.
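The way these lines of evidence combine into a category can be sketched in a few lines of Python. This is a minimal sketch, not RegulomeDB's own code: the `Evidence` class and `regulome_like_category` function are hypothetical names, and the category definitions follow one reading of the published version 1 scoring table, so they should be checked against the RegulomeDB help page before being relied on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Lines of evidence overlapping one variant (hypothetical container)."""
    eqtl: bool = False               # reported as an eQTL in a published study
    tf_binding: bool = False         # ChIP-seq peak overlaps the variant
    motif: bool = False              # any TF motif overlaps the variant
    matched_motif: bool = False      # motif is for the same TF seen bound
    footprint: bool = False          # any DNase footprint overlaps the variant
    matched_footprint: bool = False  # footprint matches the bound TF's motif
    dnase_peak: bool = False         # DNase hypersensitive region


def regulome_like_category(e: Evidence) -> str:
    """Map evidence to a category; lower categories mean stronger support.

    ASSUMPTION: the rules below are a reading of the v1 scoring table,
    not something taken from the RegulomeDB source.
    """
    # A matched motif or footprint is also "a motif" / "a footprint".
    motif = e.motif or e.matched_motif
    footprint = e.footprint or e.matched_footprint

    full_matched = (e.tf_binding and e.matched_motif
                    and e.matched_footprint and e.dnase_peak)
    full_any = e.tf_binding and motif and footprint and e.dnase_peak
    peak_matched = e.tf_binding and e.matched_motif and e.dnase_peak
    peak_any = e.tf_binding and motif and e.dnase_peak
    matched_only = e.tf_binding and e.matched_motif

    if e.eqtl:
        if full_matched:
            return "1a"
        if full_any:
            return "1b"
        if peak_matched:
            return "1c"
        if peak_any:
            return "1d"
        if matched_only:
            return "1e"
        if e.tf_binding or e.dnase_peak:
            return "1f"
    if full_matched:
        return "2a"
    if full_any:
        return "2b"
    if peak_matched:
        return "2c"
    if peak_any:
        return "3a"
    if matched_only:
        return "3b"
    if e.tf_binding and e.dnase_peak:
        return "4"
    if e.tf_binding or e.dnase_peak:
        return "5"
    return "6"


# Every line of evidence except an eQTL report:
top_snp = Evidence(tf_binding=True, matched_motif=True,
                   matched_footprint=True, dnase_peak=True)
print(regulome_like_category(top_snp))  # -> 2a
```

Ordering the checks from strongest to weakest keeps each rule readable and guarantees a variant lands in the best category it qualifies for.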
So based on different combinations of these lines of evidence, each variant is assigned a score essentially from one to six. The lower the score, the more significant it is and the more likely that this variant is going to affect transcription factor binding.
So this has actually finally loaded, which is nice. This is what you see when you click on one of the variants to learn more. This one here has a score of 2a, which means it's likely to affect binding, and that means it has every line of evidence except for being reported as an eQTL in a study. At the very top here, we have a very small shot of the UCSC Genome Browser with ENCODE-related tracks. We have a track for DNase I hypersensitivity clusters, transcription factor binding annotated with motifs from factorbook.org, and we also have conservation.
When we scroll down, we see everything broken down into specific categories. We have protein binding data from ChIP-seq. Here we have a transcription factor, CEBPB, bound to this region in HeLa cells. For all of these entries, you have the information about the assays, so you have cell type and tissue type information, which is really important when you study these SNPs. And then you also have the reference. In many cases it's from Roadmap or ENCODE, but in cases where the data are from a specific paper, you have the reference, so you can go and see what they did in that particular paper.
Scrolling down, we see there are motifs, and there are two different methods here. One is just using PWM scores. Essentially, you have the matrix that gives you the probability of each base at each position, and this is just searching the sequence for matches. And then you also have results from DNase footprinting, which I talked about earlier, where you have the small region that's protected by the protein itself.
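The PWM search just described can be sketched as a sliding-window scan. This is a minimal sketch under stated assumptions: the four-position matrix, the uniform 0.25 background, the threshold, and the function names are all made up for illustration and are not taken from Factorbook or RegulomeDB. Each window is scored as a log-odds sum against the background, which is a common way to score PWM matches.

```python
import math

# Hypothetical probability matrix for a 4 bp motif (one dict per position).
# It strongly prefers the sequence TGAC; the values are illustrative only.
TOY_PWM = [
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
]
BACKGROUND = 0.25  # assumed uniform base frequency


def log_odds(window, pwm=TOY_PWM, background=BACKGROUND):
    """Sum of log2(P(base | motif position) / P(base | background))."""
    return sum(math.log2(pwm[i][base] / background)
               for i, base in enumerate(window))


def scan(sequence, pwm=TOY_PWM, threshold=5.0):
    """Return (start, window, score) for every window scoring >= threshold.

    The sequence is expected to contain only uppercase A, C, G, T.
    """
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = log_odds(window, pwm)
        if score >= threshold:
            hits.append((start, window, score))
    return hits


for start, window, score in scan("AATGACTT"):
    print(start, window, round(score, 2))
```

With this toy matrix only the exact TGAC window clears the threshold; a single mismatch drops the score by several bits, which is why the position a variant hits within a motif matters so much.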
And here we have DNase footprinting in both HeLa cells and hepatocytes. If we scroll down a bit further, we see chromatin structure. This is primarily DNase-seq data, and there are a couple of FAIRE-seq data sets as well. Here we see that the region is extremely open by DNase-seq in multiple different cell lines, and we also have FAIRE-seq on the bottom.
On the bottom, we have data from histone modifications. Here we essentially have annotations from ChromHMM, which you'll learn about tomorrow. ChromHMM annotates the genome, based on different patterns of histone modifications, into regions such as promoters, enhancers, or repressed regions. So here we have annotations, which you can sort, and you can see this region is predicted to be an enhancer in many cell and tissue types. You have tissue groups here and the specific tissue listed as well, and this is all from Roadmap data. That's essentially a summary, just the very surface of RegulomeDB. We'll get into some more analysis later on with some examples.
But we also have a tool called HaploReg, and this is from Manolis Kellis's lab. What's nice about these tools is that they have similar features, but you can use them together to get a really nice view of what your variants are possibly doing. For HaploReg, the address is through the Broad Institute. I find the easiest way to get there is to Google HaploReg, since the address is a little long. This is the second version. They have a version three in beta right now, which you can explore as well. It's very similar in that you can enter the ID of a specific SNP, a list of different SNPs, or a region of interest. So here we're going to enter the same SNP that we just looked at in RegulomeDB. At least the internet's working now.
So if you notice here, we entered a query for a single SNP, but we have an entire list. This is because HaploReg returns all the other SNPs that are in LD with your queried SNP. Our queried SNP here is in red, but it also reports all the other SNPs that have at least an r-squared value of 0.8 with it. Looking down, we can see a summary of all the different intersections with these SNPs. So it's nice if you have a tag SNP and you're not sure if it's causal: you can see if there are other SNPs in LD with your tag SNP that may be causal.
This is just an overview. For more information, you can click on the SNP itself and get a nice detailed view. You have links to dbSNP, and you also have a nice summary of different sequence features. You have the position on the genome, what the reference allele is, what the alternative allele is, and then the minor allele frequency in different populations, which is from the 1000 Genomes Project. You also have different conservation scores, and then any functional annotation in dbSNP, for example if it's been annotated as a missense mutation or a nonsense mutation. You also have data about the closest annotated gene, with information for both GENCODE genes and RefSeq genes, and for example whether the SNP is upstream, downstream, or within the gene itself.
And then once again, you have these regulatory chromatin states from ChromHMM, done on ENCODE cell types as well as Roadmap tissue types. So you can see here the different predictions, the different states that you'll learn about tomorrow. You have this region as being an enhancer, and an enhancer in these tissue types as well. Finally, you also have DNase data. And then once again, we see a protein from ENCODE ChIP-seq data: we have CEBPB.
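The r-squared cutoff used to pull in LD partners can be illustrated with a small calculation on phased haplotypes. This is a minimal sketch, not HaploReg's pipeline: the SNP IDs, the ten-haplotype panel, and the function names are hypothetical, and a real analysis would use haplotypes from the reference population that matches the study.

```python
def r_squared(alleles_1, alleles_2):
    """r^2 between two biallelic SNPs from phased haplotypes.

    Each argument is a list of 0/1 alleles, one entry per haplotype,
    in the same haplotype order.
    """
    n = len(alleles_1)
    p_a = sum(alleles_1) / n                      # frequency of allele 1 at SNP 1
    p_b = sum(alleles_2) / n                      # frequency of allele 1 at SNP 2
    p_ab = sum(a and b for a, b in zip(alleles_1, alleles_2)) / n
    d = p_ab - p_a * p_b                          # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:                                # one SNP is monomorphic
        return 0.0
    return d * d / denom


def ld_partners(query, panel, threshold=0.8):
    """IDs of SNPs in the panel with r^2 >= threshold to the query SNP."""
    return [snp for snp, alleles in panel.items()
            if snp != query and r_squared(panel[query], alleles) >= threshold]


# Hypothetical phased panel: 10 haplotypes, made-up SNP IDs.
PANEL = {
    "rsQUERY": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    "rsA":     [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],   # always travels with the query
    "rsB":     [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],   # one discordant haplotype
    "rsC":     [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],   # essentially unlinked
}

print(ld_partners("rsQUERY", PANEL))
```

Because the allele frequencies and haplotype structure differ between populations, the same pair of SNPs can pass the threshold in one reference panel and fail it in another.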
And finally, another nice feature of HaploReg is that we have these regulatory motifs as well, but it also gives you actual scores for these motifs. It gives you the log odds for the motif with the reference allele as well as with the alternative allele. So for example, we have these motifs, and we can see that not many of them change very drastically between the two. If we go back to RegulomeDB, we can see that the position this particular variant overlaps isn't a very important position in the motif itself. So it could still affect transcription factor binding, but it doesn't necessarily drastically disrupt the motif that we're interested in, and you can see this reflected in the log-odds scores for the reference allele and the alternative allele.
Another important feature of both of these tools is that you can base your search on the disease rather than on an individual SNP. For HaploReg, you can look through different sets of SNPs that are curated from the NHGRI database. Here it has every single study from the NHGRI database as well as the SNPs reported in those studies, and it also has an umbrella category of all SNPs that belong to a particular disease. So for example, if you want to search by asthma, we see that there are 17 different studies in the database reporting 62 different SNPs. You can search for all 62 of these SNPs together at one time, combining all the studies together.
If we search this time, we see that new blocks come up. What's interesting is that now that we have a collection of SNPs, you can look for enrichment in regulatory elements as well as DNase hypersensitivity regions across different cell and tissue types.
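One simple way to ask whether a SNP set is enriched in a cell type's enhancers is a one-sided binomial test against a background overlap rate. This is a minimal sketch: whether HaploReg computes exactly this is an assumption, the function name is hypothetical, and the overlap count (12) and background rate (5%) are made-up numbers. Only the total of 62 asthma SNPs comes from the example above.

```python
from math import comb


def binomial_enrichment_p(overlapping, total, background_rate):
    """P(X >= overlapping) for X ~ Binomial(total, background_rate).

    overlapping     -- query SNPs that fall in the annotation (e.g. enhancers)
    total           -- number of SNPs in the query set
    background_rate -- fraction of background SNPs that fall in the annotation
    """
    return sum(
        comb(total, k)
        * background_rate ** k
        * (1 - background_rate) ** (total - k)
        for k in range(overlapping, total + 1)
    )


# 62 asthma SNPs from the example; the other two numbers are hypothetical.
p_value = binomial_enrichment_p(overlapping=12, total=62, background_rate=0.05)
print(f"expected overlaps: {62 * 0.05:.1f}, observed: 12, p = {p_value:.2e}")
```

The background rate is what makes the test meaningful: it should come from the set of SNPs that could have been reported by the study, which is why the choice of background set changes the result.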
So for these asthma SNPs, for example, each one of these cell lines is enriched: you're more likely to find these asthma SNPs in these regions than you would control SNPs by chance. And what's nice is you start to see some disease-related cell types and tissue types, such as embryonic lung fibroblasts, small airway epithelial cells, et cetera. So you can start looking for disease-relevant cell and tissue types.
One important aspect of HaploReg is making sure you're running with the correct options, and there are quite a few different options you can look at. One is the LD threshold. This is your r-squared value, and it can range essentially from 0.2 all the way to one. If you choose the not-applicable option, you're just searching for the SNP itself without any LD partners. What's really vital is choosing the correct population for your LD calculation. By default it's set to the European population, but if, for example, you're looking at results from a GWAS in a Chinese population, you want to make sure that you select the Asian population for your LD calculation.
The next options are just for the default display. For example, you can display ENCODE epigenomes or Roadmap epigenomes, but when you click on the SNP itself and look for more details, both of these are displayed. And then finally, for your enhancer enrichment analysis, you can choose different backgrounds as well, which is really important. If you're working with results from one individual GWAS, you want to make sure you match the SNP chip that was used in the study. Here it has Affymetrix chips and Illumina chips as well, but the default is all SNPs in the 1000 Genomes pilot.
We're just going to do a quick example, and then we have some exercises for you to work on. There are three different problem sets if we have time. So this SNP here is in the NHGRI database.
It's associated with inflammatory bowel disease and Crohn's disease, and it has quite a significant p-value in both, 10 to the negative 16 and 10 to the negative 19. These were from a study in 2010 and a study in 2012. This SNP, however, is in a non-coding region of the genome. It's in a gene called SMAD3. There are four different transcript variants of this gene, and the SNP is in the intron of three of them and just upstream of the start site for the fourth. So we don't really know how this contributes to IBD and Crohn's disease, and we want to look at the SNP further.
What we can do is start off with RegulomeDB to see what score RegulomeDB gives this particular SNP. So we just search for it, and we see that the SNP has a score of 2a. So it's very likely to affect transcription factor binding, which is exciting, since it's been found in these GWAS and may be contributing directly to disease. To find out more, we can click on it, and we see in the small genome browser view that there's a lot going on at this locus. We have really high levels of H3K27 acetylation, which we know marks active promoters and active enhancers. We also see that it's in a DNase hypersensitivity cluster and that it overlaps bound transcription factors as well as transcription factor motifs, in green.
As we scroll down, we see that there are a lot of proteins bound here. You can sort by protein and see that there's a lot of action going on here in all different cell types. You'll also notice there's a lot of JUN, a transcription factor that's bound in this region. And as we scroll down further, we can also see that there are a lot of motifs found both by DNase footprinting and by just looking at the sequence itself. We have motifs for BACH1, and we also have motifs for AP-1.
And JUN is a component of AP-1 transcription factors, so it's exciting to see a motif for a transcription factor that's actually bound at that region. As you scroll down again, it's the same story. There's a lot of activity going on. There are a lot of DNase hypersensitivity regions in many different cell types, and the region is also predicted to be an active TSS as well as an enhancer.
So if we continue on, say we now want to go to HaploReg. We're pretty sure that something is going on at this locus, but it's also possible that there are other SNPs in LD with our lead SNP that could be contributing as well, so we want to look at other SNPs that are in LD. We can use HaploReg here. For our options, this particular study was done in a European population, so we can keep that setting, and we can submit our query. Now we can see all the different variants that are in LD with our queried variant, in red. And it looks like our queried variant is the most likely to be the causal variant in this case. This is because it has enhancer marks in eight different cell types, it has DNase in 41 different cell types, and it also has many bound proteins in different cell types.
If we want to look a bit further, once again we know there's a lot going on in this region. But if we scroll down to these motifs here and look at this AP-1 motif, in many cases there's a drastic change in the log odds going from the reference to the alternative allele. So it's possible that this variant is disrupting transcription factor binding of AP-1, and a component of AP-1 is JUN, which we saw was bound in the ENCODE data. So it could be drastically affecting the binding of this transcription factor, and then possibly leading to dysregulation of SMAD3. So just one last part.
This is RegulomeDB's version of curating all the variants in the NHGRI database. Here they list different diseases and phenotypes, which you can click on, and it will list different variants that you can then explore further. It will give the RegulomeDB score for each variant, as well as the SNPs that are in LD with that particular SNP. You can search through this at RegulomeDB.org/GWAS.
So now, I think we have maybe 10 minutes. For those 10 minutes, there are three different exercises, or you can also just explore the databases if you have a particular variant that you're already interested in. The first exercise has a lot to do with LD, the second has a lot to do with looking for enrichment in particular cell and tissue types, and the last one really demonstrates the importance of choosing the correct population when you're doing your analysis. If you have any questions, feel free to ask. I can come around individually as well.