Okay, thanks everyone. So I'm a postdoc in Mike Snyder's lab, and today I'm going to talk about variant annotation using ENCODE data, and in particular about RegulomeDB. I should give a disclaimer: I didn't actually build RegulomeDB, but I was a pretty active user for a few years, so I have the user perspective here.

I think everyone's probably pretty clear on the general idea, but let's go over it. ENCODE data can provide a lot of insight into the function of non-coding variants; I think that's a given here. Different people have used it for different sorts of studies. People have used it to look at GWAS variants: the idea is that if you have a particular tag SNP from a GWAS study, there may be a bunch of variants in linkage disequilibrium with it, and you can look through ENCODE data to find which of those variants may be functionally relevant. There have also been a number of studies on other diseases like cancer. I was involved in one of these, where we looked for mutations across cancer whole genomes that were recurrent across patients, and then checked which of them fall in regulatory regions. In that case, too, you want to overlap those regions with ENCODE data to see whether any of them sit in known transcription factor binding sites.

I have a little graphic here on the side. The general idea is: say you have a variant, shown here in red. You want to see what's happening in this region. Is it in a promoter or an enhancer, and in particular, which transcription factors are binding this region? There are a few great tools for looking at this sort of data. One is RegulomeDB; the other is HaploReg. I believe the next talk is on HaploReg, so I won't say much more about that. Let's just go through RegulomeDB. The first author on this paper, Alan Boyle, was the creator of RegulomeDB.
He had some help, as you can see here, but this was his baby. I'll put his email up at the end too; feel free to email me or him if you have any questions about this. And again, the next talk is about HaploReg, and these are very similar databases.

The general idea is that RegulomeDB is a web-based tool that provides a very simple interface for retrieving site-specific regulatory data. If you have a genomic position or region you want to know more about, you can input it through the web interface, and I'll go through the website itself in a minute, and it outputs all the bound regulatory factors and additional information from the ENCODE project.

RegulomeDB has a bunch of different data types, and they're all listed in the columns of this table. It has all the eQTLs, it has transcription factor binding data, mostly ChIP-seq, it has matched transcription factor motif data drawn from various motif data sets, and I'll show those in the next slide. It has DNase footprinting data, it has DNase peaks, et cetera. Alan developed a scoring system for RegulomeDB in which different combinations of evidence contribute to a different score, and generally a lower score means there is more evidence that regulatory factors are present at that site.

There's a lot of data in this database, and this is the most recent update. You can go through this list, but there are many different conditions and cell lines, from ENCODE and also from non-ENCODE sources, and many different data types. It's really meant to be a summary of all of ENCODE. Now I want to show you a quick example of how to use it. Let's pull this up. So here's the website. Can we see it? Let's zoom in. Okay, so it's regulome.stanford.edu. This is the main page.
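The "lower score means more evidence" idea lends itself to simple programmatic prioritization. As a minimal sketch, assuming scores are strings of a digit plus an optional sub-class letter (e.g. "1a", "2b", "6"), you can sort a set of results so the best-supported variants come first; the rsIDs and scores below are made up for illustration.

```python
# Rank variants by RegulomeDB score: lower numbers (and earlier letters)
# mean more regulatory evidence. Example data is purely illustrative.

def score_key(score: str) -> tuple:
    """Turn a score string like '1a' or '6' into a sortable key."""
    number = int(score[0])    # numeric class: 1 (most evidence) .. 6
    letter = score[1:]        # optional sub-class letter, '' if absent
    return (number, letter)

results = {"rs123": "2b", "rs456": "1a", "rs789": "6", "rs101": "1f"}
ranked = sorted(results, key=lambda rsid: score_key(results[rsid]))
print(ranked)  # best-supported variants first
```

This is handy when you have thousands of candidate variants and only want to follow up on the top-scoring handful.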
You can come here and there are some examples. This is where you input the regions of the genome you want to interrogate; they're newline separated, so you can put a bunch of them in at the same time. Each of these little hyperlinks gives you an example, because you can search in different ways: you can use dbSNP IDs, these rsIDs, you can use zero-based coordinates, and you can actually use a BED file, a VCF file, or a GFF3 file. It has all these examples here so you can make sure you get the file format right.

If you choose one of these, or your own input, you can then hit submit and, let's see. If this doesn't work, I have some slides on this. But it should work, because it just did. Okay, let's not do this. That's frustrating. Anyway, that would have been better, but we can just go through the slides.

After you hit submit, you'll come to a page that looks like this. There'll be a row for every one of the coordinates or rsIDs you entered. It tells you what coordinates you put in, the dbSNP ID, and then it gives you the score I described a minute ago, and it links out to UCSC and Ensembl if you want a view in the genome browser. Then if you click on one of these, say a variant scored 2a, you come to a new page that has all the data laid out visually. Here is a genome browser view, and I believe if you click this, it also links out to the UCSC genome browser.

Then if you scroll down, and I'm actually going to click to the next slide, you first see protein binding data. For this one particular site, it shows you all the different experiments where one of these ChIP-seq peaks overlapped. So in this particular experiment in the HeLa-S3 cell type, a POLR2A ChIP-seq experiment, there was an overlapping peak.
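To make the newline-separated input concrete, here is a small sketch that writes a query file mixing the two simplest input styles. The rsID and the "chrom:start..end" coordinate syntax shown are assumptions for illustration; check the examples linked on the site itself for the exact formats it accepts.

```python
# Build a newline-separated query file for the RegulomeDB web form.
# RegulomeDB accepts dbSNP rsIDs or zero-based coordinates (it also
# takes BED, VCF, and GFF3 files). Both entries below, and the
# "chrom:start..end" syntax, are illustrative assumptions.

queries = [
    "rs1234567",               # a dbSNP rsID (made up for this example)
    "chr11:5246919..5246920",  # a zero-based single-base interval (assumed syntax)
]

with open("regulomedb_input.txt", "w") as fh:
    fh.write("\n".join(queries) + "\n")
```

You would then paste the file's contents into the text box, or upload the file directly.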
It also links out to the reference for the actual data that was used to make that call. So it has this protein binding data, and it has motif data, from TRANSFAC, JASPAR, et cetera: the motifs plus where they came from, including the different methods that were used, the footprinting, and so on. There's also chromatin structure data, which would be things like DNase-seq, and there's histone modification data, et cetera. All the data types I showed you on that slide a minute ago are here in table format.

I should also mention, let's go back here to the summary page: when you've entered all the coordinates you want and hit submit, you get to this page, and you can hit download to get the full output. So say you have 10,000 of these things you want to interrogate at the same time; you can download a file that looks like this one. It has all that data in a text file, so you can parse it on your own if you're so inclined, and that's actually very helpful.

And if you have more than maybe 10,000 or so, the website will break at a certain number; if you put too many into the browser, it just doesn't like it. If that happens, we built a little program to scrape data off of RegulomeDB, and it's hosted on Alan's GitHub page, so you can see it here. And with that, yeah, I'll take any questions. My email address is cmailtonnetstanford.edu, and any user kinds of questions you can ask me; I'm happy to answer. Any questions about how it was built or the data that are in there, you should ask Alan, and this is his email address here.

Yeah, so I'm also a user of both, sorry, HaploReg and RegulomeDB. And I wanted to ask: if I'm not wrong, the ENCODE data that it's mining is from 2012.
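If you do download the full output for parsing, a few lines of Python are enough to group variants by score. The column layout assumed below (tab-delimited: chromosome, position, rsID, supporting evidence, score) is a guess for illustration; adjust the indices to match the header of your own download.

```python
# Group variants from a RegulomeDB bulk download by score, e.g. to pull
# out everything in the top-scoring classes for follow-up. The assumed
# layout is tab-delimited: chrom, position, rsID, evidence, score.
import csv
from collections import defaultdict

def variants_by_score(path):
    by_score = defaultdict(list)
    with open(path) as fh:
        for row in csv.reader(fh, delimiter="\t"):
            if not row or row[0].startswith("#"):  # skip header/comment lines
                continue
            chrom, pos, rsid, _evidence, score = row[:5]
            by_score[score].append((chrom, int(pos), rsid))
    return by_score
```

For example, `variants_by_score("download.tsv")["2a"]` would give you every variant the database scored 2a, under the layout assumed above.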
Is there any plan for it to be more up to date? So yeah, Alan does have plans. I think Mike might know more, but he had to take off. He did update it, I know not this last fall but the fall before, and I think he updated it this last year as well, so he is updating it, yes. And they wrote a grant to continue working on it and continue updating it. So probably he'll continue regardless of whether that gets funded, but you should ask him.

And a second question: these scripts that you mentioned just now at the end, are they a programmatic way to do the same search as you can do on the web interface, or just to download files? It's the first one. It basically mimics being you and going in: it parses your input file of maybe a million, or tens of thousands, of different newline-delimited searches you want to do, and puts them in batches of, I don't remember how I did it, maybe a thousand or so at a time, maybe a hundred, whatever didn't bog down the website too much. It goes through them one batch at a time, downloads the full output like I was showing you here, and just does that repetitively, one after another after another. And if the script that's up there doesn't do that, I have one that does; it was a while ago that we put it up, so you can email me if it doesn't do what I just said.
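The batching idea behind that scraper can be sketched in a few lines: split a huge query list into modest chunks and submit them one at a time with a pause, so the web service isn't overloaded. `submit_batch` here is a placeholder, not the real submission code; the actual script on Alan's GitHub page handles the HTTP form submission and result download.

```python
# Sketch of batched submission: chunk a big query list and process the
# chunks one at a time, pausing between them to be polite to the server.
import time

def chunks(items, size):
    """Yield successive fixed-size slices of a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def submit_batch(batch):
    """Placeholder: the real scraper POSTs the batch to RegulomeDB
    and returns the parsed full output for those queries."""
    return []

def run_all(queries, batch_size=1000, pause=1.0):
    results = []
    for batch in chunks(queries, batch_size):
        results.extend(submit_batch(batch))
        time.sleep(pause)  # throttle so the website isn't bogged down
    return results
```

The batch size of a thousand echoes what the speaker describes; in practice you would tune it to whatever the site tolerates.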