[inaudible] in various different formats, which I thought you could all see earlier. We've got our gene model with our different kinds of consequences, showing where our variants might be falling. In Ensembl we use what are called Sequence Ontology consequence terms, which are a standardised vocabulary of consequences that might result from your variants.
You might also be interested in getting SIFT and PolyPhen scores. These are scores for changes in amino acid sequence, based on how well conserved the protein is and the chemical change in the amino acid. I really think that the people who created SIFT and PolyPhen must not have been talking to each other, because they both came up with scores out of one and gave us completely opposite scales for them. That is very confusing, and I never remember which one is which, which is why I have a slide for it, and also why they helpfully give predictions as well. So the SIFT score is out of one: everything between 0.05 and 1 is fine, the protein is going to work, don't worry about it; everything between 0 and 0.05 is deleterious, it's going to damage your protein, it's not going to work.
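To make the two opposite scales concrete, here is a minimal sketch of turning raw scores into predictions. The SIFT cutoff of 0.05 is the one quoted in the talk. The PolyPhen cutoffs are an assumption: the defaults below are the values I believe current Ensembl documentation uses, but cutoffs have differed between PolyPhen versions, so pass your own if your source says otherwise. The function names are hypothetical and not part of any Ensembl API.

```python
def sift_prediction(score: float) -> str:
    """SIFT: LOW scores are bad. Below 0.05 is deleterious, otherwise tolerated."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("SIFT scores range from 0 to 1")
    return "deleterious" if score < 0.05 else "tolerated"


def polyphen_prediction(
    score: float,
    possibly_cutoff: float = 0.446,  # assumed default, not stated in the talk
    probably_cutoff: float = 0.908,  # assumed default, not stated in the talk
) -> str:
    """PolyPhen: HIGH scores are bad, the opposite direction to SIFT."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("PolyPhen scores range from 0 to 1")
    if score > probably_cutoff:
        return "probably damaging"
    if score > possibly_cutoff:
        return "possibly damaging"
    return "benign"
```

In practice the VEP reports the prediction alongside the score, so a helper like this is mainly useful as a reminder of which way round each scale runs.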
PolyPhen is the opposite way round. They have probably damaging, possibly damaging and benign: scores at the top of the range, towards 1, are probably damaging, and 0.2 to 0 is benign, so it's slightly confusing the way they do things.
You also want to know if your variant affects transcription factor binding. We have our transcription factor binding motifs, and we have these JASPAR matrices showing us how strong our motif is, based on the frequency of the different possible bases within the motif. If we take this T in the middle and switch it for a C, we can see that the score, the strength of this transcription factor binding, has changed, so this is something we also might want to know about our list of variants. We also might want to know if it's known: is this a variant that exists, that has been written about, that has been found by 1000 Genomes, that has been put into dbSNP, that has been identified by the GWAS Catalog and associated with phenotypes, all of these different things.
And the way that you can find this stuff is using the VEP, which stands for Variant Effect Predictor. There's a web interface, which I'm going to show you how to use. There's also a standalone Perl script. The web interface is all pointing and clicking, it's nice and easy. The Perl script does not involve any kind of coding on your part. It is simply a script that you can run. There's extensive documentation on how to run it, telling you exactly what commands to put into your command line to make it do all the things that you need to do. You don't need to look at the code or read the code or whatever, it does that itself. You can also set up caches. Caches are a way to speed up the VEP, so if you've got whole genome variant calling, if you've got millions and millions of variants, that's going to take quite a while to run.
You can speed that up by running it completely within your own system, rather than over the internet, communicating with our database, and you can do this using an offline cache. There are pre-built caches available for Ensembl species, which contain all the data that is needed to run the VEP, or you can make your own. You need a GTF and a FASTA file, which means that you can run it for any species. It doesn't matter if the genome is not in Ensembl. If you, for example, had a T. rex genome with genes annotated on it and a bunch of variants between different T. rexes, wherever you would have got all this data from, you could do that.
So I'm going to show you how it works. I've got a list of five variants, and I'm going to show you how to find out what genes they hit and what effects they have on them, as well as some regulatory features. So I'm just going to select this list. There are different formats of data that you can put into the VEP. This particular format is what we call Ensembl default, which is a simple format that consists of chromosome, start coordinate and end coordinate, which in the case of SNPs are the same thing, then the alleles, the strand, and the name. The strand and the name are optional. If you put nothing for the strand, it will assume it's positive. The name is helpful for tracing it back.
So if I just copy that, I go back to Ensembl, I go back to the home page, and I'm going to go to the Variant Effect Predictor using this link on the home page. I'm just going to go into launch VEP, and because I've done jobs before (this one was from a conference I was at just last week), it gives me this jobs table, which again allows you to save the data to your account and share the data with other people. If you haven't done a job before, you'll go straight to a page that looks like this. I got here by clicking on new VEP job.
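The Ensembl default format described above (chromosome, start, end, alleles, then optional strand and name) is simple enough to check by hand before pasting it in. The following is a minimal sketch of a parser, assuming whitespace-separated fields; the `Variant` type, the function name and the example coordinates are hypothetical and only there for illustration.

```python
from typing import NamedTuple, Optional


class Variant(NamedTuple):
    chrom: str
    start: int
    end: int
    alleles: str          # e.g. "A/G", reference allele first
    strand: str           # "+" when the column is left out
    name: Optional[str]   # optional identifier, handy for tracing results back


def parse_default_line(line: str) -> Variant:
    """Parse one line of Ensembl default format into a Variant."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError("expected at least: chromosome start end alleles")
    chrom, start, end, alleles = fields[:4]
    strand = fields[4] if len(fields) > 4 else "+"   # missing strand means positive
    name = fields[5] if len(fields) > 5 else None
    return Variant(chrom, int(start), int(end), alleles, strand, name)


# Made-up example lines, not the variants used in the demo.
snp = parse_default_line("1 1000 1000 A/G")
named = parse_default_line("2 2500 2500 C/T - my_variant")
```

For a SNP the start and end are equal, as in both example lines, and leaving out the strand gives the positive strand.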
So this is my input form, and the first thing is I'm working in human, which I'm going to leave as it is. You can give your data a name, which is a really useful thing to do, because as I showed you, the jobs table lists your jobs, and a name makes it so much easier to find what's what. So I'm just going to put ENCODE demo, and I'm going to paste in my data. One of the quick tricks that I can do is I've got this button, instant results for first variant. This is a quick check to make sure you put it in the correct format. If you haven't put it in the correct format, it'll give you garbage here. So if you're not sure, this can be a quick check to make sure it's all worked. We do actually have examples of the different possible formats, so you could click on each of these to see what they look like. In the documentation there is also a page that lists all the formats, but this can be a quick way. So I've just pasted the data in, but you also have an option to upload a file from your system or to provide a URL for a file.
So it's going to look at the Ensembl transcripts. You've got these output options. These might be closed when you go to them, but I've just opened up all these menus. So we've got identifiers and frequency data. It's going to get me gene symbols. It's going to find out whether these variants are already known. It's going to get me frequency data. In the extra options, the thing to point out is it's going to get me regulatory region consequences, which is what we're particularly interested in here. There's also filtering available here. If I go into advanced filtering, I can exclude or include variants with a minor allele frequency greater than or less than any number I like in some of the 1000 Genomes populations.
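The frequency filter just described amounts to a threshold test on minor allele frequency. Here is a minimal sketch of the idea, not the VEP's own implementation: the function name and the variant names are hypothetical, and treating variants with no known frequency as rare is a design choice made here for illustration.

```python
from typing import Iterable, List, Optional, Tuple


def filter_by_maf(
    variants: Iterable[Tuple[str, Optional[float]]],
    threshold: float,
    keep_rare: bool = True,
) -> List[str]:
    """Return names of variants on the chosen side of a frequency threshold.

    Each variant is a (name, minor allele frequency) pair. A frequency of
    None means no population frequency is known for that variant.
    """
    kept = []
    for name, maf in variants:
        # Assumption: a variant with no recorded frequency counts as rare.
        is_rare = True if maf is None else maf < threshold
        if is_rare == keep_rare:
            kept.append(name)
    return kept


# Made-up frequencies for illustration only.
example = [("var_a", 0.30), ("var_b", 0.004), ("var_c", None)]
rare = filter_by_maf(example, threshold=0.01)
common = filter_by_maf(example, threshold=0.01, keep_rare=False)
```

Keeping the rare side suits a rare-variant study, and flipping `keep_rare` gives the common-variant view.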
So depending on what kind of study you're doing, whether you're trying to find common variants that contribute towards common diseases, or rare variants, or whatever, you might want to use these filters. We've only got five variants, so I'm going to go with no filtering. And now I'm going to hit run, and you'll see we get back to that little table I had before. It's telling me that my job is queued. It's going to refresh for me every 10 seconds. And now I can see it's there, alongside the job that I did at another conference last week. And now it's done. I can view results.
And I can see the consequences. It's telling me I've done five variants. I didn't do any filtering, so there are still five. One of them is novel, it was not already known. Four of them were known. I've overlapped two genes, seven transcripts, and one regulatory feature. And I've got summaries of the different consequences. You can see that, because all of them hit multiple transcripts, it's listing the variants, their consequences, what genes they've hit, and what kind of effect they have. And the one I want to show you is that we've also hit transcription factor binding sites. So this one is a transcription factor binding site variant, and it gives me the JASPAR ID. If I scroll across, these were my missense variants, which are giving me SIFT and PolyPhen. This column here is motif score change. It's telling me that the score of the motif has changed by minus 0.085. It is a less powerful motif than it was before. It's likely to reduce the amount of transcription factor binding, possibly. We don't know. You would have to then assay that in that particular cell. But there's lots and lots of information that you can find out about the genes and about how the variants affect them. For the ones where we've hit coding sequence and it's changed the coding sequence, we have quite a lot of information about the genes as well.
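To give a feel for where a motif score change like the one above could come from, here is a minimal sketch that scores a sequence against a base frequency matrix and compares the reference and alternate alleles. This is a generic log-odds scoring scheme with a pseudocount and a uniform background, which is an assumption; it is not necessarily how Ensembl computes the value in the motif score change column. The matrix counts and function names are made up for illustration.

```python
import math
from typing import Dict, List

Matrix = List[Dict[str, float]]  # one {"A","C","G","T"} count column per motif position


def motif_score(sequence: str, counts: Matrix,
                pseudocount: float = 1.0, background: float = 0.25) -> float:
    """Sum of per-position log2 odds of the observed base versus background."""
    if len(sequence) != len(counts):
        raise ValueError("sequence must be the same length as the motif")
    score = 0.0
    for base, column in zip(sequence.upper(), counts):
        total = sum(column.values()) + 4 * pseudocount
        frequency = (column[base] + pseudocount) / total
        score += math.log2(frequency / background)
    return score


def motif_score_change(reference: str, alternate: str, counts: Matrix) -> float:
    """Negative values mean the alternate allele matches the motif less well."""
    return motif_score(alternate, counts) - motif_score(reference, counts)


# Hypothetical three-base motif with a strongly preferred T in the middle.
toy_matrix = [
    {"A": 8, "C": 1, "G": 1, "T": 0},
    {"A": 0, "C": 1, "G": 0, "T": 9},
    {"A": 1, "C": 1, "G": 8, "T": 0},
]
change = motif_score_change("ATG", "ACG", toy_matrix)  # swap the middle T for a C
```

Swapping the strongly preferred T for a C gives a negative change, the same direction as the value reported in the demo, although the magnitude here depends entirely on the toy matrix.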
Where are we time-wise? I want to tell you a few bits and bobs before I hand over to Jill, and I think we'll have about five minutes to maybe have a go at using the VEP yourself. I just want to tell you that you can host an Ensembl course. We will come out to your research institute. We will not charge you a fee, although we will expect you to pay our costs. And we can give you browser workshops. If you email me, I am emily@ebi.ac.uk, we can arrange to set this up. Normally we do a day course where we look at all the aspects of the Ensembl browser.
There is absolutely tons of help and documentation available. EBI Train online: I cannot emphasise enough how awesome EBI Train online is. There are loads and loads of free training courses that you can dip into at will, which will take you through different kinds of resources at the EBI as well as general bioinformatics topics. We have loads of online tutorials. We have a YouTube channel. We've got loads of little videos like the one I showed you a clip of earlier, which take you through different pages and different kinds of actions.
We also have a help desk. We are helpdesk@ensembl.org. Anything that you don't understand, anything that you think might be incorrect data, anything like that, you send us an email. We will do what we can to fix it or to help you understand what you're looking at. If you're working with our APIs and things, we also have a dev list, which is a public mailing list. The help desk is private, so if you're working with sensitive data, you might want to email the help desk, whereas dev is public and you can get help from our developers as well as other people who are also using our APIs. The announce mailing list will just tell you about news. You can follow us. We have Facebook. We have Twitter. We have a blog.
The blog's quite nice: when we've put out new data, we'll often have an article describing the rationale behind it, how we went about doing it, things like that. We also produce publications. You'd be amazed at how many people think they don't need to cite databases. You do. These are some of the papers. This is the team at our most recent retreat about a month ago. These are the people who pay for the project as well. Again, we've got names of the team, and we get funding mostly from the Wellcome Trust, but also from a number of other smaller grants, for which we are eternally grateful.
I think we've got five minutes where you could have a quick go at using the VEP or ask me any questions before I have to hand over to Jill. There's an exercise there. There's also a file that you need, which is this one. If you just take the URL of that file (no, I think I put it correctly in the input), you can use it to have a go at using the VEP. Now we've only got about three minutes, but that gives an opportunity for questions. Do you want to ask a question for the group or just to me? I will run over. Maybe we won't do an exercise then.
So for the analysis of variants, do you have to assume that it is on a current region?
You have to assume that it's in what?
On a current region, like a coding on a gene?
No, we don't assume that at all. We're using genomic coordinates. It might fall within a coding region, it might not, and we will give you data about it whether it is coding or not.
Because the scores that you presented were based on amino acid substitution, that's why I thought...
SIFT and PolyPhen scores you can only get for coding regions, but there are other bits of data that you can get for non-coding regions. Something like the motif scores, for example, you would only get for a transcription factor binding motif.
Sorry? I have eight minutes. Oh, excellent. I thought I had less than that.
Any more questions, or would you like to have a quick go? Oh, over here.
How do you know if the motif score change is significant?
I don't know. That's something that we haven't analysed, but I could talk to our guys and see if that's something we could work in.
A general question: what are the major differences between the ENCODE data and Ensembl? What's the major useful difference, or similarity?
The major useful difference between Ensembl and ENCODE? Okay, so we have ENCODE data available and accessible in Ensembl, but we also do processing on top of it. We're processing the data to predict activity from it. So instead of just saying, here's some histones and here's some other stuff, we're saying, okay, because of these histones and all this other stuff, we think this is a promoter, and that's the major difference.
Can you point out things that are available in Ensembl that are not available in UCSC, and vice versa? Maybe not vice versa, just for Ensembl.
Oh, I love that question, don't you?
Like advantages, for example, in Ensembl compared to UCSC?
So a lot of it comes down to personal preference. Most people start using one and get used to it, and whichever one somebody has started using, they always say, well, that's the most intuitive. I've heard both sides, and it really is just which one you use first and which one you get used to. I would say the major thing that we have is programmatic access. That is kind of our big seller, and I believe that the heavy bioinformaticians love us.
Okay, I'll give you five minutes, then, to have a go with the exercises, and then we'll hand over to