[Opening remarks inaudible.] I'll try and speak very loudly as well. So what we do is we try and let people know about what Ensembl is doing. We also answer help queries, we produce online help and documentation, and we teach lots and lots of workshops like this one. Please interrupt me at any time. If you're ever confused, if anything that I've said is not clear, I would far rather you stop me so that I can go over it and make it clear to everybody than for you to sit there being confused. So do please stop me at any point today. As has been mentioned before, all the materials for today's workshops are available, so you can get a copy of the presentation and also a copy of the hands-on tutorial that we'll be doing.

So Ensembl is a genome browser, like UCSC, and we have lots and lots of features. We have what we call gene builds for about 70 species. For human and mouse, these are the GENCODE gene sets. We have gene trees: every single gene that we have, we compare pairwise to every single other gene. We then produce gene trees and we infer orthologs and paralogs from these trees. We have a regulatory build, which is the focus of today's workshop, and which incorporates data from ENCODE as well as other sources. We have variation displays, and we have a tool called the VEP, which we'll also look at. We allow you to display your own data. We have a tool called BioMart for data export, which we'll have a quick go with. We also have programmatic access via APIs: a Perl API and a REST API. If you're interested in either of those, please come and talk to me.
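As a rough illustration of the REST API mentioned above (the talk itself does not demonstrate it), the following Python sketch builds a request for regulatory features overlapping a region. It assumes the public server at rest.ensembl.org and its `overlap/region` endpoint with `feature=regulatory`; the function names and the example coordinates are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

REST_SERVER = "https://rest.ensembl.org"  # assumed public Ensembl REST server


def build_overlap_url(species, region, feature="regulatory"):
    """Build an overlap/region URL; region is written as 'chrom:start-end'."""
    query = urllib.parse.urlencode({"feature": feature})
    return f"{REST_SERVER}/overlap/region/{species}/{region}?{query}"


def fetch_regulatory_features(species, region):
    """Download the features overlapping a region as JSON (needs network access)."""
    request = urllib.request.Request(
        build_overlap_url(species, region),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Arbitrary example coordinates, not taken from the talk.
    print(build_overlap_url("human", "17:63690000-63710000"))
```

Only the URL-building step runs offline; `fetch_regulatory_features` has to be called explicitly and depends on the server being reachable.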
We are also completely open source, so every single bit of code, every single bit of software is free, is open, and is available for you to use to do whatever you want with.

Regulation is one section of what we do, and what we're trying to do is annotate the genomes with features that play a role in the transcription and regulation of genes. So we're talking about the actual features here: the promoters, the enhancers, the insulators. We're using data such as predicted open or closed chromatin from DNase I sensitivity. We're using transcription factor binding sites, we're using the epigenetic marks, the histone modifications and the DNA methylation, and we're using RNA polymerase binding. You'll recognise that most of these are things that are produced by the ENCODE project for the cell lines that they work with.

So we're predicting these kinds of features, but there are some things that we're not doing. We're not linking them to genes. We show you where the features are and we let you make your own inferences from them. We also don't link to gene expression: you can access the gene expression for all the cell types, you can see where the features are, and again make your own inferences. The reason for this is that we don't have assays saying this enhancer enhances this gene. That kind of data just isn't there on a large scale; there are a few assays out there, but nothing on anything like as large a scale. So what we're saying is: we think there's an enhancer at this particular position, and we'll tell you why we think there's an enhancer at that position. We're leaving it up to you to do the assays that will link these kinds of data together, that will link the enhancer to the gene that it activates and show what binding is involved when it activates particular genes.
So at the moment, the current data we have is ENCODE data for human and for mouse, and we also have data from Roadmap Epigenomics, which we produce this regulatory build with. But we have stuff coming up in future. For human we've got Blueprint coming in. We're getting involved at the early stages with the FAANG project, which is a very similar project but with agricultural species, and also with the zebrafish one, ZENCODE. We're getting in at the early stages, making sure that we're working with the data coordinators for these projects so we can pull in the data as soon as it comes out.

We're working with a subset of cell types, and the reason for this, as you'll see when I start to talk about the build, is that we need certain amounts of data. In order to build our predicted promoters, enhancers and insulators, we need CTCF binding, we need a marker of open chromatin and we need these three histone modifications. If we have this data for a cell type, we will then carry out our regulatory build on that cell type and we will display all data that is known for that cell type. We also have the option to add further data using track hubs, and I will show you how to do that.

So we're taking data from various sources, and we're processing it to predict the positions and the activity of various features in cell types. You can view this in the genome browser and you can access it via BioMart, both of which I will show you how to do. You can also access it via our Perl API, which requires a lot of installation, a lot of set-up and a lot of programming knowledge in order to access. So we're not going to cover that here, but if you want to talk to me about it, please come and see me. So what we're looking at is raw data: we're starting off with our histone modifications and with our transcription factor binding sites.
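The data requirement described above (CTCF binding, a marker of open chromatin and three histone modifications per cell type) amounts to a simple set check. The sketch below is only an illustration: the talk does not name the three histone modifications or the open chromatin assay, so the evidence labels and cell type names are placeholders.

```python
# Placeholder labels: the three histone modifications and the open chromatin
# assay are not named in the talk, so these names are purely illustrative.
REQUIRED_EVIDENCE = frozenset(
    {"CTCF", "open_chromatin", "histone_mark_1", "histone_mark_2", "histone_mark_3"}
)


def missing_evidence(available, required=REQUIRED_EVIDENCE):
    """Return the required evidence types that a cell type does not have."""
    return sorted(set(required) - set(available))


def buildable_cell_types(evidence_by_cell_type, required=REQUIRED_EVIDENCE):
    """Return the cell types that have every required evidence type."""
    return [
        cell_type
        for cell_type, available in evidence_by_cell_type.items()
        if not missing_evidence(available, required)
    ]


# Hypothetical inventory of available data sets per cell type.
example_inventory = {
    "cell_type_A": set(REQUIRED_EVIDENCE) | {"extra_transcription_factor"},
    "cell_type_B": {"CTCF", "open_chromatin"},
}

print(buildable_cell_types(example_inventory))
print(missing_evidence(example_inventory["cell_type_B"]))
```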
So, as Pauline mentioned, we're looking at peaks and the signal that produces those peaks, and what we're doing is processing it. We're starting with something that looks like this and we're looking for regions that look like they might be interesting, where there's quite a lot of stuff going on. What we have is a blind pipeline which says: okay, here's a pattern, here's a group of features that we see clustered together. Can I see that cluster anywhere else in the genome? It looks all over the genome and finds that we often see this cluster coming together, and we also know that two of these are known promoters. If we know that two of the places where we see this cluster are promoters, the rest of them are quite logically also promoters. So this is what our pipeline is doing.

Of course, it's not as simple as saying every time we see this it's a promoter and nothing else is a promoter, because it turns out that this slightly different collection of features is also a promoter. Another slightly different collection is also a promoter. So we don't just say these things are promoters and these things are enhancers, because there are subtle differences between them, and we see that there are different patterns that all lead to the same kind of functionality. This produces what we call a segmentation: regions of the genome that we believe carry out particular functions.

Because this is done in a cell type specific manner, we can then plot the segmentations all on top of each other, and we can say, well, obviously there are differences. These are different cell types, they're carrying out different functions and they'll have different activity. So what we want to take from this is: based on all the different activity that we can see in all these cell types, what actually are the different regions, and what are they doing?
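The labelling idea above (a recurring cluster of marks is called a promoter because some of the places it occurs are already known promoters) can be illustrated with a toy example. This is not the actual pipeline, which the talk does not describe in detail; the state names, segment IDs and the `min_hits` threshold are all hypothetical.

```python
from collections import defaultdict


def promoter_like_states(segments, known_promoters, min_hits=2):
    """Return the states seen at >= min_hits known promoters.

    segments is a list of (segment_id, state) pairs from an unsupervised
    segmentation; known_promoters is a set of segment_ids.
    """
    hits = defaultdict(int)
    for segment_id, state in segments:
        if segment_id in known_promoters:
            hits[state] += 1
    return {state for state, count in hits.items() if count >= min_hits}


def predict_promoters(segments, states):
    """Return every segment carrying one of the promoter-like states."""
    return [segment_id for segment_id, state in segments if state in states]


# Toy data: S1 occurs at two known promoters, S3 at only one.
toy_segments = [("a", "S1"), ("b", "S1"), ("c", "S1"),
                ("d", "S2"), ("e", "S3"), ("f", "S3")]
toy_known = {"a", "b", "e"}

states = promoter_like_states(toy_segments, toy_known)
print(states, predict_promoters(toy_segments, states))
```

Note that more than one state can end up labelled as promoter-like, which matches the point that slightly different collections of marks can indicate the same function.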
So we start with a 200 base pair bin, we have a look at what's in it and we say, well, because we've got some promoter activity, that region is a promoter. And we move along in these 200 base pair bins, calling them as different kinds of activity, as promoters, as enhancers, based on the fact that we have some activity in some cell types. I need to make this animation move more quickly, because I always end up with loads of extra time. Let's just assume that it's all finished.

So now we've got these features across the cell types and we can say these are the regions that are doing these particular activities. We've now defined our boundaries: we've defined exactly where in the genome a promoter is, exactly where in the genome an enhancer is. Of course, this is at 200 base pair resolution, in bins, so we can't give exact boundaries; the actual boundaries seem to differ between cell types because, of course, we're working with things like ChIP-seq data, which have fuzzy edges.

We can then ask, for these features that we've called, are they active or inactive in different cell types? So within these specific boundaries that apply to the whole genome, we can then say, well, this enhancer, this promoter, is active in cell type one, it's inactive in cell type two and cell type three, and it's active in cell type four. We display active features as these coloured blocks, and inactive ones as grey blocks with the hashed lines through them telling you what colour they are.

Through this kind of analysis we've covered quite a significant proportion of the genome. There are about 300 megabases which have been categorised as being some kind of regulatory feature by this analysis. It's not the 80% that ENCODE reports as covered by everything, but what we have sufficient evidence to assign a particular activity to is about 10% of the genome.
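The binning step above can be sketched in a few lines of Python. This is a simplified illustration, not the Ensembl pipeline: it assumes each 200 base pair bin already carries a single label (or `None`), merges adjacent bins with the same label into one feature, and treats a feature as active in a cell type if any of its bins carries that label there. The activity rule and the 0-based, half-open coordinates are assumptions made for the example.

```python
BIN_SIZE = 200  # bin width in base pairs, as stated in the talk


def merge_bins(labels, bin_size=BIN_SIZE):
    """Merge adjacent bins with the same label into (label, start, end) features.

    Coordinates are 0-based and half-open; None means no call for that bin.
    """
    features = []
    for index, label in enumerate(labels):
        if label is None:
            continue
        start = index * bin_size
        end = start + bin_size
        if features and features[-1][0] == label and features[-1][2] == start:
            # Extend the previous feature instead of starting a new one.
            features[-1] = (label, features[-1][1], end)
        else:
            features.append((label, start, end))
    return features


def is_active(feature, cell_type_labels, bin_size=BIN_SIZE):
    """Assumed rule: active if any bin inside the feature has the same label."""
    label, start, end = feature
    return label in cell_type_labels[start // bin_size:end // bin_size]


multi_cell = ["promoter", "promoter", None, "enhancer", "enhancer", "enhancer", "promoter"]
one_cell_type = [None, "promoter", None, None, None, None, None]

features = merge_bins(multi_cell)
print(features)
print([is_active(feature, one_cell_type) for feature in features])
```

For scale, 300 megabases out of a human genome of roughly 3,100 megabases is a little under 10%, which agrees with the figure quoted above.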
So I want to show you how you can see this kind of data in Ensembl and what you can learn about it. We're going to start by looking at the region of a gene, and we'll see the regulatory features, their activity in different cell types and what evidence there is to show this. We'll also have a go at a quick BioMart query, and we'll get the locations and functions of some regulatory features. All of these features I mentioned have an ID that looks like this: ENS telling you it's Ensembl, R telling you it's a regulatory feature, and then an 11 digit number.

So what I'm going to do is go to the Ensembl genome browser, which is at ensembl.org. I'll just point out a few things on this home page before I start showing you some of the regulatory features. What we've got is this blue bar along the top, which is always going to be visible. You've got a search. You've got a link back to the home page. You've also got a login/register here. So one of the things that you can do is set up an account, where you don't have to pay any money and nobody will ever, ever send you any spam email, which is very important when setting up an account anywhere. That means that you can do things like save stuff to your account, share between groups, this kind of thing. We've got some links to what your favourite genomes might be, so you can click through to human or mouse. And there's also a list where you can see all of the species. There are about 70 species there. If you go to this view full list you'll see all of them.

Another thing I want to point out is that it says here what's new in Ensembl Release 80. One of the things that we do in Ensembl is we have a system of regular releases. Whenever we get new data, whenever we get new tools, instead of just putting it out there straight away, we hold on to it and we put everything out in one go, which we call a release. We do these releases every two to three months and they all have a number stuck on them.
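The ID format described above (ENS, then R, then an 11 digit number) is easy to check before pasting a list into a tool. The sketch below assumes exactly that pattern and nothing else; the function name and the example IDs are made up.

```python
import re

# ENS for Ensembl, R for regulatory feature, then exactly 11 digits.
REGULATORY_ID = re.compile(r"ENSR\d{11}")


def is_regulatory_feature_id(identifier):
    """Return True if the string matches the ENSR + 11 digits pattern."""
    return REGULATORY_ID.fullmatch(identifier) is not None


# Made-up examples: one well-formed ID, one gene-style ID, one that is too short.
for candidate in ["ENSR00000000001", "ENSG00000000001", "ENSR123"]:
    print(candidate, is_regulatory_feature_id(candidate))
```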
What this means is that when you get data from Ensembl you can write down in your lab book: got the data from Ensembl Release 80, la la la. Six months later you go back, you're looking through your lab book, you go to the page that you were looking at and you find that the data has changed. And you're thinking, did I write that down wrong or has the data changed? The easy way to find out is to scroll down to the bottom of the page and click on this link that says view in archive site, and you get a pop-up listing previous releases of Ensembl that are available. So you can say, aha, right, I'm going to go to Ensembl Release 80, I'm going to see what I was looking at back then, and I will discover that what I wrote down was actually completely sensible at that time. It's just that the data has changed since then.

So we're going to search for a gene. We've got a search box here where you can put in all kinds of things. You can put gene names, you can put GO terms, you can put phenotypes, you can put coordinates, you can put IDs, which can be Ensembl IDs or RefSeq IDs; all kinds of different things can go into the search box. I'm just going to put in a gene name, LIMD2. It takes me to search results, which I can narrow down if I want to. If I've done a search for something like insulin, I'm going to get loads of results here: I'll get insulin dependent and insulin interacting and all that kind of thing. I might want to narrow it down so that I just see genes and just see human. Then I might get a much shorter list that I could easily search through, rather than trying to handle the thousands of results I might get if I search for a word like insulin. In this case, we've got LIMD2 right at the top of the list. I can jump to the gene or I can jump to the coordinates of the location, and I'm going to click on these coordinates here.
So I'm going to go to the location, and we'll have a look at the region. This is the region in detail page, probably one of the most accessed pages in Ensembl. We have an overview of the chromosome at the top. This red box is us; this is where we're currently looking. We have an overview of our region of interest, which is a megabase surrounding our gene. Again, we have a coloured box showing us where we are. These blocks are all our genes. If you follow the leftmost point of a box down, you'll reach the leftmost point of a label telling you what gene it is. And this is where we currently are. I can scroll around. I'll just reset that back. Or I can switch to a draggy box, and drag out a box around something that I'm interested in. When we're looking at regulatory features we're normally looking at a gene plus the region surrounding it, so what I'm going to do is drag out a bigger box around where I currently am and jump to that region. So I have my gene plus some flank.

So this is my region in detail, my detailed view at the bottom. I've got my genes here. I've got coloured blocks showing me my coding exons, empty blocks showing me non-coding exons, and connecting lines showing me the introns. This is the gene that I searched for, LIMD2. It's negative stranded, which I can tell because this little arrow is pointing that way, in the direction of transcription. It's also shown underneath the blue contig. This gene here is positive stranded: it's above the blue contig. These ones down here are negative stranded. And as I scroll to the bottom I can see this track here, these regulatory features. These are the features I was talking about in the presentation, the blocks with known activity. We also have a legend at the bottom here.
If I hover over the track name, you'll see it just tells me features from the Ens... This is where I completely fail. I've put on a giant mouse pointer so it's easier for you to see, but it makes it really hard for me to aim. We have this link here to find out more about the Ensembl regulatory build, so if you wanted to find out more about what you were looking at you could go to that link there.

But we have these coloured blocks. I can see that this red one here is a promoter, these blue ones are CTCF and this purple one is a transcription factor binding site. The black dashes throughout the features are transcription factor motifs. They're telling us this is a short sequence which is known to bind a particular transcription factor. If I want to find out more about this feature I can click on it; you can click on any feature that you see in the browser, and there's normally a link so you can investigate it more. So we have an ENSR link, because this is a regulatory feature. We also have a list of all of those transcription factor binding motifs, with links to JASPAR matrices and scores as well.

I'm going to jump straight to the stable ID of this promoter and we'll find out a bit more about it and its activity. This is the summary page for the regulatory feature, and you can see a graphical display showing you which cell types it's active in. Here it is in the multi-cell feature, and we can see it's inactive in A549, active in DND41 and so forth. You could count these and note down what they all are, or you could look at this quick summary here: active in 6 out of 18 of the different cell types.
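The "active in 6 out of 18" summary is just a count over per-cell-type calls. A minimal sketch follows, with a made-up activity table; only the A549 and DND41 calls come from the talk, and the other entries are placeholders.

```python
def summarise_activity(activity_by_cell_type):
    """Return the active cell types and a one-line summary string."""
    active = sorted(
        cell_type for cell_type, is_on in activity_by_cell_type.items() if is_on
    )
    summary = f"active in {len(active)} out of {len(activity_by_cell_type)} cell types"
    return active, summary


# A549 and DND41 as described in the talk; the rest are placeholder entries.
example_activity = {
    "A549": False,
    "DND41": True,
    "cell_type_3": True,
    "cell_type_4": False,
}

active_cell_types, summary_text = summarise_activity(example_activity)
print(summary_text, active_cell_types)
```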
At the moment, when we have only 18 cell types, it's pretty easy to just see it visually. But in the future we plan to have more and more cell types, as more and more people produce more and more data, in which case you'll probably want to go with summaries in a lot of cases, unless there's something particular you're looking for.

So we can see the cell types it's active and inactive in. There are two cell types I'm going to look at in a bit more detail: one where it's active, which is these HUVEC cells, and one where it's inactive, the HeLa cells. The way I can find more details is to go to this details by cell type link. What this allows me to do is see more detailed information by cell type, so it's really well named. It's already showing me the HUVEC cells by default, but I want to see more cell types. I have this button here that says select cells, showing one out of 18, so I'm going to click on it and it will give me a list of the different cell types, and I'm going to pick HeLa. Note that I could also do all on or all off, or do a filter, which again is something that will become more and more useful as we introduce more and more cell types. I'll just click anywhere outside the box to close it, and now it will load up with the HeLa cells, where, as we already knew, it's inactive.

It's also showing me the evidence that was used to create these, so the transcription factor binding, the histone modifications and so on, and we list that as evidence. But as we can see from this button here, it's only showing 8 out of the 99 possible types of evidence that we have available.
So if I click on select evidence, it now shows me all the possible types of evidence, which are categorised. I'm not going to go through and select them one by one; I'm just going to turn everything on, and now I save and close. What we can see now are the different kinds of features that are available within this promoter, and we can perhaps see why in one case it was listed as active and in the other inactive. In the HUVEC cells, where we have this active promoter, you can see that we've got open chromatin with this DNase sensitivity, we've got transcription factor binding, we've got some histone modifications that are markers of activity, and we've also got polymerase binding, generally a good sign that this is active. Compare this to our HeLa cells, where it's inactive: we're missing most of that. The only thing we've got is CTCF binding and DNase sensitivity, which seems to correspond to this active CTCF binding site over here and seems to be unrelated to this promoter.

We mentioned peaks and signal. I can also turn on signal if I want to, using this button at the top, so if I prefer to see the maps with squiggly lines I can, and you can see that the graphs support what we can see in these peaks.

If you want to look over a whole region rather than just at one particular feature, you can do so. I'll go back to my location tab. We have this system of tabbing, which makes it easy to jump between different kinds of features. If you're interested in a gene, then start looking at a regulatory feature in the gene and want to look at a location, you can jump back and forth using these tabs, which will appear as you open up different features.
Currently, what we've got displayed in this region view is just the multi-cell regulatory features. These features are just saying this region does that, this region does that; they don't give you any indication of their activity in the cell types. If I want to get that activity, you can change everything that you can see in the browser by clicking on this configure this page menu. There are about two and a half thousand different tracks that are available to you. Obviously most of them are not currently shown, but you can get access to all of them by clicking on configure this page.

We have menus down the side, so if I know I'm interested in variation data I can click on variation on the left hand side and I'll get a big menu listing all the possible tracks which are related to variation. That's one way that I could do things. We also have a find a track box here, so if I just know the name of something and I'm not quite sure what category it's going to be in, or it might be quicker in some cases, I can just put in HeLa, for example, and now I have the option to add the regulatory features for HeLa cells. That's also found in this regulatory feature menu. So I'll add the regulatory features for HeLa and for HUVEC. For some of the tracks, when you click on them you'll get a little pop-up asking you what style you want to put it in. There is an FAQ that describes what all the track styles mean, or you can just turn it on, see what it looks like and change it if it doesn't look how you want it to look, which to my mind is the easiest way to do anything.

So that's the regulatory features. If I also want to see the evidence, that's available underneath the regulatory features in this menu. I've got open chromatin and transcription factor binding sites, and now we can see some more detailed data. Down the side are all the transcription factors that are available, and along the top we
have all the different cell types that are available. If I want to turn on a particular track I can just click on the box, and it will go blue, telling me that I have turned the track on. When I do that, it picks me a track style. There are some tracks that are already picked by default; it's generally assumed that you might be interested in open chromatin and CTCF, so these are on, but they're not actually on until you pick a track style. When you pick something, it then gives you a track style for that cell type. You can also say, well, I want to see all of a particular transcription factor in all of the cell types, so there's a select all option there, and you can do the same with the cell type. So I can say I want to see all for the HeLa cells, which is what I'll do, and the same for the HUVEC cells.

It's picked me a track style; in both cases it's picked me peaks. If I click on the track style options, you'll see that I can have off, which is what I've got for most of my cell types; I can have peaks, so these are the blocks; I can have signal, the squiggly lines; or I can have both squiggly lines and boxes. I'm going to leave it just on peaks for both of them. I can do the same if I go into histones and polymerases. It's exactly the same sort of matrix and works in exactly the same way, so I can select HUVEC cells and HeLa cells. Oops, I didn't manage to hit that. Everything is now on, so now when I close the menu it's going to take a moment to load.

While that's loading... no, I'll show the data first. So what we have now: you can see we've got the promoter in the multi-cell track. The multi-cell track has got this pale blue highlighting down the side; I don't know if you can see it very well. Then when we go into the HUVEC cells we have this darker blue, so you can see everything that's from that cell type because it's got this dark blue band. It's the same with the HeLa cells, which have the green band. One of the things that you can do is you can move
things around. If I want to see how these features match up to these genes over here, what I might want to do is pick a track up, drag it and line it up next to the genes. That's something that I can do, but of course now it's not next to the thing that says HeLa cells. It's still got that green, though, and I can scroll down and go, oh yes, the green, that's the HeLa track that I've put on there. So that's how you can look at the regulatory features in the browser.

I also want to show you track hubs. If I go back to the home page, there's a link here that says Ensembl supports data from external projects through track hubs. I'll click on this and it lists a bunch of track hubs that are available. As Pauline said, these are data that are hosted at the locations that they originally come from. You can see here we've got hub names and descriptions, and it also tells us what species and what genome assembly they're found on. The main Ensembl browser is based on GRCh38, the most recent human genome assembly, but of course we know that a lot of people are still working with the older assembly, GRCh37, and many of the hubs that we have still have not been moved across and are only on GRCh37. Because we know lots of people are working on GRCh37, we have a site dedicated to that, which we'll see in a moment.

So if I click on the ENCODE analysis hub, which has not yet been moved over, it will take me to the dedicated GRCh37 site. You can see, just in the background, I'm now on an Ensembl site, but it says GRCh37 at the top, the URL is grch37.ensembl.org and it's a different shade of blue. Now I'm in the personal data section. It says ENCODE analysis hub, it gives me a link to where the data actually is, and it also has options to save it to my account, to share it and also to get rid of it. So if you find that you've put on a bunch of hubs and everything is really slow, then you might want to consider deleting them from your Ensembl. Obviously you won't delete them from the
world just because it's making things a bit slower for you. But if I go into configure hub, it takes me back to the configure menu that I was in earlier. It does get a bit slow when I've added this kind of thing, because it is quite a lot of data. Now I have the ENCODE analysis hub shown at the top of my menu, and I can go into things like RNA signal and get these data. I've got a matrix display very similar to what I had before, but the difference is that now these boxes have got numbers in them. We have a zero in the top corner that tells me how many tracks I've selected, and a larger number in the bottom corner telling me how many tracks are actually available. If I click on this, it then lists for me all of the tracks that are available, and you can see I can just turn them all on, which I'll go for, in compact, or I could do them one at a time. So now it's telling me I've turned on 12 out of 12. If I save and close... this is just a random region of the genome that it's taken me to, it doesn't really matter. It does slow it down. Aha, so now I can see I've got my signal levels in different tracks, which I can hover over to see their full names and things like that.

I didn't point this out earlier, but it's worth looking out, on any page that you ever go to, for this little "i". This info button looks like a tourist information sign. If I click on it, it opens up a pop-up which takes me through what's on the page. It lists different kinds of features; there are screenshots, there are labels, there are all kinds of different things. In this case there's also a video. There's not always a video, it depends on the page, but this one has a video, and in a moment you will see Denise. I haven't got my sound on so you can't hear her, but she is chatting away and telling you all about what you can see on this page and the different kinds of things to look out for. So it is worth looking out for these little info buttons, because they will take you
through if you're ever confused about what it is that you're looking at.

I'm going to go back to the main Ensembl page now, and I said we were going to have a go at a BioMart query as well. BioMart is a nice little tool for data export. I've got this list of regulatory features, just their IDs, and I'm going to get some data on them. I'll just remind myself of what data I wanted to get. So I'll just copy these. BioMart is found in the top blue bar, and it's a really nice, easy tool to use for data export.

When I go into BioMart, the first thing I need to do is choose what database I'm looking at. If I open up this menu, you can see I can get gene data, I can get variation data, and I can also access Vega, which is manual gene annotation data that is incorporated into the Ensembl genes but which you can look at by itself as well. The data I'm looking at is going to be regulation data, so I will pick Ensembl Regulation 80.

Now I choose my data set. The regulation database is split into different sections in BioMart. We can look at the binding motifs, which are your short sequences that bind particular transcription factors. We've got other regulatory regions, which is data from sources such as FANTOM5. We have the regulatory evidence, which is the actual transcription factor binding, the histone modifications and things like that. We have the features, which is what we're going to look at: the promoters, the enhancers, etc. We have the regulatory segments, which are the data that were used to construct the features. And we also have the microRNA target regions.

So I'm going to go to the features, and we're going to filter our data. What BioMart has currently found for me are all of the regulatory features in the genome. What I want to do is get BioMart to filter that down, so that when I go to my results I only see the ones that I am interested in. So if I
click on filters, I now get a bunch of stuff that I can use to narrow this data down. You can see I've got various region filters, so I can use chromosomes, coordinates, chromosome bands or markers, I can use the pilot regions, and I can say I'm only interested in particular feature types or particular cell types. What I'm going to do is use this regulatory stable ID filter, and I'm just going to paste in my list. This will accept lists that are comma separated, tab separated or carriage return separated. I want to click on the little tick to say that that's my filter. I've narrowed it down: these are the regulatory features that I'm interested in.

If I now go into attributes, it shows me the things that I might want to print in my table about my regulatory features. I've got selected already, by default, chromosome name, start, end and feature type. I'm also going to add regulatory stable ID, so we can see the IDs that we put in in the first place. If I now hit results, I get a table that lists all these features.

One of the things it's doing, which is slightly frustrating, is that it gives me a new line for every possible new bit of data it thinks I could have. In the case of the regulatory features, this means that it gives me a new line for every cell type. Now, I haven't picked any attributes that are based on cell types, so what I can do to get rid of this is to click on unique results only, and now it checks whether each line is the same as any other line or different. So now you can see I've got a different line for each feature.

It's just showing me a preview of the first 10 rows of my query. The reason for this is loading time. If I did a query that got me 100,000 rows as my output, it might take a little bit of time to load. If I were then to discover, whoops, I meant to put in another column there, and I've just spent 10 minutes waiting for something to load, I'm going to be really, really frustrated. So what it does is it shows me a
preview of the first 10 lines, which loads very quickly. I can check which columns I've put in and whether I've missed anything out, and if I have missed anything, or I've added anything I don't need, I can go back to my attributes and change it, and it's all nice and quick. Once I've got everything I want, I can say I want to view all. In this case it's only a short one, so I don't have a bunch of extra stuff, but I can then load it.

I can also download these data. I can export the data to a file, to a compressed file, or to a compressed web file, which will notify me by email. This is again really good if you're doing a massive query. If you download your data as a file, it generates the data and at the same time it's sending it to you, so it needs to maintain a connection with your computer the whole time it's generating the data. If you lose connection for even a moment during that time, it just cuts off the query at that point and sends you what it's got so far, and it doesn't give you any warning that it's done this. If you pick compressed web file, notify by email, it does all of the processing and creates everything within our system. It doesn't need to communicate with you; it only communicates with you when it's finished. It sends you an email and gives you a link to a file that you can download. So if you're doing a big query, that's the way to do it.

You can get these data as HTML, as CSV, as TSV and as Excel. A quick word of warning about downloading any kind of genome stuff in Excel. I don't know if any of you have ever worked with OCT4 or SEPT9, attempted to put those gene names into an Excel spreadsheet, and gone back later to see that you've got a date where you put your gene names. So do make sure, if you put anything into Excel, to tell Excel to read everything as plain text; otherwise you're going to find some mess in there.

So that's BioMart. Just to note, all of these are links, so I could jump here and find out a bit
more about this regulatory feature.

So we've got 25 minutes to have a go at the exercises. As has been mentioned, all the materials are available on the meeting page, and there's a link to the hands-on tutorials, where you will find a document that looks a bit like this. This takes you through the walkthrough that I just gave, so you could repeat that by yourself if you wanted to. It also gets to a certain point, da da da, eventually, where you've got some exercises that you can have a go at by yourself, having a look at some of the features. You can do these exercises, or you can have a look at your own regions of interest. There are some using the browser and some using BioMart; again, it's up to you which ones you choose to do. There are also answers for all the exercises, taking you through how to actually do them and what you should find with them, so if you want to check the answers you can do that as well. Then, when we've had a go at the exercises, we'll have a look at the Variant Effect Predictor.

You can ask me questions about anything that relates to doing these exercises, or anything of interest to you about Ensembl and Ensembl regulation. I will be wandering round, so wave at me if you want my attention, and hopefully all of you won't wave at me at once.
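For the BioMart export described earlier, the effect of "unique results only" can be reproduced on a downloaded TSV file with a short script. This is an illustrative sketch rather than anything BioMart itself provides; the column headers and rows in the example are invented.

```python
import csv
import io


def unique_rows(tsv_text):
    """Parse tab-separated text and drop repeated rows, keeping first-seen order."""
    seen = set()
    rows = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            rows.append(row)
    return rows


# Invented export: the same feature repeated once per cell type.
example_tsv = (
    "Chromosome\tStart\tEnd\tFeature type\tStable ID\n"
    "17\t100\t500\tPromoter\tENSR00000000001\n"
    "17\t100\t500\tPromoter\tENSR00000000001\n"
    "17\t900\t1300\tEnhancer\tENSR00000000002\n"
)

for row in unique_rows(example_tsv):
    print(row)
```

Because the `csv` module returns every value as a string, names such as OCT4 or SEPT9 are left exactly as they appear in the file, which avoids the date conversion problem mentioned for Excel.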