things in the top here and I can't get to the tab I want. There we go. No, no, this works. Now can you see it? Yeah, yes, we can see your IO page. Okay, sweet. So if you click on that link that I just clicked on, you'll come here, and what we'll be using is this link right here. These are the slides I'll be using, and if you want to follow along you can: it's still launching, but if you've got your Orchestra session going, you'll have an RStudio that you can use.

So my name is Jim MacDonald. I am a member of Bioconductor, and I have a lot of experience with a lot of the annotation packages, and that's what we're going to be talking about today. This is a beginner course, so really the expectation is that you have a basic understanding of R syntax and a basic understanding of the NCBI or EBI resources that we'll be using. So just the basic idea of how you annotate things, or understanding what that is, is the expectation. We'll be using all these Bioconductor packages, and when you start your Orchestra RStudio app these will all automatically be loaded, so you won't have to worry about that.

So the idea, what we're trying to do here, is to understand what sort of annotation data are available in Bioconductor, understand the difference between some of the annotation sources, and get a little bit of practice on how to query these annotation packages, which is a little bit complicated for some of these things. So here are the learning objectives: basically, be able to use a bunch of these things.

So what do we mean by annotation? The basic idea, here on the left, is a sort of functional annotation. You've got the central ID, which probably doesn't mean anything to you. Maybe it's just an ID, something like Ensembl 0 0 3 6 2 5.
Not very meaningful, and you might want to be able to say, okay, I've got this ID and I want to know something about it: what's the gene ID, or the gene symbol, or the protein ID, or the gene name? So we have this sort of mapping from an ID to these other things, but we can also do the other mapping, which is not an uncommon thing. I had a recent data set that I was working on where all of the genes I had were gene symbols, so in order to do anything with those data, the first thing I had to do was map those gene symbols to an ID, and then that ID onto other annotations. So you can have this bidirectional annotation going on.

We also have positional annotation, where again we may have this ID, but then we may want to know something about where it is in the genome: what chromosome is it on, what transcripts does it have, what's the promoter sequence, a lot of things that are genomic based. And we may need to do both of those, depending on the analysis we're doing.

So the specific goal that we have is usually this: we've got some data, and usually the way it is is you've got rows that are genes and columns that are samples, right? Then we do some statistics, and each row in this blue statistics block is a t-statistic and a p-value or whatever for each one of these rows of data. So for every gene we've got a t-statistic and a p-value saying whether it's different between these groups, and then over here we may have some annotation information where we say, for that gene, here's the gene symbol, here's where it is, here's the GO ID, whatever.
So we've got this sort of mapping from the data to the statistics to the annotation information, and we could do this with three data frames, right? The only issue with that is that if you subset your data, or do anything with your data, you have to remember to do the same thing with your statistics and with the annotation information to keep everything lined up. Another thing we could do is put it in a data container like an ExpressionSet, where the experiment data is the first block over here, where you've got again samples in columns and genes in rows, and then we've got another data frame that's got the annotation data, and then another thing that's got the sample data. So for each column we've got some information about the sample: whether it was treated or control, or the sex or the age or the weight, or whatever kind of information we've got on that sample. And if we put it into something like an ExpressionSet, we can have it all encapsulated.

So if you have this running in your web browser, you can actually click here to copy things and then paste them into your RStudio if you want to go along with me. If we load this little fake data set that I've got in the package, it'll load this ExpressionSet, and then if you just type the name of the ExpressionSet, it'll spit out all this information about it. This ExpressionSet's actually got 33,000 rows in it, and instead of doing what a data frame would do, which is just keep churning out row after row after row, it just tells you a bunch of stuff about it.
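As a rough illustration of the bookkeeping problem and the container fix just described, here is a minimal sketch using the ExpressionSet class from Biobase. The toy matrix, the gene and sample names, and the symbols are all made up for the example; this is not the workshop's data set.

```r
library(Biobase)

# Toy data (hypothetical): 4 genes in rows, 6 samples in columns
set.seed(1)
expr <- matrix(rnorm(24), nrow = 4,
               dimnames = list(paste0("gene", 1:4), paste0("sample", 1:6)))
stats <- data.frame(t = rnorm(4), p = runif(4), row.names = rownames(expr))
annot <- data.frame(SYMBOL = c("AAA", "BBB", "CCC", "DDD"),
                    row.names = rownames(expr))
samples <- data.frame(treatment = rep(c("siRNA", "scramble"), each = 3),
                      row.names = colnames(expr))

# Separate objects: reordering one of them silently breaks the alignment
# with the others (stats and annot are still in the old order)
expr_reordered <- expr[c("gene3", "gene1"), ]
misaligned <- !identical(rownames(expr_reordered), rownames(annot)[1:2])

# One container: data, sample info, and gene annotation travel together
eset <- ExpressionSet(assayData = expr,
                      phenoData = AnnotatedDataFrame(samples),
                      featureData = AnnotatedDataFrame(annot))
eset   # prints a compact summary rather than every row

# A single subsetting call reorders and trims all three pieces at once
sub <- eset[c("gene3", "gene1"), 4:6]
```

After the last line, `fData(sub)` and `pData(sub)` are already in step with the subsetted matrix, which is the point of using the container.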
So we've got 33,000 genes and six samples. The phenoData tells you things about the phenotypic data, the featureData tells you things about the genes, and you can put in data about the protocol, about how you treated the samples, and things like that. So it gives you this nice summary that tells you everything about the data, and you can extract information from it. For example, the exprs accessor will get out the raw data, which is this chunk right here. So these are the values for each one of the genes that we measured; genes are in rows and samples are in columns. The phenoData tells us things about the samples, so we know that these were treated with these silencing RNAs and these with scramble, so we know what the two groups are that we compared. And then the featureData tells us about all the genes.

Now the only reason I tell you about these things is that it's a lot easier if you're doing analysis, because a lot of the Bioconductor tools expect this as the input. So if you're using limma to do linear regression or something, and you use this as input, then your output tables would have all this information in them already, and you wouldn't have to do this sort of chasing around of your annotation and the other data that's appended to your actual raw data.
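The accessors mentioned here can be sketched on a small stand-in object. The ExpressionSet below is hypothetical (three made-up genes, four made-up samples); only the accessor functions `exprs()`, `pData()`, and `fData()` come from Biobase.

```r
library(Biobase)

# Toy ExpressionSet (hypothetical values, not the workshop data)
expr <- matrix(1:12 / 2, nrow = 3,
               dimnames = list(c("g1", "g2", "g3"), paste0("s", 1:4)))
eset <- ExpressionSet(
  assayData = expr,
  phenoData = AnnotatedDataFrame(
    data.frame(treatment = c("siRNA", "siRNA", "scramble", "scramble"),
               row.names = colnames(expr))),
  featureData = AnnotatedDataFrame(
    data.frame(SYMBOL = c("AAA", "BBB", "CCC"),
               row.names = rownames(expr))))

head(exprs(eset))   # measured values: genes in rows, samples in columns
pData(eset)         # one row per sample (the phenotypic data)
fData(eset)         # one row per gene (the feature data)
eset$treatment      # `$` reaches straight into the sample data
```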
So the nice thing about these BioC containers is you get some validity checking, and you can subset these things just as if they were a data frame. So if you say, well, I just want the first couple of rows, it will subset all the data in such a way that you don't get anything mismatched. You can run functions on these objects just as if they're a data frame, and a lot of the data will come along with it, and you get some automatic behaviors with them. The downside is they're a little bit hard to create, you wouldn't want to do it by hand, and they're only useful within R. You can't give one of these things to a collaborator; you have to actually output the data.

But anyway, back to annotation. We have several different types of annotation sources. The first type is sort of left over from the microarray days, where you could get a bunch of data based on just a chip, like an Affymetrix chip, and so we have some ChipDb packages that have information about all the things that are on that Affymetrix chip. The next level of abstraction is the organism level, and we've got basically four different types of things for that. We've got an OrgDb, which has functional information like the gene ID, the GO ID, the KEGG ID, things like that. We have these positional data packages that tell you things like how many exons the gene has, where those exons are, what chromosome they're on, that sort of thing. An OrganismDb is a combination of a couple of these, where you can do queries between the databases, so it's a higher abstraction and it's a little bit easier to do certain queries. The BSgenome packages are sequence data, so if you get to the point where you need to know, well, what's the promoter sequence of the BRCA1 gene, you can use these to get that promoter sequence. We've got a couple of other ones: GO.db does gene ontology mappings; KEGG.db, actually, I should probably get rid of that, we don't use that anymore. And then we've got a couple
of online resources, AnnotationHub and biomaRt, which are things that allow you to query data from online sources, and there tends to be a lot more data there. We'll get to all these things.

So the main way you interact with any of these annotation packages is using a function called select, and the form of select is four arguments: the package that you're going to query; the keys that you want to query, which are the IDs that you have; the columns, which are the things that you want to get out of it; and the keytype, which tells the annotation package what type of key those keys are. And for select, if you are using what's known as the central key, you don't have to actually say what it is, you can leave it unspecified, but you can always specify it.

So as a simple example, let's say we just sort of randomly select five IDs from this Affymetrix array, and we want to get the symbol for all those genes. So in select we give the name of the database that we're going to query, the IDs, which are these five numbers, and then we say we want SYMBOL. It'll tell you that it returned a one-to-one mapping between the keys and the columns, and it gives you this data frame that's in the exact same order that you gave to it: this is the first, second, third, fourth, fifth thing. So it's nice: if you have an array, or whatever, where you've got ordered rows of genes, then if you use select you get back the data in the same order, so you don't have to worry about whether things got mixed up, right? You don't want to have gene 1 annotated with gene 32, and the nice thing about select is that it returns everything in the correct order. And you can see here it just spit out this data frame, and you've got the original ID that we gave and then the symbol that we asked for.

So, a couple of questions. First, how do you even know what the central keys are?
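The four-argument form of `select()` described above can be sketched as follows. This uses org.Hs.eg.db rather than the Affymetrix chip package from the talk, and the two Entrez Gene IDs are chosen only for illustration; both substitutions are assumptions of the sketch.

```r
library(AnnotationDbi)
library(org.Hs.eg.db)

# Keys must be character, even when they look numeric
ids <- c("5468", "1000")

res <- select(org.Hs.eg.db,
              keys    = ids,         # the IDs you have
              columns = "SYMBOL",    # what you want back
              keytype = "ENTREZID")  # what kind of ID the keys are
res
```

Because `ENTREZID` is the central key for an OrgDb package, the `keytype` argument could be left out here; spelling it out just makes the call easier to read. For a one-to-one mapping like this, the rows come back in the same order as `ids`.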
So if you decide you want to be with the cool kids and you don't want to specify what the key was, like here, where I just said SYMBOL and didn't give that last argument: if it's a ChipDb, the keys are the manufacturer's probe IDs, and since this was an Affymetrix ChipDb and I was using the probe IDs, I knew that would work fine. Sometimes it's in the name, so if it's org.Hs.eg.db, the eg stands for Entrez Gene, which is what NCBI used to call their gene IDs; now they're just called Gene IDs. Also, you can just do things like head of keys of the name of the annotation package, and that will give you an example of the keys for that annotation package, and you can sort of infer from that. But it's not even necessary to do that: you can always just say it's a probe ID or an Entrez Gene ID or whatever, and it'll give you the right data.

So, more questions, and these are sort of more pertinent: what can you even get out of this thing? How do you know what you can get out of a particular package? If we use this function keytypes on this ChipDb, it tells us all of the different keys that we can query on. Most of these are easily identifiable: Entrez Gene ID, Ensembl ID, Ensembl protein ID, Ensembl transcript ID, symbol, so it's pretty easy to infer these things. Some of the others are maybe not as easy to figure out, like alias; we'll get to that in just a second. And then there's a function called columns that will tell you all of the things that you can get out of the database. It just so happens that the columns and the keys for this particular package are the exact same thing, so you can map either way: you can do accession number to GO term, or GO term to accession number.

So that's one example with select. The issue with select is if you have a one-to-many mapping. So if we take these three Affymetrix IDs and we use the select function again, we use the IDs and then we
say we want the symbol and the map, where the map is sort of the karyotype position, and we get this thing where it says it returned a one-to-many mapping between keys and columns. So you can see here we have three IDs, and now we've got nine things that it spit back out. For this first one we just get one thing: it's TRAF6, and this is where it is. For the next seven of these things it's all the same ID, and it's the same gene family, this DDX11L family, but they're in different places: one's on the p arm of chromosome 1, one's on the q arm of chromosome 2, some are in the pseudo-autosomal region. So it's sort of a problem if we want to have this thing where, for each row, we have all of the data. Remember, with our data we've got samples in columns and genes in rows; if we want to keep the annotation in this row-wise fashion, we now have like seven extra rows here for this DDX11L gene.

To get around that there's a function called mapIds, and it gives you control over these duplicates. It's got the same arguments as select, only you have to tell it what the keytype is. So if we did the same thing as we did above, but said we just want the symbol and tell it that it's the probe ID, with the same IDs as we used above, now we just get a named vector where we've got the probe IDs as the names and the genes as the results. And you'll notice here, for this one, 16657436, which is the one up here that had all those different symbols, we're only getting one symbol, and it's just the very first one. So the default for mapIds is to give you the very first one, which may or may not be what you want. There are choices for multiVals for how you can deal with any multiple mappings: we can use list, CharacterList, filter, asNA, or a user-specified function. I've actually never used a user-specified function, but you could do it, I guess. So if we do the same thing again and we say we want a list, then we get
sort of what we'd expect: it's a named list with the IDs as the names, and then any duplicated symbols. So we end up with this list where we've got all the duplicate symbols for this one probe as just one value in the list. You can also do a CharacterList, and we'll see later on in the presentation how a CharacterList allows you to keep the sort of one-row structure with multiple things. If we did mapIds with CharacterList, we get this weird-looking thing where it's got all of the repeated symbols. If we use filter, it basically just says, anything that's got multiples I'm just going to get rid of, so instead of having three things returned we only get two. And we can also say asNA, which does what you could argue is maybe the right thing to do, which is to say it's not available, because we don't really know what gene this probe actually measures. I mean, it could measure any one of like seven different things, right, or six different things. So you could say, well, if it's able to measure six different things, that's the same as not being able to measure anything, because I don't know what this measures. So you could use asNA to just say, look, I don't even know what this thing's doing.

So, all seem pretty clear? Here's a couple of questions for you guys, to try to see how well this is sinking in. Using either the chip package that I used above or the org.Hs.eg.db package, what's the gene symbol for Entrez Gene 1000? Actually, I'm going to stop sharing that for a second, and I'm going to share screen one. So you can still see this, right? You see an Emacs screen now? Yes? Okay. So what's the gene symbol corresponding to Entrez Gene ID 1000? We want to select from this package, and it's Entrez Gene 1000. That's another thing I didn't tell you: even though it's numeric, it's got to be a character. And I say I want the symbol. So the gene symbol for Entrez Gene 1000 is CDH2. Pretty straightforward, right?
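The discovery functions and the `multiVals` choices walked through above can be sketched like this. To keep the example independent of the chip package, it uses org.Hs.eg.db with the `ALIAS` column as the one-to-many mapping; that substitution and the two Entrez Gene IDs are assumptions of the sketch, not what was on the slides.

```r
library(AnnotationDbi)
library(org.Hs.eg.db)

head(keys(org.Hs.eg.db))   # peek at the central keys (Entrez Gene IDs)
keytypes(org.Hs.eg.db)     # what you can query on
columns(org.Hs.eg.db)      # what you can get back

ids <- c("1000", "5468")

# select() returns one row per match, so one-to-many mappings add rows
long <- select(org.Hs.eg.db, keys = ids, columns = "ALIAS",
               keytype = "ENTREZID")

# mapIds() returns at most one element per key; multiVals decides how
first   <- mapIds(org.Hs.eg.db, ids, "ALIAS", "ENTREZID")  # default: first match
as_list <- mapIds(org.Hs.eg.db, ids, "ALIAS", "ENTREZID", multiVals = "list")
as_cl   <- mapIds(org.Hs.eg.db, ids, "ALIAS", "ENTREZID",
                  multiVals = "CharacterList")
kept    <- mapIds(org.Hs.eg.db, ids, "ALIAS", "ENTREZID",
                  multiVals = "filter")   # keys with multiple matches dropped
as_na   <- mapIds(org.Hs.eg.db, ids, "ALIAS", "ENTREZID",
                  multiVals = "asNA")     # keys with multiple matches become NA

# The exercise from the talk: symbol for Entrez Gene ID 1000
sym <- mapIds(org.Hs.eg.db, "1000", "SYMBOL", "ENTREZID")
sym
```

Which `multiVals` setting is appropriate depends on the analysis: `"first"` keeps the row structure but hides ambiguity, while `"asNA"` or `"filter"` make the ambiguity explicit.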
Well, what's the Ensembl gene ID for PPARG? This is actually a shortened workshop, so I'm going to just keep going; these are a couple of things that you guys can try to see if you're picking up on this stuff.

So the TxDb packages and the EnsDb packages are, as I said before, positional data, and what a package contains you can pretty much infer from the package name. TxDb means it's a transcript database, then it's the species, the source, the build, and the table. So TxDb.Hsapiens.UCSC.hg19.knownGene is Homo sapiens, it's from UCSC, it's hg19, which is their version of GRCh37, and it's from their knownGene table. Same thing with EnsDb packages: it's an EnsDb, it's Rattus norvegicus, and it's Ensembl version 79.

These work very similarly to how the OrgDb and ChipDb packages work, in that you can use select on them and you can use mapIds on them. So you could do something like select with 1 and 10 as the keys, and GENEID as the keytype, because these are gene IDs, and say we want to get the transcript start and end and the chromosome, and it'll spit out this data frame that's got the gene ID, the transcript ID, the chromosome, the start, and the end. The same thing can be done with an EnsDb, but that's not really what we would normally do with them.

The nice thing about these is that, instead of asking what we did before, where we said here are these two genes, give me the information on those, we can do the opposite and say, just give me all the information. So here we say we want all the genes from the TxDb Homo sapiens knownGene package, and if you run that, it'll spit out this thing called a GRanges that tells you all this information. So Entrez Gene number 1 is on chromosome 19, from this position to that position, on the negative strand, right? There's 23,000 of these things here, and it gives you this truncated output, and you could then subset this
on whichever genes you wanted. You can also get a GRangesList by doing something like this. If we say we want transcriptsBy, inferred in this is that we want it by something, and the default is to do it by gene. So what we're saying is, I want all the transcripts for every single gene, and it'll make this thing called a GRangesList, where this is gene number 1, and there's two transcripts, and this is where they are, on chromosome 19 on the negative strand, and here's their transcript information. Doing anything more with this is sort of beyond the scope of this workshop, but I had to show you this part because it is very useful for doing annotations, for subsetting your data, and for getting chunks of the data that are useful for you. You can get transcripts, genes, coding sequences, promoters, exons, and you can get them all grouped by element, so we can do transcriptsBy and get things grouped by gene, exon, or CDS.

So why do we even want these things? The reason is that it allows us to subset things based on genomic position. Remember, I was talking earlier about how you may want to know something about things that are in a gene region. You may have some evidence that this particular region of the genome is highly differentially methylated between two different groups. Well, the next thing you might want to know is what genes are even there, what promoters are in that region, what's the closest gene to that region. So you need to be able to subset data based on position, and also get information about what's in that region.

One thing you can do is subset things using this %over% function. Up here, what we did is we got this transcripts object that's got all of these different transcripts for all these different genes, and we also got the genes, right? So we've got two genes here, gene 1 and gene 10.
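The TxDb calls described so far, including the overlap-based subsetting that was just introduced, can be sketched roughly as below. It assumes the TxDb.Hsapiens.UCSC.hg19.knownGene package is installed; the object names (`txdb`, `gns`, `txs`, `hits`) are chosen for the example and may not match the slides.

```r
library(AnnotationDbi)
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

# select() works here too: gene IDs in, transcript-level columns out
tx_tab <- select(txdb, keys = c("1", "10"),
                 columns = c("TXNAME", "TXCHROM", "TXSTART", "TXEND"),
                 keytype = "GENEID")

gns <- genes(txdb)                       # GRanges: one range per gene
txs <- transcriptsBy(txdb, by = "gene")  # GRangesList: transcripts grouped by gene

# Keep the per-gene transcript groups that overlap the ranges of genes 1 and 10;
# any other gene sharing those regions comes along as well
hits <- txs[txs %over% gns[c("1", "10")]]
names(hits)
```

Subsetting by name (`gns[c("1", "10")]`) is used here instead of by position so the example does not depend on the order in which the genes happen to be stored.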
And we can subset this thing based on the position of those first two genes. If we do this thing right here, we're just using the regular bracket, like what you'd use with a data frame. So we're taking that transcript GRangesList and we're subsetting it to the things that overlap those first two genes. We get what you would expect, gene 1 and gene 10, because those are the first two genes in that list, but we also get this other gene that happens to be in that same genomic region. We wouldn't necessarily have known that without doing this, right? So that's just a nice way of being able to say, well, what's in this region?

The use cases for this are things like looking at gene expression changes near differentially methylated CpG islands. So you could do something where you've got methylation data and you know, okay, this region of the genome has got differential methylation: what genes are in that region? Maybe you've got gene expression data: what genes are in that region, within like a megabase or whatever, and which of those are differentially expressed? You can look for genes that are near hypersensitivity clusters, you can look at the number of CpGs; there's lots of use cases for this, where you've got positional information and also genomic information and you want to be able to subset the one based on the other.

Most of the data that's got this positional information would go into something called a SummarizedExperiment, which is similar to the ExpressionSet: we've got information about the samples in one data frame, we've got the experiment data in the same format, with the samples in the columns and the genes in the rows, and then our annotation data would include one of these GRanges objects, and we can subset based on that. So if we said, I know that there's differential methylation in this region, and I've got this SummarizedExperiment
that's got information on the genes that are in that region, then you can just subset out those genes and look at them separately.

So, for the transcripts: how many transcripts does PPARG have according to UCSC? And then the next one is, does Ensembl agree? This is sort of an interesting thing, and something that people don't necessarily expect, so let's take a look at it. We already know PPARG has got this Ensembl gene ID, right, and we can get the Entrez Gene ID for that as well. Then if we get the transcriptsBy and subset that for this gene, 5468, which we know is PPARG, we see that there's 10 transcripts according to UCSC, and this is where they are: they're all on chromosome 3, all pretty close to each other, right? But if we do, all right, sorry, you're loaded, so we'll just use this version 79. So if we look at Ensembl, and remember UCSC says there's like 10 transcripts for PPARG, Ensembl says there's 14 of them, of which one is this retained intron and two are nonsense-mediated decay, so those are a different biotype. But still, there's this sort of fundamental disagreement between UCSC and NCBI and Ensembl as to how many transcripts PPARG has.

So one of the things that people often try to do is this mapping where they've got Ensembl IDs and they want to map them to NCBI IDs, and that can be problematic, because you can see right there that there are fundamental disagreements, for a lot of genes, between the two annotation services as to how many transcripts there are, where they are, and what type of transcript they are. So it's usually not in your best interest to try to do those between-annotation-service mappings. Instead, it's better just to say, I'm going to either do Ensembl or I'm going to do NCBI, and stick with that the whole way through. So I'll leave this one for you guys to do on your own.

So we have a new package that we built like a couple of
releases ago to do orthology, and we're just going to talk about that for a second. One thing that you may end up doing is not this sort of mapping where you're going from some gene ID to the GO terms or the KEGG terms or things like that. You may actually have this interspecies thing where you say, look, I've got a bunch of mouse genes, and I want to know what information we have in human for those genes; I want to do this orthology mapping between species. With this Orthology.eg.db package you can do that, and it uses select just like all these other ones do. So Orthology is the name of the package, and here I'm just getting the keys out of the human database, so I'm getting all of the Entrez Gene IDs for Homo sapiens, and I'm saying I want zebrafish. So I'm trying to map all of the human gene IDs to zebrafish, and we get like 66 thousand mappings, a bunch of which are NA. After we get rid of the ones that came back with no mapping, we have about 10,000 genes where we're mapping between human and zebrafish. And if you look at the top of this data frame that we got out of it, basically you've just got this mapping where we're saying NCBI Gene ID 14 maps to this NCBI Gene ID. So it just makes it easy to do this sort of mapping between species, if you ever run into a situation where you have to do that.

There's actually a lot of species in here. There's like almost three hundred and seventy; oh wait, it's a lot more now. We get this from NCBI, and I've obviously updated it, so there's 452 species in there now. It's all just the species name, so it's genus dot species, and then if there's, I don't know what they call that, whatever, so if there's like different rhinos or whatever. You can find the species that you're looking for and then do the mappings that you care to do.

So, the OrganismDb packages: one of the
things that's hard to do is this: if you're doing something with a TxDb, you're doing this positional mapping, so you've got the gene ID and you're mapping it to how many exons it's got or whatever. You've got this information that's all positional in nature, but if you want to know, well, what's the gene symbol, or something like that, that's not in that transcript database. So if you want to do this cross mapping of positions and functional annotations, it's hard to do. So we've developed these things called OrganismDb packages, where we've stuck all of these databases into one package so you can do these cross-database queries.

If you load the Homo.sapiens library and then just do the show method, which basically tells you what's in there, it shows that it's got the GO.db package in it and the org.Hs.eg.db package, so we've got the functional data, and we've also got this hg19 knownGene TxDb. One thing about this is that these positional packages are linked to the genome build. So if you're working with hg19 data, then this is what you would want to use, but if you've mapped all your data using hg38, obviously you wouldn't want to use this. And you can update it: there's a function called TxDb where you can basically say that the TxDb is now the hg38 one, so you can change the db object.

The columns and the keytypes: now if you ask for the columns for this Homo.sapiens, you get all of the columns and all of the keytypes for any one of the databases underlying that package. So we can get data out of here. If we wanted to get all of the genes from Homo.sapiens, and we wanted the Entrez Gene ID, the alias, and the UniProt ID, we just ask for those with this columns argument, and then when it spits it out, here's all that extra data. It's the same thing that we got above from the TxDb, but now we've got all of the aliases for that gene, the two UniProt IDs that map to it, and then we asked for the Entrez Gene ID,
which is sort of silly because we already get that. But it's sort of nice, because then you get all of the data, all the annotation, along with any of this functional data; you can actually have gene ontology terms here and other things like that. And we were talking above about how the CharacterList is a useful thing, and this is where it comes into play, in these GRanges objects. You can see here there are at least three aliases for this particular gene, and aliases are basically just symbols that are deprecated. We could have said symbol, and it would just be whichever the official current symbol is.

So, if we use the OrganismDb, how do we get all of the GO terms for BRCA1? We could do basically the same exact thing that you would have done using just the org package, but using the Homo.sapiens package.

So what gene does this transcript ID map to? Now, this is sort of tricky, because this right here is a transcript name that only comes from UCSC. UCSC, at least up to hg19, gave all of the transcripts these sort of random-looking IDs. So if you had a transcript that you got from using a TxDb package and you said, well, what is this thing, this is how you would do that. This is something that you couldn't have done using just the TxDb package itself, but using the Homo.sapiens package you can map the transcript name to the Entrez Gene ID and on to the symbol.

So how many other transcripts does that gene have? Now we know that it's ZNF23, and we can also get the Entrez Gene ID, so it's 7571, and then we could do something like, right, so here's this uc002fa1.3. If we had been working with that TxDb, we would know that we've got all of these transcripts, but we wouldn't really know what they were. And we could have just mapped this, right? If we had just the transcript ID, we could have used that with the Homo.sapiens package to figure out that it's ZNF
23, and any one of these should have mapped back to ZNF23.

So, we could get all the transcripts from the hg19 genome build along with their Ensembl gene ID, UCSC transcript ID, and gene symbol. Let me see: Ensembl, UCSC, gene symbol. So we just do transcripts and we ask for these three things: the Ensembl gene ID, the symbol, and the transcript name. And we get this thing where it says select returned a one-to-many mapping, and that one-to-many mapping is reflected in the transcript name, the Ensembl ID, and the symbol. So we're getting a list of IDs. A lot of these are just single, but there are obviously some where there are multiples.

The one thing with these OrganismDb packages is that they're not simple to put together, so we've only got three of them that I know of: human, mouse, and rat. So to make it easier to have more of these things, there's a different type of package called an Organism.dplyr package, and what this does is combine the data from the TxDb and the OrgDb into a local database, and then you can use the tidyverse dplyr package to make queries to these underlying databases. So it makes it a little bit easier to put these things together and to make a lot of these queries, although you can't add the gene ontology stuff in there. It's got the same functions that we have for the OrganismDb and the TxDb: keytypes, select, promoters, exons, everything like that is the same, so you can use these basically the same way.

If we load up the Organism.dplyr package, normally you would do something like this, where you say src_organism and give it an actual TxDb. For the purposes of this workshop we're just going to use this little cut-down thing, the hg38 light, which hardly has anything in it, which makes things go a lot faster. So if we do this src_organism and say where the thing is, and do the show method, it'll tell us that here's
where it is, this is the thing that we're querying, and these are the tables in it. The names pretty much tell you what's in each one. There's an id table, and then it's got mappings to accessions, to Gene Ontology (GO), and to GO ALL.

On the difference between the mappings to GO and to GO ALL: if you know anything about the Gene Ontology, it's a directed acyclic graph. The idea is that you can have genes that are mapped to a certain GO term, and by definition anything that's mapped to that term is also mapped to any of the higher-level terms as well. So if you have a gene that's directly mapped to something like, and here I'm just making something up, a GO term called protein phosphorylation, it would also be mapped to protein modification if that were a parent term. So GO ALL gives you the direct mappings plus the mappings to any higher-level GO term. Normally when you're doing something like a hypergeometric test you would need GO ALL, because it gives you all the direct and inferred mappings.

The rest of these are protein IDs, transcripts, coding sequences, exons, genes and transcripts.

To use this we can do just what we've done before. We can say we want the promoters from that database, and it'll do exactly what you'd expect, which is put out a GRanges object that has all the promoters. It's only 111 ranges because, like I said, it's this little cut-down thing, but these are the promoter regions for all these different genes. So for this Ensembl transcript ID, the promoter region is inferred to be on chromosome 1 between these two positions.

We can also look at each one of these tables from the underlying database. When we looked here it listed all these different tables, so we can actually just
get those tables and look at them. If you use this tbl function and say you want the id table, it'll basically spit out a chunk of that table so you can inspect it and see what's in it. The id table, which is sort of the main table, has the NCBI Gene ID, the cytogenetic position, the Ensembl gene ID, the symbol, the gene name and then the alias. Interestingly, you can see here this is basically a fully normalized database table: the only thing that differs across these first five rows is the alias.

Like I was saying before, the aliases are gene symbols that were given to this gene in the past and have been deprecated. So if you work with sort of an old-school biologist, they may call A1BG GAB or HYST2477. You can run into these situations where you get a bunch of gene symbols from a collaborator, and maybe they got those gene symbols from an older paper. That older paper may not call the gene A1BG; it may call it ABG or GAB or HYST2477. So you can use any one of these mappings to go from the alias to the official symbol, and then from the official symbol to the Ensembl gene ID or to the Entrez Gene ID.

Another thing that you can do with the Organism.dplyr package is make complex queries between tables. The upside of this is you can do complex queries; the downside is complex queries are complex. This is all dplyr syntax. With an inner join you're basically saying link up these two tables. It's like the merge function in base R: if you say I want to merge these two data frames on these two columns, it will give you back the merged data frame containing only the rows where there are matching values in those two columns. So if you've got one column that's got one through ten and the other column's got two, four, five, six, and you merge on that, you'll only get the rows with two, four, five, six. And if you do an inner join it does
the same thing, only here you don't have to say what you're going to join on. You just say I want to do an inner join on the id table and the ranges_gene table, and it'll tell you as part of the output what it used to do that inner join. It turns out that the id table has an Entrez Gene column and the ranges_gene table has an Entrez Gene ID column too, so it makes a bigger table by merging them based on just the Entrez Gene ID.

Then you can use these magrittr pipes to filter this bigger table based on different things. You can filter saying, look, I just want the symbols that are either ADA or NAT2; those are the only two genes I really care about. After that I'm going to use the select function from dplyr. If you're not familiar, this is a fully qualified function call. If you call select and dplyr is lower in the package search path than another package that's also got a select function, you will often get an error. So if we just said select here, it would probably not call dplyr's select; it would call a different select and then give you an error saying I don't even know what you're trying to get me to do. If you put dplyr, two colons, and then select, you will get the select function from the dplyr package. So we put dplyr::select in here and then say that we want these, what, seven columns.

Doing this select presupposes that we already knew what columns were available. Like we did up here for the id table, where we know there are these six columns, we could have done the same thing for ranges_gene to see what's in it. If we do that, we see that it's got gene_chrom, gene_start, gene_end, gene_strand and entrez. So when we do that inner
join thing, it's going to make a big table that's got all of these columns in there, so we know that we can ask for any of these names. Once we've done this complex query we end up getting just the chromosome, the start and the strand, and the symbol, the alias and the map. Again you can see here NAT2 has had lots of different aliases in the past. These first four rows are all the same exact position; the only reason it's duplicated is because of these aliases.

This first question here is a bit of a trick, and the reason it's a bit of a trick is because I didn't tell you about it: how many supported organisms are implemented in Organism.dplyr? One trick you can use to figure things out if you don't know what functions are in a package: if you use the search function, it'll show you the package search path. So I know Organism.dplyr is the fourth thing, and if I do this, it tells me all of the functions that are in the Organism.dplyr package. If I look through this there's supportedOrganisms, so that looks like a thing that we might be able to use. And it actually works: supportedOrganisms with no argument tells us there are 21 different organisms that are supported.

Ah, that's a good question, Abby. When you're making a SummarizedExperiment there are two ways you can do it. There's a way where you just make it by hand. So we can call SummarizedExperiment and, right, we're going to put a matrix of data in, and then to put the data in there we use this rowRanges argument. I don't know if you can see what's at the bottom here, but there's an argument called rowRanges. This isn't going to work, because they have to match up. I won't be able to just fake one up, because one of the things with the
SummarizedExperiment is it's got validity checking. Oh yeah, the tximeta package. So you can make them by hand like this, but usually what happens is you're going to be using something like DESeq2. Let's say you're doing an RNA-seq analysis and you use tximeta to import your data, or even just tximport. Then as part of DESeq2 you're going to make a SummarizedExperiment, and you can put the data in there as the rowRanges. So once you've got a SummarizedExperiment you can add it using the rowRanges argument, or, mostly, the packages you're using, like DESeq2, will automatically make it for you.

That's usually true of all these Bioconductor objects: you usually don't make them yourself, because it's tricky. There's this internal validity check. The idea is that the rows of your data have, let's say, gene IDs, and the rowRanges are going to have gene IDs too. One of the checks is to make sure that the gene IDs of your rowRanges match up with the gene IDs of your data matrix. The same thing is true of your phenoData, which tells you something about the columns: you have to have rows in there that match up with the column names. If you're trying to do it by hand and you get one of those things off, it'll error on you. It'll say, look, that's not valid, blah blah blah. So usually you use DESeq2 or something like that to build it.

So how many supported organisms are implemented in Organism.dplyr? It's 21, right?

Hey Jim, there were a couple of good questions a little farther up in the chat.

Uh, so these chats were just popping up for me, but now let me see if I can, chat, okay. Oh wow, there's a good jillion in here. I'm sorry guys, I didn't have this open. Urie, did you get the package installed
because Jenny's already been good and answered all these things.

"Aside from looking up all annotation data packages in Bioconductor, are there resources that describe essential annotation databases that we can easily", well, we're going to be getting to that. So Alex, your question about finding all the stuff: most of the things I've shown you so far are actually packages that you install, so they're installed in your version of R. We should probably get moving a little bit faster. There are a whole lot more things on this thing called the AnnotationHub that you have to query and look for, so we'll talk about that in just a second.

"How to get protein amino acid ranges?" Okay, so Urie, we're going to get to your question in just a second, because we're going to talk about BSgenome packages.

The BSgenome packages are basically just the sequences. If you load the BSgenome package and then look at available.genomes, there's a bunch of different packages you can get. The names are basically of the form BSgenome, then the species, then the provider, such as NCBI, and then the build version. The one thing they're good for is to get the sequences out. So if we load one, we'll just do the BSgenome hg19 version, and then do the show method, it'll show you all the stuff that's in it. It tells you it's hg19, it came from UCSC, and it's got all of the sequences: the regular chromosomes and then all these alternate haplotypes and stuff.

The one claim to fame, really, for the BSgenome packages is getSeq, which is what you use to get the sequence out of the thing. Another thing I didn't talk about is that you get a short name with it. We could have typed this whole thing, but it gives you the short name, so Hsapiens is the short name for the contents of this package. If we say getSeq from the Hsapiens package and we say chromosome 1, it's
going to spit out a 249-million-base-long DNAString of all of the sequence for chromosome 1, which obviously is super boring because the tails are all Ns, and you can't look at that much anyway.

But we can also use a GRanges object. This is the thing from genes that we looked at above; it's just a chromosome-and-position GRanges object, and we're saying give me the sequence for this. So if you had a GRanges that was the promoter region for a gene or a bunch of genes, you could use getSeq from Hsapiens and it'll give you all the promoter sequences for all those genes that you put in. It comes back as a DNAStringSet, and you can see these are nice: this one is 85 thousand bases long and it gives us dot dot dot in the middle, so it's not blowing some huge thing out, and it gives you the Entrez Gene ID over here. You can get longer or shorter DNAStringSets and you can do different things with them. That's sort of beyond the scope of this, but if you need to get the sequences, this is basically how you do it.

So all of those things that are installed, we are just packaging data that comes from online sources. Maywood asked what happens with DNA sequences that vary between individuals. This is the reference genome, and the reference genome doesn't vary between anybody. I guess I'm skipping over the fact that we do have SNP data as well, where you can get, for each position, the different variants that are known to be there and the minor allele frequency. So there's that as well, but this is just the reference, and I don't know how they decide what the reference is. So it's just the reference sequence; there's no variability in it.

AnnotationHub is an online resource. If we just load the library, the
AnnotationHub library, and then get a hub by calling AnnotationHub with no argument, it will download a bunch of data and populate a little database that tells you all the stuff that's in there. Then if you do the show method it'll tell you there are like 65,000 of these things. There's literally a good jillion of these. It gives you a little bit of information about them: data providers, so you can see lots of different places, lots of different species, lots of different types of data, taxonomy IDs, genome, description. There's just tons and tons of data here.

With 65,000 things, the trick here is kind of like when you're trying to figure something out using Google: the best way to do things is to do a good query. The best queries are based on the data provider, the data class, the species and the data source. If we call mcols on the hub, it's basically a big data frame, and for each thing you can get there's a title, where it came from, what the species is, the tax ID, the genome, all this information. We can then look at each one of these columns to see what sort of stuff is in there.

If we look at the unique values of data provider you can see a zillion different things. The big ones would be like UCSC, Ensembl, Inparanoid, but if you're working with non-model species or things like that you can see there's a lot of stuff, there's dbSNP. There are 73 unique values in there.

The data class tells you what sort of data they are, and a lot of these are obvious. GRanges we already talked about, so you know what that is. A 2bit file is a binary sequence file, so if you wanted to get a genome sequence for something that we don't have an actual package for, you could probably get a 2bit file for it and turn that into a BSgenome package
or just actually query it as if it were one. But you can see there are lots of different file types here, and lots of different species; there are like 2,822 species that have data on here. The source type is a little bit more helpful for some of it: Inparanoid is homology mapping, UCSC track is going to be mostly information coming from the UCSC genome browser, obviously, and there are Ensembl IDs from NCBI, UniProt from NCBI. Like I said, there's a lot of stuff here.

If you want to do a query, you use the function query. We have this thing called hub that we're going to query on, and then we're going to ask for three things: we want a GRanges object, we want it to be Homo sapiens, and we want it to be Ensembl. Unlike most of R, none of this is case sensitive, so you don't have to capitalize it correctly or anything like that. You just type it in lower case and we'll figure it out.

If we do that query for human GRanges objects from Ensembl we get 109 of these things. If you look at them it just gives you this little truncated thing where it shows you the first five and the last one, which is useful inasmuch as it didn't put 109 things there that you can't really see, but not useful inasmuch as there are a lot of things in between that you don't get to see. So one thing that I do sometimes, if I get something like this where there's a gajillion things, is look at the source URL. If I take the query result and do dollar sign sourceurl, it pulls out that one column of where each thing came from. As you can see here we've got 109 things. When I first put this workshop together there weren't 109 things, so it wasn't stupid to print all this out, but now there are and it's kind of stupid.

Anyway, you can see at the beginning here we've got these first five things that all come from UCSC, and you know that because it says UCSC in the name. Then the rest of this is all mostly just
Ensembl. This is Ensembl GRCh37 release 70, and Ensembl has got release after release after release. You can see we've got one, two, three, four, five, six, seven GRanges for GRCh37 for all of these different releases, and then we go to GRCh38 and there's a bunch more releases, and it keeps going up and up. Mostly the only thing that's changing here is that sometimes it's a different sort of GTF, but mostly it's just all the genes for these different builds.

What I decided is I'm going to get this one, mainly because I was having problems; it wouldn't download this one yesterday. It told you how to do that up here: this is that AH thing, and it says right here, retrieve records with object, double bracket, AH whatever. So that's what I'm going to do. Down here I take the query result and do double bracket and, in quotes, AH98495. What it'll do is download this GTF from Ensembl, and then I'll have a GRanges object.

The problem with this GRanges object is it's got a lot of information about exons and CDS and things like that, so it's not super useful the way it is. But what you can do is use this function makeTxDbFromGRanges and it'll turn it into a TxDb, which we already know how to do things with. So we do that and it makes a TxDb object that's got like 244,000 transcripts, and it gives you all this information about it.

So, questions. How many resources are on AnnotationHub for Atlantic salmon? We sort of already know how to do that: we just query on that species, and there are like 64 things here. These are different things. This abinitio GTF, I'd have to go and look, is a special kind of GTF. This one right here is just the regular one, with all the genes, and this abinitio one I think has got more genes or something. But anyway, you
can look at these things and say, you know, this here is an Ensembl EnsDb, so you keep that, and you already know what that's going to be. There's a version 106 GTF, and there's this cDNA all 2bit file, so this right here is going to be sequences for all the transcripts. Then there's this non-coding thing, which is going to have all the sequences for the non-coding genes. So it's pretty cool. There's lots of stuff there. It's a little bit tricky to find things and to figure out what's there, but especially if you're working with non-model organisms there's a lot of data there that would be hard to get otherwise, and it's nicely packaged for you.

The next one: get the most recent Ensembl build for domesticated dog and make a TxDb. We already showed you how to do that for humans, so at your leisure go ahead and try that.

The last thing we're going to talk about is the biomaRt package. What biomaRt does is allow you to query the Ensembl data. Everything we've talked about so far is mostly NCBI based, except for the EnsDb packages; most of this is either UCSC sequences and positions or NCBI Gene ID mappings. But if you've got Ensembl gene IDs and you want to map those to gene names or symbols or whatever, you wouldn't want to use something like the org.Hs.eg.db package, because that's based on Entrez Gene IDs. You'd instead want to use biomaRt.

The way it works is first you load the package, and then you can list the marts to see all the things that are available using this listMarts function. You first use the useEnsembl function to make what they call a mart, and the mart is basically just a connection to their database. Then you can list all the datasets that are available. There's a gajillion of them, but you can list them and see, oh look, I work with eastern happy so I want to get this particular one, or I'm a
golden eagle person, I'm going to use this one, whatever. But we're going to just do something simple and trivial, which is to use Homo sapiens. To set up a mart to use Homo sapiens IDs we would call useEnsembl. The first argument is ensembl and the second one is hsapiens_gene_ensembl. I don't know why there's this huge gap there, but pretend there's not.

Then when we make a query we use a function called getBM, and it's got four arguments: attributes, filters, values and mart. The attributes are the things that we want to get. The filters are the types of IDs that we're going to give to biomaRt, and the values are the actual IDs. So if we say we're going to give it Ensembl gene IDs, the filter would be Ensembl gene IDs and the values would just be a character vector of Ensembl gene IDs. And then the mart is just the mart that we're going to query.

As far as the attributes and filters go, a lot of them are kind of inscrutable, so you have to know what they are. You can use listAttributes and listFilters to get a list of all the different choices. For example, if you have Ensembl gene IDs it's ensembl_gene_id, so that would be your filter or your attribute. If you have Ensembl gene IDs that are like ENSG, long number, dot one or dot five or whatever, that's a gene ID with a version, so you would have to use ensembl_gene_id_version. So listAttributes and listFilters tell you basically all of the things that you can give as attributes or as filters.

If we just wanted to do a really simple example query, here are four Affy IDs that I chose. We're going to say we want to get the HGNC symbol, which is basically the gene symbol, and we're going to tell biomaRt that these are Affy HG-U95Av2 IDs; the Affy IDs are the things that we're going to give it, and we already set up the mart. So one,
two things to note here. First thing: the filter is affy_hg_u95av2. Jenny's got a hack; that's not such a bad idea. So the trick here: if you look at this, the filter and the attribute are the same thing, affy_hg_u95av2. Here they're identical; sometimes they're different.

The other thing that's a trick with this is you're querying a database, and with a database there's no guarantee on the order that things are returned to you. So whenever you use biomaRt, always put in the attributes the same thing that you're using as a filter, so that you get it back. You can see here we did this getBM, we asked for the symbol back, and we said these were Affy IDs. The order we gave was 1000, 1001, 1002, 1007. What we got back was 1000, 1007, 1001 and two 1002s. So the order is never guaranteed to be the same as what you gave it, unlike select and all of these other functions that we looked at before. You always have to ask for this column back so you can then filter and match things up. Of course you'd want to get rid of this duplicate here, but then you'd want to use something like merge or match to reorder this so it's in the same order as the data that you originally started with. So you'd want to match this column to the Affy IDs to reorder it.

And we're about out of time. The biomaRt exercises here are relatively simple; you can go ahead and try those. I'll be around, so if you have questions feel free to ask. On the main website there's, what do they call it, the meet people or something, connections. You can send me a connection and ask me a question. Or you can ask questions on the support site; I'm usually on there answering questions. So does anybody have any other questions for now?

Okay, cool. You can use select with biomaRt? I did not know that. Is that new, Jay?

Yeah, um, I
think it's at the bottom of their vignette that they added that in. But you still need to do their attributes; like, you have to put it in both as a filter and an attribute to get it back out, right?

Oh right. So their select is doing what select does normally? It reorders to make sure that the order stays the same?

Um, I don't know if the select reorders. I just know that I switched over to teaching that: oh look, you use it with select and columns and keytypes and everything, instead of getBM. I think it's at the bottom of the biomaRt vignette that they put that functionality in.

The other issue with biomaRt and with the AnnotationHub is that it's online data, and being online, sometimes you can't access it. So you'll do what I'm doing right now, and you can see it's taking forever. Sometimes what'll happen is it'll try to get mirrors, sometimes it'll just take forever, and sometimes it'll just say, yeah, I couldn't do it. If it does that, oftentimes you just have to go back and try again. Yeah, you can see what's happening here: it's trying US West, it's probably going to try US East. There we go. Okay, so what is it? It's the same, it's just like that, Jenny?

I forget off the top of my head. I'll have to look it up. If anybody's interested, we can share things with everybody outside, because I know workshop people are going to have to start moving to the next one. Anybody in here, any questions? Okay, well, I will share this with Jim; maybe next year he'll put it in. Otherwise, let's thank Jim very much for the workshop, and we'll see everybody at the rest of the conference. Bye.

Hope it was helpful. See you, Jenny.
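
To recap the ordering caveat from the biomaRt part of the session, here is a minimal base R sketch of the "drop the duplicate, then match back to the input order" step. It is an illustration under stated assumptions, not code from the workshop: the res data frame is made up by hand to stand in for what a getBM call might return, the probe IDs are illustrative, the symbols are placeholders, and nothing here contacts Ensembl.

```r
# Probe IDs in the order we supplied them (illustrative values).
ids <- c("1000_at", "1001_at", "1002_f_at", "1007_s_at")

# Hand-made stand-in for a getBM() result: rows come back in an arbitrary
# order and one probe maps to two symbols. The symbols are placeholders,
# not real annotations.
res <- data.frame(
  affy_hg_u95av2 = c("1000_at", "1007_s_at", "1001_at", "1002_f_at", "1002_f_at"),
  hgnc_symbol    = c("GENE_A", "GENE_D", "GENE_B", "GENE_C1", "GENE_C2"),
  stringsAsFactors = FALSE
)

# Keep only the first row for each probe ID (one simple way to handle
# the duplicate; other choices are possible).
res_unique <- res[!duplicated(res$affy_hg_u95av2), ]

# match() returns, for each input ID, its row position in the result.
# Indexing with those positions restores the original input order.
ordered <- res_unique[match(ids, res_unique$affy_hg_u95av2), ]
rownames(ordered) <- NULL

print(ordered)
```

Keeping the first match per probe is a simplification. Whether to keep, collapse or drop probes that map to more than one symbol depends on the analysis, and any input ID missing from the result would show up as a row of NA values after the match step.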