Welcome back, everyone, to the third and last part of today's lecture. Today we have talked a lot about the different databases that are out there, a little bit about how databases are structured, and why they can be useful in R. I want to spend the last 50 minutes or so showing you how you can get your data into R, so we are going to talk about biomaRt.

If the data you want in R is available online, you can of course search for it manually and create your own Excel file. But that is tedious manual labor: you copy and paste from the website into your Excel file. I have done this many, many times. Sometimes you run into a database that does not provide any other way of getting your data: the data is in their database and the website presents it to you, but in a way that you cannot easily bulk download. The problem is that this is very error-prone.

You can also download bulk data via FTP, and that is usually the way to go in bioinformatics: go to something like Ensembl, go to the FTP site, and get the GFF file, so the file that describes all the genes on the genome. This has less chance of creating errors. But then the problem becomes that all of these databases store their data in a slightly different format and a slightly different structure, so you end up with different data formats on your hard drive.
So you have downloaded a file from Ensembl and some files from TrEMBL, and now you want to start merging this data. You really can't do that easily, because you have to load the data in different ways: the format is different, one file has more columns than the other, and sometimes it is not obvious what the primary key is or what the foreign key constraints are.

One of the best ways that I have found to get biological or biomolecular data into R is BioMart, which is one of those tools that I use all the time. BioMart can be used to retrieve your data directly into R. The advantage is that if the database gets updated, because new genes are discovered and added to the genome or new proteins are discovered, then your analysis is automatically updated when you rerun it. That is something an FTP server doesn't do. You download the file on the 20th of January 2021, and when you redo your analysis in March or April the data on your hard drive hasn't changed, so you have to go to the database again and download this whole big file again. This is where BioMart really shines: it gives you a very nice way of keeping your analysis up to date with the data that is currently in the databases.

So BioMart allows you to query databases from R. BioMart describes itself as a community-driven project to provide unified access to distributed research data, to facilitate the scientific discovery process. Many of the big biological databases can be queried via BioMart. Besides R, it is also available for things like Perl and Python, you can query it using SOAP and REST, and you can even query it directly from an XML file. So it is a really good way of getting your data in a structured
format, and this format you can more or less decide yourself, so you never have to reformat things just because the database changed, like you have to with FTP files or with files that you make yourself.

There are some core concepts when you are dealing with BioMart. A mart is an information provider. It is kind of like a supermarket, although really it is more like a specialty store: one store sells shovels, another sells clothing, and a third sells books. Each provides a certain type of information. For example, Ensembl is a mart, because Ensembl is an information provider.

An information provider like a mart can contain many different datasets, although some contain only one. Ensembl contains many datasets, and a dataset in Ensembl is usually structured around an organism: you have the dataset mouse genes, you have cow genes, you have human genes. Those are different datasets.

The other three terms are more or less the way that you query the data. First, we can ask for certain attributes. You connect to a mart and a dataset, and then using the attributes you tell the information provider:
"I want to have these columns, with this type of information." That is what you are requesting.

Then you have filters. Filters allow you to restrict the data that you retrieve. For example, a filter could be the ID of the gene, the name of the gene, or the location on the genome. You can apply different filters and combine them to filter down your data. And then you have values, which are the things you are querying for: the concrete values that you hand to each filter. That is just a little bit of terminology before we dive into how to use biomaRt.

I am not going to do this example live. I was originally planning to, but over the last week Ensembl and BioMart have had some serious connection issues, which could mean that we sit here five to ten minutes waiting for data to come back. To prevent that I made slides for you, so that we don't have to wait for queries to finish or, if we are unlucky, find that the whole system is down. That just happens. It is a community project, so it has some funding, but it is not something like Amazon with 99.999 percent uptime.

Let's say that we want to investigate mice. It could be plants, it could be humans, but for this example I took mouse, and we want to look at chromosome 3 between 15 and 45 megabases. That is the region that we have decided we are interested in. We want to know what the important genes at this location are, so we want to get all of the genes in this region.
We want to get their location, the number of exons that they contain, and, for example, the number of functional SNPs inside these genes. These are very common questions in bioinformatics. For example, you did a QTL mapping and identified that this region of the mouse chromosome controls the length of the tail or the size of the ears, and now you want to see which gene might be responsible, and whether there are any variants in this gene region that might cause the phenotype to differ between the mice that we have.

So in this case we start working in R, and when we create a new script, as always, you want to make a good header. I wrote it down here: I am going to analyze mouse chromosome 3 in this region, it is copyrighted by me, it was first written then and last modified then. This is the standard information that I put in all my scripts.

After we have created a good header, we load the biomaRt library. Of course we have to install it first, but after that we can just load it. Then we set a working directory, because we don't want to save everything in the default user documents folder on the C drive. We set the working directory to where we want to write and read files, although we are not going to read files here because we get our data from BioMart directly into R.

Now we need to connect to a mart, and since we are interested in mouse, and in genes and genomes, we can just connect to Ensembl.
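A minimal sketch of the script skeleton described here: a header comment, loading the package, and setting a working directory. The name and date fields are placeholders (the slide contents are not in the transcript), and the temporary folder is only an assumption so that the snippet runs on any machine.

```r
# Analysis of mouse chromosome 3, 15-45 Mb
#
# copyright (c) <your name>      (placeholder)
# first written: <date>          (placeholder)
# last modified: <date>          (placeholder)

# Install once (biomaRt is distributed via Bioconductor):
#   install.packages("BiocManager"); BiocManager::install("biomaRt")
if (requireNamespace("biomaRt", quietly = TRUE)) {
  library(biomaRt)
} else {
  message("biomaRt is not installed; see the install comment above")
}

# Hypothetical project folder; replace with your own path.
work_dir <- file.path(tempdir(), "chr3_analysis")
dir.create(work_dir, showWarnings = FALSE, recursive = TRUE)
setwd(work_dir)
```

In a real project you would point `work_dir` at a fixed folder, so that anything the script writes ends up in a place you can find again.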
So the first call that we make in R is useMart with "ensembl". Just like the database driver for SQLite in R, this gives us back a connection object: a variable that we can run queries on. We can then use the listDatasets function to list all of the datasets that are available on the Ensembl information provider.

When we do that, we get back something that looks like this: a massive list of all the datasets that Ensembl provides. You can see that most of the names are the species name, then an underscore, then gene, then an underscore, then ensembl. Then there is a little bit of a description, because who knows what an A. mexicanus is? Well, it is actually a cave fish. I didn't know that. If you are interested in your species then generally you know the scientific name, but if you don't, you can look at the second column, which tells you what kind of species you are looking at. The third column...

Did you just say gay fish? No, I didn't say gay fish. I said cave fish, as in a fish that lives in a cave. I'm sorry. No, we are not talking about, what's he called again, Kanye West. Although now I did just say gay fish.

Sorry, now I have lost my train of thought. So the first column is the name of the dataset and the second one is more or less the description. Do you like fish sticks?
Yeah, I like fish sticks. All right, come on people, I know we have already been at this for two hours and 17 minutes or something like that. Two hours and 20 minutes, actually; with my overlay I can see how long I have been streaming.

So, the first column is the name of the dataset, the second column is a human-readable description that doesn't use the official term, and the third column is the interesting one, and it is something that you have to remember: which version of the database are we using? You can see here, for example, a version string such as Pmarinus_7.0. If we write a publication and want to mention which version of the genome we did this on, then we have to specify this in the publication: we used BioMart to retrieve our genes, and we used this gene database, version 7.0. Every species has its own versions, and they get updated independently of each other.

So that is the first thing. We go through this whole long list, and we actually don't have mouse... oh yes, here, all the way at the bottom: mmusculus_gene_ensembl. I know it is very small, but that row contains the version of the mouse genome database that we are using. So let's connect to this database.
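Scrolling through the whole dataset list is easy to get wrong, so here is a small sketch of searching it instead. `useMart()` and `listDatasets()` are the biomaRt calls mentioned in the lecture; the helper `find_dataset()` and the toy table are made up for illustration so that the snippet runs without a network connection.

```r
# Hypothetical helper: search the table returned by listDatasets()
# (columns: dataset, description, version) for a species.
find_dataset <- function(datasets, pattern) {
  hits <- grepl(pattern, datasets$dataset, ignore.case = TRUE) |
          grepl(pattern, datasets$description, ignore.case = TRUE)
  datasets[hits, , drop = FALSE]
}

# Hand-made stand-in for the real table, so the helper can be tried offline.
# The version strings are placeholders, not real Ensembl versions.
toy_datasets <- data.frame(
  dataset     = c("amexicanus_gene_ensembl", "btaurus_gene_ensembl",
                  "mmusculus_gene_ensembl"),
  description = c("Cave fish genes", "Cow genes", "Mouse genes"),
  version     = c("version_A", "version_B", "version_C"),
  stringsAsFactors = FALSE
)
print(find_dataset(toy_datasets, "mouse"))

# With a live connection (needs the biomaRt package and network access):
# mart     <- useMart("ensembl")
# datasets <- listDatasets(mart)
# find_dataset(datasets, "mmusculus")
# mart     <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")
```

The `version` column of the matching row is what you would quote in a publication.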
Again we call the useMart function. This time the first parameter is the data provider that we want to connect to, and the second parameter is the dataset that we want to connect to. We just overwrite our mart object. This now allows us to query the attributes and the filters that are available for this dataset, because different datasets have different attributes. Every one of these datasets is kind of a big Excel table with information, but some things are available for mouse that might not be available for cow, and some things are available for fish that might not be available for mouse. So we have to know which attributes we can query and which filters we can use.

As a tip: normally there are thousands of attributes, so I generally focus on the first 20 or 30, or I use the grep command to search for something very specific. Just calling listAttributes(mart) dumps a massive amount of data on your screen, and then you have to scroll up and start looking. Generally the things that you want are in the first 50 to 100 attributes, because those are the most used.

If we list the first 10 attributes from the mart, we see that we can retrieve things like the Ensembl gene ID, the gene description, the chromosome name, the start position, the end position, and the strand: is this gene on the positive strand or the negative strand?
And the band it is on. This is relatively old, and I don't think we have talked about it. In the old days, before sequencing became, well, not cheap, but cheaper than it was, so 15 to 20 years ago, locations of genes were given by chromosomal band. It would say something like 5p or 7q3, and this described where on the genome it was. This is based on looking at stained chromosomes through a microscope: you see little bandings, which are due to higher and lower GC content. The bands were originally used to identify positions on the genome, and something like OMIM still mentions which band a certain association is on.

Then we can list the filters. The filters are things like chromosome name, so we can filter by chromosome, and by start and end position. We can also filter by band, we can use things like markers, and we can filter on strand. But we wanted to look at mouse genes on chromosome 3 from a certain position to a certain position, so number 9 is the filter that we want: chromosomal_region. It allows us to specify a region as chromosome 1, colon, from 100 base pairs to 10,000 base pairs, on the negative strand, or chromosome 1 from 100,000 to 200,000 base pairs on the positive strand.

So the attributes are the things that you retrieve, and the filters are the things that you provide to the database. The obvious filter to use here is chromosomal_region, because we are interested in mouse, we found a QTL, so an association, and now we want to drill down: which genes are there,
and which other things can we do with it?

So let's define an initial value for our filter. I have to make a decision: do I want to look at the positive strand or the negative strand? In this case I just look at the positive strand. Normally you would look at both strands, and you can actually drop the fourth field: you can just write 3, colon, 15 million, colon, 45 million and close your string. The last colon and the 1 say that you want only the positive strand (or, with -1, only the negative strand). Leave it out and you get all of the genes. For this example I said, let's limit ourselves to genes on the positive strand.

I actually had a discussion with a PhD student recently who said: the next time I look for a PhD position, I am going to make sure that the gene they work on is on the positive strand. Strandedness in DNA just drives you crazy. If your gene is on the negative strand, you will mess up the direction; it is much easier when your gene is on the positive strand. So even though the causal gene might be on the negative strand, we are limiting ourselves to the forward strand.

And of course we want to retrieve all the genes that are there, so we retrieve gene IDs. I have defined my region, and then I define my attributes: I make a little variable, and the attribute that I want to retrieve is the Ensembl gene ID.
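As a sketch of the two variables defined at this point: the region string has the form chromosome:start:end with an optional :strand field. The helper `make_region()` is hypothetical and only there to avoid typing the zeros by hand; the variable names are my own, not necessarily those on the slides.

```r
# Hypothetical helper: build a value for the chromosomal_region filter,
# "chromosome:start:end" with an optional ":strand" (1 or -1).
make_region <- function(chr, start, end, strand = NULL) {
  region <- paste(chr, sprintf("%.0f", start), sprintf("%.0f", end), sep = ":")
  if (!is.null(strand)) {
    region <- paste(region, strand, sep = ":")
  }
  region
}

my_region     <- make_region(3, 15e6, 45e6, strand = 1)  # forward strand only
both_strands  <- make_region(3, 15e6, 45e6)              # no strand restriction
my_attributes <- c("ensembl_gene_id")

print(my_region)
print(both_strands)
```

Using `sprintf("%.0f", ...)` keeps large positions from being written in scientific notation, which the filter would not understand.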
So just give me the IDs of the genes in this region. Now we can execute the query, and this is the core of biomaRt. biomaRt provides the getBM function, which is "get from BioMart", and we execute the query using this function. What we say is: getBM, the attributes that I want to retrieve are my attributes, the filter that I am going to use is chromosomal_region, and the value that I provide to the database is my region. Then I have to specify in which mart I want to search, so I write mart = mart, because the connection variable that I defined before is called mart. That just means: use this information provider and this dataset.

After I execute this query, I can ask for the number of rows of gene IDs, because that is the variable I stored the result in. I get back a data frame with 211 rows, which means that there are 211 genes on the positive strand in the region we were interested in: chromosome 3 from 15 megabases up to 45 megabases. You could have done this in a different way: go to Ensembl, download all of the mouse genes, load those into R, and do the filtering yourself. But then you spend a lot of time on something that biomaRt just provides for you.

We can also extend our attributes to get more information, because getting the ID is not enough. For example, we might want the gene symbol, because normally you don't mention the ENSMUSG and then the ID; you mention the Mouse Genome Informatics symbol. Just like in humans, mouse genes have names that are more or less decided on by a committee, and they are linked to the human names to have a
common vocabulary, so that when we talk about a certain gene everyone knows which gene you mean. I am getting the description of the gene as well, and because I want to know where the gene is located I am getting the chromosome name, the start position, and the end position. So I extend my attributes by making a little vector with all the things that I want to retrieve.

Again I use the getBM function, and here I show you the first 10 rows. The first gene in this region (I don't know if the list is sorted by start position; probably not, it probably comes back in an arbitrary order) is called Gm24704, and it is a predicted gene. So it is not a real gene: the computer said that this region looks like a gene. It is located on chromosome 3 from here to here. You can also see that the query gives back microRNAs and so on.

But I didn't want any microRNAs or predicted genes. I was interested in real genes, which code for proteins, because I have this association and I really want to find a gene. I don't want to deal with things like microRNAs and predictions; I want to make sure that it is real. I can now do two things. I can add the biotype as a column: I take my attributes and add the biotype, and then it retrieves the biotype, which is, for example, microRNA, lincRNA, predicted gene, or protein-coding gene.
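A sketch of the query with the extended attribute vector, plus the first of the two options: adding the biotype column and filtering in R. `getBM()` is the biomaRt function from the lecture. The attribute names are the ones I recall from the Ensembl gene mart and should be checked against `listAttributes(mart)`; the helper names and the toy rows are invented so the filtering step can be tried offline.

```r
# Attribute names as I recall them from the Ensembl gene mart (an assumption);
# verify with listAttributes(mart) before relying on them.
my_attributes <- c("ensembl_gene_id", "mgi_symbol", "description",
                   "chromosome_name", "start_position", "end_position",
                   "gene_biotype")

# Thin wrapper around getBM(); it only contacts BioMart when it is called.
fetch_genes <- function(mart, region, attributes = my_attributes) {
  biomaRt::getBM(attributes = attributes,
                 filters    = "chromosomal_region",
                 values     = region,
                 mart       = mart)
}

# Hypothetical helper: do the biotype filtering yourself in R.
keep_biotype <- function(genes, biotype = "protein_coding") {
  genes[genes$gene_biotype %in% biotype, , drop = FALSE]
}

# Made-up rows standing in for a query result (the IDs are not real).
toy_genes <- data.frame(
  ensembl_gene_id = c("TOY_GENE_1", "TOY_GENE_2", "TOY_GENE_3"),
  gene_biotype    = c("miRNA", "protein_coding", "lincRNA"),
  stringsAsFactors = FALSE
)
print(keep_biotype(toy_genes))

# Live usage (needs a mart connection):
# genes <- fetch_genes(mart, "3:15000000:45000000:1")
# head(genes, 10)
# nrow(keep_biotype(genes))
```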
A protein-coding gene is what we are looking for. So you can add the biotype and do the filtering yourself in R. But you don't have to: you can make your filter more complex and filter on the gene's biotype. Instead of retrieving all the genes, including microRNAs and predicted genes, we make our filter a little bit bigger, like we did with the attributes, to tell the database that we are only interested in protein-coding genes.

This is a little bit different. We define our filter to say that we are going to provide two pieces of information to filter on: the first is the chromosomal region and the second is the biotype, so the type of gene.

Now I have to define the values that I want to search for, and here you have to be careful, because this now needs to be a list. You can provide multiple regions, so I don't have to limit myself to one region, and I don't have to limit myself to one biotype: I could say, give me all the protein-coding genes and microRNAs. Because the regions can be a vector and the biotypes can be a vector, I have to use the list function to define my values. The first element of the list holds the vector with all of the regions that I am querying, and the second element can hold multiple things like protein coding and microRNA. These I call my values.

Then again I just do the query: give me the attributes, the filter is my filter, the values are my values (those are the variables that I defined), and query the same mart. Then the question can be answered: how many protein-coding genes did we find in this region?
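The two-filter query can be sketched as follows. The point to notice is that `values` is a list with one element per filter, in the same order as the filters. The filter name "biotype" is what I recall from the Ensembl gene mart and is an assumption to check with `listFilters(mart)`; the wrapper function is hypothetical.

```r
my_filters <- c("chromosomal_region", "biotype")

# One list element per filter, in the same order as my_filters.
my_values <- list(
  c("3:15000000:45000000:1"),  # one or more regions
  c("protein_coding")          # one or more biotypes
)

# Hypothetical wrapper: checks the filter/value pairing before querying.
fetch_filtered_genes <- function(mart, attributes,
                                 filters = my_filters, values = my_values) {
  stopifnot(is.list(values), length(values) == length(filters))
  biomaRt::getBM(attributes = attributes,
                 filters    = filters,
                 values     = values,
                 mart       = mart)
}

# Live usage (needs a mart connection):
# coding <- fetch_filtered_genes(mart, c("ensembl_gene_id", "mgi_symbol"))
# nrow(coding)
```

To ask for protein-coding genes and microRNAs together, the second element would become a two-element vector, while the list itself keeps length two.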
Well, we just look at the number of rows of what we retrieved, and it tells me that there are 58 protein-coding genes on the forward strand of mouse chromosome 3 from this position to that position.

Then we can start with the next part. For each gene, we want to know how many exons it has and how many SNPs are located in the exons, so how many SNPs have the possibility to change the protein, that is, to change an amino acid.

Let's retrieve the exons first, because we need to know the exon locations before we can query SNPs. We have the gene start and end positions, but a gene has introns and exons, and we are generally not interested in SNPs located in introns. We are only interested in exons, the parts of the DNA that code for the protein.

So we have 58 protein-coding genes, and of course we want the computer to go through them. We can now use the Ensembl gene ID as the filter and query information per gene. We are going to write a for loop, and in each iteration we request all the exons that belong to that gene.

How do I do that? I say that I now want to retrieve gene-specific attributes, which I call gene attributes: the Ensembl gene ID, the exon ID, and the exon start and end positions. I don't have to ask for the exon chromosome, because I already know that I am only working with chromosome 3, but you could add the chromosome if you are querying multiple regions at the same time. So what do we do? We just say: for gene in the gene IDs that we got back. So I am just using the data that I got back from BioMart.
So what am I going to do? I now provide BioMart with the Ensembl gene ID, so I tell it: you will get an Ensembl gene ID from me. Which ID am I providing? That is this gene value, because we are going through the list one by one. And the attributes that I want to retrieve are the gene attributes.

When I run this little piece of code, and I put in a cat here that shows the gene name and how many exons we retrieved, then for the first gene, the one ending in 39335, we retrieved 13 exons, the next gene has 16 exons, and the next has eight. So this is very easy and very quick, just a basic for loop, and we retrieve the data directly. If I save my code and run it again in two weeks, the numbers might change, because data about genomes is always progressing. Someone might find that this gene actually has not 13 exons but 16, and then this number updates automatically. I don't have to do anything for that.

For the final question: how many SNPs are there? The SNPs are actually in another mart. They are not in the gene mart, which is logical; they are in the dbSNP part. They are part of Ensembl, but they are not in the same dataset, and they are actually in a different mart, so a different data provider. Ensembl as a data provider gives you gene and exon information and so on, but if you want to connect to SNPs, then the mart that you want to use is called ENSEMBL_MART_SNP. And of course we want to look at the mouse SNPs, so that is the mmusculus_snp dataset. We call this connection snp_mart instead of mart, because we don't want to overwrite the old one.
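Looking back at the per-gene exon loop, a minimal sketch could look like this. The attribute names are the ones I recall from the Ensembl gene mart (an assumption; check `listAttributes(mart)`). The `query` argument and the fake query function are my own additions, there only so the loop can be run without a connection; by default the loop calls `getBM()`.

```r
gene_attributes <- c("ensembl_gene_id", "ensembl_exon_id",
                     "exon_chrom_start", "exon_chrom_end")

# Loop over gene IDs and collect the exons of each gene.
# `query` defaults to biomaRt::getBM and is only looked up when needed.
fetch_exons <- function(gene_ids, mart, query = biomaRt::getBM) {
  exons <- list()
  for (gene in gene_ids) {
    res <- query(attributes = gene_attributes,
                 filters    = "ensembl_gene_id",
                 values     = gene,
                 mart       = mart)
    cat(gene, ":", nrow(res), "exons\n")
    exons[[length(exons) + 1]] <- res
  }
  do.call(rbind, exons)
}

# Made-up stand-in for getBM(): two exons for the first toy gene, one otherwise.
fake_query <- function(attributes, filters, values, mart) {
  n <- if (values == "TOY_GENE_1") 2 else 1
  data.frame(ensembl_gene_id  = rep(values, n),
             ensembl_exon_id  = paste0(values, "_EXON_", seq_len(n)),
             exon_chrom_start = seq_len(n) * 100,
             exon_chrom_end   = seq_len(n) * 100 + 50,
             stringsAsFactors = FALSE)
}
toy_exons <- fetch_exons(c("TOY_GENE_1", "TOY_GENE_2"),
                         mart = NULL, query = fake_query)

# Live usage (needs a mart connection and the gene IDs from the earlier query):
# exons <- fetch_exons(coding$ensembl_gene_id, mart)
```

One query per gene is simple but slow for long gene lists; `getBM()` also accepts a vector of IDs as the value, which is worth trying when the service is responsive.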
When we do this, we say snp_mart is useMart with ENSEMBL_MART_SNP, taking the SNPs from mouse, and then we can list the attributes again. Here we see that we can get the RefSNP ID, so the rs ID. We can get the source, the description, the name, the start, the end, the strand on which it is located, and the alleles, so is it a C/T SNP, a T/A SNP, or a G/C SNP. We can also get things like whether this SNP is validated, so what information in the database supports that this SNP is real, such as how many groups have reported it. If a SNP has only been found by one group, you are not really going to trust it; it might be real, but you don't know. If it has been reported by 15 different groups, from Harvard to Oxford to other well-known universities like the prestigious Humboldt University, then the validation status gives you more certainty that it is real.

Of course we also have to list our filters: how can we tell the database what we want to retrieve? Here again we see that we can use the chromosomal region, and in this case we want to, because for each exon we have the chromosome, the start position, and the end position. There is no filter here for Ensembl gene ID, because SNPs are not coupled to genes; they are located on the genome. So we need to use the exon start and end positions, and we can query all the SNPs using the chromosomal region again. And then we can filter again based on things like the consequence type.
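A sketch of the SNP side: connecting to the second mart and turning one exon's coordinates into a chromosomal_region query. The mart and dataset names are the ones given in the lecture; the attribute names are what I recall from the Ensembl variation mart and are an assumption to verify with `listAttributes(snp_mart)`. The helper functions and the example coordinates are made up.

```r
# Attribute names recalled from the Ensembl SNP mart (an assumption).
snp_attributes <- c("refsnp_id", "chr_name", "chrom_start", "allele")

# Separate connection, so the gene mart is not overwritten.
connect_snp_mart <- function() {
  biomaRt::useMart("ENSEMBL_MART_SNP", dataset = "mmusculus_snp")
}

# Hypothetical helper: exon coordinates -> chromosomal_region value.
exon_region <- function(chr, start, end) {
  paste(chr, sprintf("%.0f", start), sprintf("%.0f", end), sep = ":")
}

# Query all SNPs inside one exon; `query` defaults to biomaRt::getBM.
fetch_exon_snps <- function(snp_mart, chr, start, end, query = biomaRt::getBM) {
  query(attributes = snp_attributes,
        filters    = "chromosomal_region",
        values     = exon_region(chr, start, end),
        mart       = snp_mart)
}

# Made-up exon coordinates, just to show the string that is sent.
print(exon_region(3, 15000100, 15000350))

# Live usage (needs network access):
# snp_mart <- connect_snp_mart()
# snps     <- fetch_exon_snps(snp_mart, 3, 15000100, 15000350)
# nrow(snps)
```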
The consequence type could be non-synonymous, as in changing the amino acid sequence. But we could also filter based on, for example, the SIFT score, which is a prediction of how impactful the change will be. This is left as an exercise for you, because I have been talking for a long time already, two hours and 40 minutes, and you have to do something during the exercises as well. So the exercises will be using biomaRt to retrieve part of the SNP database and filter it down.

All right. Is there a tier? Oh, that's interesting. You see the mood box; it actually went wrong. I found a bug. You're hacking my mood box. Don't hack my mood box. Interesting. I can fix that later on. No, there is no reward for finding bugs. Well, eternal gratitude, that's a reward in itself, right?

All right, so today I told you about why we need to use databases, what the organization of databases is, and what features they have. Just so you know, you need to use capital letters. And don't shout poop in my chat. I'm not going to ban you or put you on a timeout, but it has to be a single word; you have to do it like this.

So I told you about database organization, features, and types of databases, about some important databases and some examples, and I told you about BioMart and the Ensembl marts that belong to it. That is more or less it for today. Are there any questions? And welcome to the stream, testers ours. I will stop the recording now, and then we can use the last 20 minutes for questions.