If you're watching this on Twitch live or on YouTube, welcome back for part number two. So, we talked a little bit about databases — what the advantages and disadvantages are — and one of the nice things is that you can also use databases directly from R. One of the packages that I've used in the past to create and use a database directly on my own hard drive is RSQLite. SQLite is one of the most commonly used databases, especially in mobile app development, because it's a very lightweight SQL database, and if you want an overview of all of the commands it supports, you can just look up the SQLite language documentation. As I told you, every SQL engine has its own little dialect, and SQLite is a limited version of SQL: you can do a lot of things, but not everything that the big professional databases, which you have to pay for, provide — and it's a free database engine. You can install it by just using the install.packages() function with "RSQLite", and this will install it into R for you. The way this works is that you first create a database driver: in R you issue a command saying, make a driver, use SQLite as my database driver. You can also connect to other databases, so it's not just for SQLite — instead of SQLite you can also specify MySQL, Oracle, or PostgreSQL, so you can use this database connector to connect to all kinds of other databases. After you've loaded a driver, you have to create a connection. The nice thing is that in R you can effectively also connect to a data frame, so just to a single table that you have loaded in R.
So if you want to run SQL queries on a data frame in R, you can just say dbConnect, use my driver, and make the data frame available as a table — the data frame name is the name of the variable that holds your big matrix — and then, using the connection, you can start querying. You can also connect to, for example, an existing database on your hard drive: you say system.file, in the folder data there's a database called database_name, and then you connect to that .db file, which is just a file on your hard drive. And why is this useful? Because we can now use SQL in R: we can do things like CREATE, INSERT INTO, SELECT ... FROM, UPDATE, and DELETE, so it gives us these commands to work with. There's a lot of information and examples in the RSQLite package — I don't want to go through all of them, but there's a direct link here, so just go to the PDF and click the link, and you get an overview and examples of how you can query your data frame, or query SQLite databases, directly from R. So why would you want to use RSQLite? It avoids some of the complex R commands you'd need for selecting and merging different tables, because I can say: select from table A the entries where the foreign-key constraint to table B is larger than 16. It avoids a lot of the complexity of merging two tables, or two data frames, or two matrices into one in R, which is a little bit cumbersome, because you have to make sure that the row names of the first one match the row names of the second one, or you have to make this selection yourself. The nice thing is, if you know SQL, you can avoid some of this complexity by just using SQL commands. Another advantage is that memory is managed much better: we don't have to hold the full data set in memory.
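To make this concrete, here is a minimal sketch of the pattern just described, assuming the RSQLite package is installed; the table and column names are invented for illustration. (Note that current RSQLite versions connect via dbConnect(RSQLite::SQLite(), ...) rather than the older dbDriver() call mentioned in the lecture.)

```r
# Minimal RSQLite sketch: put a data frame into a SQLite database and query it.
# Assumes install.packages("RSQLite") has been run; names are illustrative.
library(DBI)
library(RSQLite)

# An in-memory database; use a file path instead for a persistent one
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Copy an ordinary R data frame into the database as a table
animals <- data.frame(id      = 1:4,
                      species = c("mouse", "mouse", "goat", "cow"),
                      weight  = c(25, 48, 31000, 650000))
dbWriteTable(con, "animals", animals)

# From here on we can use SQL instead of R subsetting/merging code
light <- dbGetQuery(con, "SELECT species, weight FROM animals WHERE weight < 50")
print(light)

dbDisconnect(con)
```

The same dbConnect()/dbGetQuery() pattern works for the other DBI backends (MySQL, PostgreSQL, ...), which is why only the driver line changes when you switch databases.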
Imagine that I have a data set of six gigabytes: loading that file into R will use up most of the memory you have available, and that is one of the drawbacks, because as I told you in the R lecture, R is very poor at memory management. Everything lives in random access memory, so you're limited by the amount of RAM in your computer. By using RSQLite you can connect to a database which is hundreds of terabytes big — if you have a hard drive that big, of course — and then query it, and R will only load the results of your query into memory. So if you imagine a massive table with millions and millions of entries and hundreds of columns, then by saying "I only want, from my data set, the animals which are older than 10 weeks and which have a weight below 50 grams", it will only pull that subset of data off the hard drive and bring it into R. The database driver is very smart about handling memory, and because you are generally only working with subsets of your data, it is just better not to load the whole thing and wait half an hour for it to load into R. It is also much faster: RSQLite is optimized for running commands on big data sets, so a query that selects rows where certain columns contain certain values is much quicker through SQL than through standard R commands. As for searching a database: there are many ways to search — text-based, sequence-based, motif-based, structure-based, mass-based if you're talking about mass spectrometry — but when you select a database, you always have to check whether it provides a bulk data download or an API.
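The memory argument can be sketched the same way: only the result of the query crosses into R's memory, however large the file on disk is. The file name and column names below are made up for illustration.

```r
# Querying an on-disk SQLite database: only the result set enters R's memory.
library(DBI)
library(RSQLite)

# Connect to an existing SQLite file on disk (path is illustrative)
con <- dbConnect(RSQLite::SQLite(), "data/experiments.sqlite")

# Only the rows matching the WHERE clause are loaded into R;
# the full table stays on the hard drive, however many rows it has
subset_df <- dbGetQuery(con,
  "SELECT * FROM animals WHERE age_weeks > 10 AND weight_g < 50")

dbDisconnect(con)
```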
So, the list of important databases. I think everyone knows PubMed: PubMed is the standard database for scientific literature, and it's run by NCBI, so it's called NCBI PubMed. It contains more or less all of the literature that has been published in the last 25–30 years, so if you want to do a query like "give me all the papers about microRNAs in cows that have been published in the last five years", then PubMed is your place to go. Ensembl we've already seen hundreds of times during the lectures, and this is a DNA database; we also have GenBank, also a DNA database, and the DDBJ, also a DNA database. For proteins we have four main databases: UniProt, TrEMBL, PIR, and the PDB. I think we already looked at the PDB, and I think we also looked at UniProt; TrEMBL we will discuss on a later slide. All of these are protein databases and contain information about proteins and protein structure, so there you can, for example, search for a certain motif and say, "give me all the proteins which have a DNA-binding motif". We have the NDB, the nucleic acid database, which contains a mix of DNA, RNA, microRNAs, and these kinds of things. There are of course many, many other databases: we have PROSITE and Pfam for when we're interested in protein families or protein motifs; we have KEGG and Reactome, which we already discussed in a previous lecture; we have dbSNP, the database which contains single nucleotide polymorphism data; we have dbEST for expressed sequence tags; and we have the OMIM database, Online Mendelian Inheritance in Man.
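PubMed and the other NCBI databases can also be queried programmatically from R. Here is a minimal sketch using the rentrez package — my choice for illustration, the lecture doesn't name a specific package — with an illustrative query; the Boolean operators and the [dp] (date of publication) field tag are standard PubMed search syntax:

```r
# Query NCBI Entrez (here: PubMed) from R via the rentrez package
library(rentrez)

# Boolean operators and field tags work just as on the PubMed website;
# the query itself is only an example
query <- 'DGAT1 AND ("Bos taurus" OR "Bos indicus") AND 2008/01:2011/11[dp]'
hits  <- entrez_search(db = "pubmed", term = query, retmax = 20)

hits$count  # number of matching publications
hits$ids    # PubMed IDs, usable with entrez_summary() / entrez_fetch()
```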
We also have the OMIA database, Online Mendelian Inheritance in Animals; these two databases contain information on Mendelian traits in humans and in animals. And if you're interested in QTLs — we already did the QTL lecture, I think, yes we did — if you want to know which QTLs have been published in, say, the last 12 years in cows, then you can go to the Animal QTL Database. The Animal QTL Database stores information on who found a certain association with a genetic region, how strong the association was, whether the association increases or decreases your phenotype, and it does this for all the different animal species. It doesn't matter whether you're interested in mouse or goats or cows or lizards: your species is bound to be in the Animal QTL Database. PubMed, as I told you, contains scientific literature, as we see here: if we just search for PubMed in Google, we get to the main site, which looks like this. You can select different databases here, because NCBI doesn't only have the PubMed database — it also has databases like Nucleotide, which contains all of the core nucleotide sequences, and it also has protein databases — but PubMed is the most up-to-date scientific literature database. As an overview, NCBI is of course the starting point for biological knowledge and information. (And Ensembl is one of these NCBI data— no, Ensembl is from EMBL.) NCBI is the National Center for Biotechnology Information, so they run PubMed, the scientific literature, but they also have a lot of other databases. If you're interested in things like DNA and RNA, or in genomes and genetic maps, or in proteins and homology between different proteins, then NCBI can help you there. It has a lot of built-in tools as well: if you want to do local sequence alignment, it has a built-in BLAST function, but it can also cluster data for you. And the nice
thing is that it has a very good automated data retrieval system, so you can connect to the database directly from R and download what you want into R using the NCBI connector. NCBI is an Entrez database. In Ensembl, every gene has an Ensembl gene ID — that is their way of coupling all the data to a gene, so their primary key in Ensembl is the Ensembl ID — while in NCBI everything works with Entrez IDs: the same gene has an Ensembl gene ID, but it also has an Entrez ID. The nice thing about all of the databases at NCBI is that you can query them all at once. Let me show you — I can't click it here, so let me open up my Firefox window and go to the NCBI global query (gquery) page. Let me just close the COVID banner, because that's not what we're interested in, but you can see that it has the literature database, and if you scroll down it has genes and proteins and BLAST and all kinds of things — it even has chemicals in there. Hmm, that's not what I want — did they rename it to "search"? I thought it was always called gquery. So it redirects to "search"; they changed the URL, and we'll have to update that in the slides. If I want to know something about, say, PBS7, favorite gene of interest, I can just hit search, and it will show all of the databases, and for each database the number of hits. You can see that in the Bookshelf, which is books, there are 34 books in which PBS7 occurs — well, not books just about PBS7, but books where PBS7 occurs. You can look into PubMed and into PubMed Central: you see there are around 560 papers published about PBS7 as a gene, there are 525 entries in the Gene database, around 1,100 entries in the Protein database, there are also things like clinical trials, and it also occurs in OMIM. So it gives you one
place to search for everything, and you can search for anything you want. You could search for "Google" as well, which of course occurs in scientific literature, although there's no gene called Google. But this is really useful: if you want to know something about a certain gene — for example, you have to write your master project and your master project is about obesity — just go to the search (the gquery search), throw the gene you're researching in there, and you can see what is known. Hits in Genome, BioProject, and BioSample mean that there is sequencing data available for this gene; the Short Read Archive is an archive which contains short-read sequencing data (not whole genomes as such). So it's a really useful search function, and it allows you to very quickly get up to date with all of the different features, literature, and data available for your gene. (It used to be called gquery; it's now just called search.) The nice thing is that when you search, you can use Boolean operators like AND, OR, and NOT, and you can use parentheses to change the priority. For example, I'm interested in DGAT1, one of the genes controlling milk yield in cattle, and I want publications where DGAT1 is mentioned, which also have to mention Bos taurus or Bos indicus, but none of the other Bos species — Bos taurus is roughly the European cow and Bos indicus the Indian cow. So you can build up these very complex queries and get a lot of information from them. It defaults to searching all fields, but they also have a query builder where you can build your own queries, so if we want a range, for example, we can search by date, or molecular weight, or the length of the sequence. So if we wanted to know everything published between January 2008 and November 2011,
then we can say: give me all of the information about DGAT1 in Bos taurus — so only one of the two — within this publication date range, and here we can specify that this range is the publication date. So if I'm looking for a very specific paper, and I know that it was published somewhere in the early 2000s, and I know the last name of the author and roughly the gene they're talking about, then this allows me to very quickly track down the paper I'm looking for. It has popular search limits, and there are different special cases: you can use author names or database IDs. One of the nice things is that you can actually use wildcards and query truncation. A wildcard is a star, so for example, if I want to know everything about DGAT1 and Bos*, that selects Bos taurus, Bos indicus, and all the other Bos organisms. There's a really nice overview on their website listing all of the different fields you can search by — things like publication date, organism, or gene name — so it understands that you're searching for a gene or a certain organism, and you won't be spammed with search results which aren't really useful. The nice thing about NCBI is that it also contains a download section: if you want to download, for example, a complete data set, you can just go to the home page, choose Download/FTP, go directly to the FTP server, and download the whole database to use locally, instead of having to do queries and build up your own Excel file. And of course that's important for a database. Ensembl — we already talked a lot about Ensembl, so we'll just quickly go through it. It's there for comparative genomics; it allows you to look into evolution, sequence variation, and how transcripts are regulated. Ensembl annotates genes, computes multiple
alignments, predicts regulatory function, and collects disease data. Its tools include BLAST, BLAT, and BioMart. The Variant Effect Predictor is a very important tool — HotCoffee955, thank you for following! — so the Variant Effect Predictor is a very interesting tool: you give it a single nucleotide polymorphism, and it will tell you whether the polymorphism modifies an amino acid in a protein or, if not, whether it's located in a regulatory region. So the Variant Effect Predictor is a very, very useful tool if you want to drill down into which single nucleotide polymorphism might be causal for your phenotype. And of course Ensembl provides integration with many external databases. Ensembl is also the home of the ENCODE project. I already told you that after the Human Genome Project, the ENCODE project is the big next step in human genetics, because with just the sequence you don't know anything — you just know the order in which the A's, C's, T's, and G's occur in a certain genome. The ENCODE project is a big, multi-billion-dollar project with many different universities participating, and what they did is build a comprehensive list of the functional elements located in the human genome. (Is there a Linux distribution? Yes, there are many different Linux distributions, not just for bioinformatics but also for other tools; they provide tools so that you can run it in a virtual machine or on a cluster.) The nice thing about the ENCODE project is that they provide a lot of data for free: it has produced genome-wide data for over a hundred different cell types, and they try to annotate the sequence, because the sequence itself doesn't really teach you anything — it is just data, it's not
knowledge, and the ENCODE project is there to provide the knowledge. What they did is look at things like chromatin structure: is chromatin open in this cell type, or is it closed? If you know that chromatin is closed, then you know that all of the genes in that region are not being transcribed anymore. So if you know that a certain gene you're interested in is expressed in a certain set of tissues, and you then see that there's a chromatin modifier nearby, like a histone, you can start reasoning: okay, it might be this histone binding the DNA, curling the DNA up around it, making the gene inaccessible for transcription. They have histone modifications, they have DNA-binding motifs of over a hundred different transcription factors, and they also provide a lot of free RNA transcription data — how highly your gene is expressed in a certain tissue — which they measured using RNA-seq and CAGE. The ENCODE project lives at Ensembl, so if you are interested in the regulation or expression of genes and how it differs between tissue types, do take a look. One of the databases that I use a lot, and talk about a lot as well, is dbSNP, the single nucleotide polymorphism database, and it is for humans. It didn't used to be human-only — dbSNP would accept any species — but about four years ago they said, no, this is too much information, we want to focus purely on humans, so now it's only single nucleotide polymorphisms in humans. They contain not only single nucleotide polymorphisms but also microsatellites, small insertions, and small deletions; they provide the publication which found the polymorphism, but they also give you things like the frequency within a population. So if you're interested in a certain SNP — is it more common in Asian people or in European people? — then db
SNP can tell you that. And of course they have other information as well: they have information about common variants and clinical mutations, which is really nice if you are interested in certain diseases and in which SNPs are known to cause variation in clinical disease. If you look a little closer at dbSNP, they have two types of IDs — their database is more or less split in two — with primary keys for ss IDs and primary keys for rs IDs. When you submit your information to dbSNP, you get an ss ID for the mutation you found; this is called a submitted SNP. Once they have reviewed the data and made sure that the SNP is really true — once other people have found it too — the ss ID is transformed into an rs ID. This means a reference SNP: the SNP is real, it has been seen or sequenced by multiple people. And as I told you, since 2018 all other species are no longer stored in dbSNP; they were taken over by the European Variation Archive. So if you want dbSNP-style data for cows, you go to the EVA, the European Variation Archive — EVA covers all species besides humans. For humans you have dbSNP; for cows and goats you have EVA. The PDB we already talked about, I think in a previous lecture: the PDB is the protein structure database, it contains information about protein structure and function, and the nice thing is that it provides three different visualizers. If you want to look at protein structure, you can use the NGL, JSmol, and PV viewers. The PV viewer is kind of the newest one; it uses WebGL and modern web browsers, so generally the PV viewer is the default currently, I think, but there are other ways of visualizing, and each of these viewers has its own advantages and disadvantages. Just be aware that the PDB is the database for protein information and protein structure, and you can view the
structures online, even on your mobile phone, and look at how, for example, certain mutations might change the structure of a protein. The PDB also provides sequence and structure alignments: you can do pairwise sequence alignments — BLAST two-sequence, or Needleman-Wunsch and Smith-Waterman pairwise alignments — but you can also do structure alignments. A structure alignment means you're not aligning an amino acid sequence against a similar amino acid sequence; you're looking at structure, so as long as the shape of the protein, or of the part of the protein you're interested in, matches your query, it will give you results. They also look at things like protein symmetry — whether a protein is symmetrical and how many symmetry axes it has — which can be important in some cases. And the nice thing is that it also provides quality measurements of the protein structure: it will tell you how well a structure has been determined, whether it's reliable down to 5 ångströms or only to 10 ångströms. That's something the PDB will tell you. One of the other protein sequence databases is UniProt. UniProt is a very big database, and it's divided into two different databases — it's actually a combination of two. The arrows on the slide point down, but they should actually point up, because both Swiss-Prot and TrEMBL feed into the UniProt knowledge base. So UniProt is another of these protein databases containing protein information. If we look at the UniProt database, it has manually annotated and reviewed records — that is one part of the database, the Swiss-Prot part. Swiss-Prot is a manually annotated database, which means that all of the data in it is very high quality: if something is in there, you know for certain that it's true, because a professional, an expert in the field, looked at the
data. When I last looked at the database, in 2019, there were around 560,000 proteins in it, which is not a lot: we know that a human has about 20,000 genes, and every gene produces around five different proteins, so a standard human would have somewhere between 100,000 and 200,000 proteins. But of course there are many different species, so 560,000 protein entries is like having the whole human proteome for about five species. It is updated every month, and again it has many integrated tools, so you can use them directly from the website or just download the data. You can see the growth of the database: a lot of annotation was added from 2005 to around 2010, and it's not growing that quickly anymore. The other part is TrEMBL. Because human curation is very expensive — you have to pay people to look at the data; it's a full-time job going through these databases and manually curating the errors that are in there — they also have TrEMBL. TrEMBL takes the coding sequence and translates it into protein sequence: it takes the big genomes that are out there, looks to see if it can find a start codon, and then says, well, okay, I found a start codon, I'm going to derive a protein sequence from this piece of DNA. It is computer-annotated, which means it is not reviewed; it's also updated monthly, and in 2019 you can see that it had almost 180 million sequences. This is the big difference between databases that are human-curated and those that are computer-annotated. I told you at the beginning that I would show you an example of where computer annotation can go completely wrong, and that is exactly what happened at the start of 2015, when people figured out that the algorithm used for this CDS-to-protein translation was not entirely
correct. This comes back to the amino acid wheel I showed you, when I told you to be aware that the universal genetic code is not as universal as you think. The same thing happened to TrEMBL: they assumed that the genetic code was the same for every species in their database, which led them to massively over-produce protein predictions, because the translation did not take species-specific details into account. A start codon codes for a methionine, but this methionine can be encoded differently in different species. They were essentially using the human annotation — an algorithm based on what we know from humans — and then using the same code to annotate extremophiles, leading to a massive number of wrong protein sequences in their database, which they then had to correct. That's what you see here very clearly: at a certain point they had to go into the database and manually delete almost half of the data, because half of it was complete nonsense — a prediction by the computer, and the computer just predicted it wrongly. And this is why, if you write papers, you should always try to use databases which are human-curated. Good. The difficult thing is, it's 2:42, and I was actually hoping to cover all of this in the first hour and do the next part in the second hour, but we're already 40 minutes into the second hour. That's always the difficult part about getting back after the holidays: you always overestimate the speed at which you do things. I'm just going to continue for 15 minutes — I think I can get through it — and then in the third hour you'll watch me do my goat analysis and use biomaRt. So, biomaRt: I told you that I'm really, really excited about this tool; it's one of those tools in bioinformatics that you can't really do without. The nice thing about bio
maRt is that if I need my data in R, I can do three things. First, I can manually search through databases and create an Excel file. Let me show you an example of where I actually did that — since I'm a little bit out of schedule anyway. When the pandemic started at the beginning of 2020, I was very interested in COVID, so I did the thing that I do as a bioinformatician: I started looking through the Ensembl database to see what was known about coronaviruses. Let me see where the file is... Excel, Excel... that's not the one I wanted to show you, that's the registrations for the R course in the summer semester... there it is: XLS, COVID-19. All right, let me add a window capture — I have to open it up first — there we go. Because as a bioinformatician, when something like a pandemic happens, of course you want to know more about it. So the first thing I did was make a big sheet: I just went through the database and searched for "coronavirus", and what I was interested in was trying to figure out whether it is a naturally occurring virus, or whether the rumors going around online — that it was made in a lab, that it contains part of HIV, that it's going to kill us all — were true. So already in December — 2020, I think... no, December 2019, the first time I heard of the virus — I started scouring the Ensembl database for all of the different coronaviruses which were out there. And you can do this, but you can see that this data is relatively incomplete; at that point in time there were around a hundred and sixty-one sequenced coronaviruses. So I just went through the database and made my
little Excel file: this is the name, the short name, the species in which it was found, the country where it was found, the year, the type, the subgenus — because there are many different types of coronaviruses — and then the origin: where did it come from, who did the sequencing, is there any publication attached to it, and finally the genome, i.e. the identifier for the genome. Then I loaded this into R and downloaded all of the genomes automatically using biomaRt. But I could have saved myself all of this trouble, because one of the things about doing stuff in Excel is that Excel will screw you over: if I write OCT1 — a very well-known gene — it will do this, it will turn it into "1st of October". There really are publications out there that talk about the gene "1-Oct" while the gene is actually called OCT1, so Excel is not the perfect tool for gathering data — just as an example. And this took a lot of work: searching through the database, clicking the links, creating your own Excel file — it is manual slave labor, in a way. Just collecting all of this data took me a couple of days of clicking and copy-pasting, and it's very error-prone, so I had to go through the list again and make sure everything matched, because sometimes you would paste something in and it would change the format. The second way of doing it is to make sure that the database you're searching actually has an FTP or SFTP download, so that you can bulk-download all of the data you want — then there's less chance for errors. The big issue there is that different servers may have different data formats, so instead of having one script which can load all types of data, you have to write adapters which make
sure the data you downloaded is put into the same format, so you can compare between different databases. The third thing you can do is use biomaRt, and biomaRt is awesome because it retrieves data directly into R and there's no chance for errors: you don't have to deal with different data formats, because you can specify the data format you want, and biomaRt will give you the data in that format, which is really, really useful. So what is BioMart? It's a community-driven project to provide unified access to distributed research data, to facilitate the scientific discovery process. And it's not just for Ensembl — I often say "you connect to Ensembl", but you can connect to almost all biologically relevant databases; BioMart also allows you to search PubMed, download literature, and do literature research. And it's not just for the R programming language: you can also use it from Perl, from Python, via SOAP and REST and XML — many of the big programming languages provide a BioMart package for you. So if you're a Perl programmer, you can use BioMart; if you're a Python programmer, you can use it too. BioMart is a very interesting tool, and it works so well because it has very simple concepts. It has the concept of a mart — like a supermarket — and that is the information provider: Ensembl is a mart, NCBI is a mart, KEGG is a mart, Reactome is a mart. What a mart does is provide things called data sets — not really tables, because they can span multiple tables, but data sets are something which a mart provides. Then we have attributes, which are the information columns that we want to retrieve from the data set: the data set might span a hundred different tables, and all of these different columns
are columns which you can query. So if you look at the attributes, those are the things that we want. But of course, to retrieve data we also have to tell the data provider, the Mart, what we are going to give it, and those are called filters. A filter is something that you provide to allow the database to understand what you want. A filter can be the Ensembl gene ID, so then I'm telling the database: I am going to give you Ensembl gene IDs. But a filter can also be an Entrez ID, or a genomic location, and there are literally hundreds of filters. The attributes and the filters and the datasets are unique to every Mart: Ensembl provides different datasets than NCBI, and KEGG also provides different datasets than NCBI, but the filters are different as well, so Ensembl might allow you to filter based on the Ensembl gene ID while NCBI might allow you to filter based on the Entrez ID, because they just use a different ID. And then there is the concept of values, and the values are the things that we are querying for. So I can say: from the dataset mouse, give me the attributes, for example we want to know the chromosome and the position of a certain gene; the filter is the Ensembl gene ID; and then the value is the Ensembl gene ID of the gene that I'm interested in. All right, so let's do a very quick example, I hope that we can run through it, and we can make the lecture a little bit longer instead of doing a break exactly after an hour. Let's say that we want to investigate mouse chromosome 3 from 15 to 45 megabases. We want to know certain things: we want to know something about the genes, of course, so we want to know the location of the genes, the number of exons, and for example the number of functional SNPs, so SNPs which are located in the exons of the genes in this region. So how do we do this? Well, we start by creating a new script and
as always when we create a new script, we make a new header, right. So we write down: well, this is the analysis of mouse chromosome 3; I give a copyright statement, so I mention that I made it, where I'm working and which group I'm working for, when it was first written and when it was last modified. This is information that's just there to remind yourself, in like ten years, why you made this script. We start by loading the biomaRt library; of course we have to install it first, and since biomaRt is a Bioconductor package that's done with BiocManager::install("biomaRt") rather than plain install.packages. After we've installed it we can just make the library active, and then my scripts generally continue with a setwd() to where I have stored my data; in this case we're not using data from disk, but we have to be somewhere on the hard drive anyway. All right, so now we need to connect to a mart, so let's first connect to Ensembl. If I've loaded the biomaRt library then I have this useMart() function, which allows me to connect to a data provider, so I'm just saying: connect to "ensembl", and store the connection into something called mart. Then we can use listDatasets() to see which datasets are provided by this mart, because initially I have no idea what Ensembl calls their mouse dataset. We can also search via a pattern, so we can say searchDatasets(mart = mart, pattern = ...) and search for "mouse" or "Homo sapiens" or whatever you want; it's just a pattern, and you can use wildcards as well, so you can also say "mo*" and it will give you back all of the datasets with an m and an o and then anything behind it. And of course the nice thing about biomaRt is that we also want our research to be reproducible, right, because the Ensembl... crap, see, it did it again, it did it again, I hate when it does this, where is this slide, here, let's delete the 'e', so horrible, the autocorrect always screws me over with Ensembl.
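Put together, the opening of such a script might look roughly like this. This is a minimal sketch of what I mean, assuming the Bioconductor biomaRt package is installed; the header fields are of course placeholders:

```r
# Analysis of mouse chromosome 3
# Author:         <your name, group, institute>
# First written:  <date>
# Last modified:  <date>

library(biomaRt)

# Connect to the Ensembl data provider and keep the connection object
mart <- useMart("ensembl")

# Which datasets does this mart provide? (the list is long)
datasets <- listDatasets(mart)
head(datasets)

# Or search by pattern instead of scrolling through the whole list
searchDatasets(mart = mart, pattern = "musculus")
```

Running searchDatasets() needs a live connection to Ensembl, so the exact output depends on the current release.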
So the Ensembl database is continuously updating, right, and like I told you guys, I made my SARS sheet almost two years back at this point, and of course the database has been updated many, many times since. So if I want to retrieve data now and redo my analysis, I have to go to one of the archives, and there's a function called listEnsemblArchives() which will go back in time: in this case you can go back all the way to 2009 and do the analysis as if it were 2009, and this is really, really useful because it allows for reproducible research, which is what we want to do. So when you use biomaRt, connect to the mart, but make sure that you know which version you are currently using, and then write this version down inside of your script as a comment, saying that this script is using Ensembl version 103, which means February 2021. If we then do listDatasets(), so instead of looking at the archives we list the datasets which are available in Ensembl, we get this massive, massive list of, I think, almost a thousand different species. So it allows us to say: well, I want to connect to ggallus_gene_ensembl, which is the chicken genes, or I want to go to mdomestica_gene_ensembl, which is the opossum genes; it allows you to select which dataset you want to work on. When we then want to query, we can use the getBM() function: so we have useMart() to connect to a data provider, listDatasets() to show you which datasets are there, and then getBM() to retrieve my data, and if I want to retrieve my data I have to specify my attributes, my filters and my values, and I also have to give it the mart connection object. We can also retrieve sequences, not just genes and gene IDs: we can for example use the getSequence() function to retrieve any type of sequence, so we can retrieve DNA and protein sequences in R directly from Ensembl as well.
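As a sketch of that version pinning: in recent biomaRt releases useEnsembl() takes a version argument, so you can pin the script to one release. The exact biomart = "genes" spelling is an assumption on my part (it is what newer biomaRt versions use; older ones used "ensembl"):

```r
library(biomaRt)

# Show the available archived Ensembl releases and their URLs
listEnsemblArchives()

# Pin the whole analysis to one release for reproducibility
# (Ensembl version 103 corresponds to February 2021)
mart <- useEnsembl(biomart = "genes",
                   dataset = "mmusculus_gene_ensembl",
                   version = 103)
```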
For example, if we want to retrieve the cDNA sequences of the genes on chromosome 1 in mouse between 1 Mb and 6 Mb, we can do something like this: we say library(biomaRt), we use useMart("ensembl") with the dataset mmusculus_gene_ensembl, which we got from that big list where we looked at all of the different datasets that are available, and then I can just say getSequence(): chromosome 1, from 1 Mb to 6 Mb, the type is the annotation, in this case give me the Ensembl gene ID, the seqType is cDNA, and then use the mart that we made. And if I want to get the peptide sequences, I can just say seqType = "peptide". So, back to the example: we want to know something about mice, about chromosome 3, about the genes which are located between, was it 10 or 15... 15 Mb to 45 Mb, so let's go and do that. Again we have to connect to the mouse database, so we say useMart("ensembl", dataset = "mmusculus_gene_ensembl") and we store this connection in mart. Then of course we need to know which attributes and filters are available, because every mart provides its own filters and its own attributes, so we have the listAttributes() and listFilters() functions, which we can call on the connection object, and these will give us a long list of all of the filters and all of the attributes that are available. Generally when I do this, because there are thousands of attributes available in Ensembl, I only show the first 20 to see if what I need is in the top, because the most used attributes are returned first by the function: they don't put it in alphabetical order, they put it in the order of most used, so if you want something it's generally in the first 20 or 50. So you can just say listAttributes() of this mart and then subset the result, taking the first 20 rows; you could also take the first 50 rows.
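The sequence-retrieval example from a moment ago could be written out like this. A sketch: the lowercase "cdna" and "peptide" are, as far as I know, the seqType spellings getSequence() expects:

```r
library(biomaRt)
mart <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")

# cDNA sequences of the genes on mouse chromosome 1 between 1 Mb and 6 Mb
cdna <- getSequence(chromosome = 1, start = 1e6, end = 6e6,
                    type = "ensembl_gene_id", seqType = "cdna",
                    mart = mart)

# Same region, but the peptide sequences instead
pep <- getSequence(chromosome = 1, start = 1e6, end = 6e6,
                   type = "ensembl_gene_id", seqType = "peptide",
                   mart = mart)
```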
But then you can kind of look to see what's there. If you know what you want, you can also search again: there's a searchAttributes() function which allows you to just input a search term, so "ensembl_gene_id" might be a search term you want to search for, it might be one of these attributes that you want to retrieve, or "chromosome", and then there's probably something like a chromosome position or chromosome start or chromosome end. If you do listAttributes() then it looks like this, and of course there's listFilters() as well to retrieve the filters. So if we list the first ten attributes, you can see that the things we can retrieve from the database are things like the Ensembl gene ID, the transcript, the peptide, the exon, the description, the chromosome name, the start position and end position, the strand, and the band, which we never actually talked about and which is not that important. We can also list the filters: listAttributes() is what can I get from the database, and listFilters() is what can I give the database that it will understand. In this case it will understand things like chromosome name, band start and band end, marker start and marker end, but also chromosomal region. In our example we want to get all of the genes located on chromosome 3 from 15 megabases to 45 megabases, so the chromosomal region is the filter that we want to use, because we want to get all of the genes in a region. The attribute that we want to retrieve is the Ensembl gene ID, because we want to know which genes are there, and the filter that we're going to use is chromosomal_region. So we define our filter, and the value that goes into our filter is the region that we're interested in: chromosome 3, from 15 megabases to 45 megabases, on the positive strand. And we want to retrieve the gene ID, so I say that my attributes are just the Ensembl gene ID.
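Inspecting what the mouse dataset offers might look like this sketch; the search patterns here are just examples of terms you might try:

```r
library(biomaRt)
mart <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")

# Most-used attributes and filters come first, so the top rows usually suffice
listAttributes(mart)[1:20, ]
listFilters(mart)[1:10, ]

# Or search directly when you roughly know what you're after
searchAttributes(mart = mart, pattern = "mgi")
searchFilters(mart = mart, pattern = "chromosom")
```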
Those are the ones that I want: this is my value, this is my attribute, and of course I also have to specify my filter. So I can now say getBM(): get the attributes that I requested, which is just the Ensembl gene ID, use this filter, and the values are the region that I put in. I can do this for multiple regions in one go; I'm just specifying one region now, but if I make a c(...) with commas I can add 10, 20, 40, 100 regions and retrieve them all in one call. So that's what I do: I call getBM() and I get back a data frame which contains the data that I asked for, and when I look at the number of rows it tells me that there are 211 genes in the positive direction on the DNA in our region. That's the list that we get, but we actually want to have more info, right: we don't just want the Ensembl gene ID, that's very limited, because then we don't know where the gene starts and where it ends. So in this case we might also want the gene name, which for mouse is called the MGI symbol, the description of the gene, the chromosome name, the start position and the end position. So instead of retrieving one attribute I'm now going to retrieve multiple attributes; nothing changes in the call, I just update my attributes variable, call getBM() again, give it the new attributes that I want to retrieve, give it the filter and the values, as well as the mart that I want to search in. And then when I look, I get back something which looks like this: it tells me that at 24 megabases there is a gene called Gm24704, and this is a predicted gene, which is what the Gm stands for, and then we also have microRNAs, which are not really genes. So of course we might want to focus a little bit, because we don't really care about the microRNAs or the predicted genes.
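A sketch of those two queries. The "chromosome:start:end:strand" string is the standard Ensembl BioMart value format for the chromosomal_region filter; the counts in the comments are the ones from the lecture's run, so your numbers will differ on a newer release:

```r
library(biomaRt)
mart <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")

# Chromosome 3, 15 Mb to 45 Mb, positive strand
region <- "3:15000000:45000000:1"

# First pass: only the Ensembl gene IDs in that region
ids <- getBM(attributes = "ensembl_gene_id",
             filters    = "chromosomal_region",
             values     = region,
             mart       = mart)
nrow(ids)  # 211 in the run shown in the lecture

# Second pass: same filter and values, more attributes
genes <- getBM(attributes = c("ensembl_gene_id", "mgi_symbol", "description",
                              "chromosome_name", "start_position",
                              "end_position"),
               filters    = "chromosomal_region",
               values     = region,
               mart       = mart)
head(genes)
```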
We might care about them, but imagine that we only cared about protein-coding genes; we can do that as well. If I were only interested in protein-coding genes, there are two options at this point. Option one: we add the biotype as an attribute column and filter ourselves, because the biotype will tell me if it's a microRNA, a protein-coding gene, a long non-coding RNA, and these kinds of things. But filtering in R ourselves is not really the goal, because we are using a database: we want the database to do this for us, we want it to go into all of the data that's there and just hand us the subset. So option two: how do we make our filters more complex, so that we only retrieve the data that we need, how do we make biomaRt do the gene biotype filtering for us? In this case we're going to define a new filter: instead of only chromosomal_region, I'm also going to give a biotype. Then we have to define our values, because my region didn't change, I'm still interested in the same region, but now I have to make a list. I have two filters here, but I could also have five or ten filters, and I can retrieve multiple regions as well, so each filter can take a vector of values: my region could be a vector, and the biotype could be a vector as well, if I were interested in protein-coding genes and microRNAs, say. So what I do is specify my new filter, which is just a vector, but for the values in R I have to make a list: the first entry of the list refers to the chromosomal region, in this case we only have one, and the second entry refers to the biotype, where I want protein_coding. So again we query, giving the attributes, the filters that we want, and then the values, and we end up with a new gene ID list. And then, how many protein-coding genes are in our region? Well, there are only 58.
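With two filters, the values argument becomes a list with one element per filter, in the same order. A sketch; again, the 58 is the lecture's number, not a guaranteed result:

```r
library(biomaRt)
mart <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")

genes_pc <- getBM(attributes = c("ensembl_gene_id", "mgi_symbol",
                                 "start_position", "end_position"),
                  filters    = c("chromosomal_region", "biotype"),
                  # list element 1 belongs to chromosomal_region,
                  # element 2 to biotype; each could itself be a vector
                  values     = list("3:15000000:45000000:1",
                                    "protein_coding"),
                  mart       = mart)
nrow(genes_pc)  # 58 in the run shown in the lecture
```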
That's kind of what I wanted to know, because I'm interested in the number of protein-coding genes, although I might be interested in the number of microRNAs as well. This is how we can use biomaRt to use someone else's computers, right: my computer is not doing anything, it's just doing a query, and I'm only getting back the data that I'm interested in. We started off by retrieving 211 entries and now we're only retrieving 58, which is really good, because it's quicker and it's better for R as well. In the next step, of course, we want to do things like retrieving exons, and here too biomaRt can do this for us: it can retrieve a list of all of the exons, and also the SNPs inside of the exons. In this case we can just use a for loop and go through the different genes one by one, use the Ensembl gene ID as the filter, and then query, per gene, the information that we want, for example the exons per gene and their start and stop locations. So I'm going to define a new set of attributes, a variable I call gene_attributes: for each of the genes in the resulting data set, so for all 58 genes that I have, I want to get back the Ensembl gene ID, the exon ID, where the exon starts and where the exon ends. And then I just use a for loop: I take my original results that I got from biomaRt and I say, for each gene in the ensembl_gene_id column of this data frame, again call getBM() with the gene_attributes, so for each gene get the Ensembl gene ID, the exon ID, the exon chromosome start and the exon chromosome end. In this case I'm not going to store any of it, I'm just going to print it to the screen, and when I run this I see that this gene here has 13 exons, the next gene has 16 exons, the next has eight, and so on and so on. So I can repeatedly use calls to biomaRt to just get
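That per-gene loop might look like this sketch, rebuilding the protein-coding gene list from before and then querying once per gene:

```r
library(biomaRt)
mart <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")

# Protein-coding genes in the region, as before
genes_pc <- getBM(attributes = "ensembl_gene_id",
                  filters    = c("chromosomal_region", "biotype"),
                  values     = list("3:15000000:45000000:1", "protein_coding"),
                  mart       = mart)

# For every gene, fetch its exons and their coordinates
gene_attributes <- c("ensembl_gene_id", "ensembl_exon_id",
                     "exon_chrom_start", "exon_chrom_end")

for (gene in genes_pc$ensembl_gene_id) {
  exons <- getBM(attributes = gene_attributes,
                 filters    = "ensembl_gene_id",
                 values     = gene,
                 mart       = mart)
  cat(gene, "has", nrow(exons), "exons\n")
}
```

One query per gene is fine for 58 genes, though you could also pass the whole ID vector as values in a single call and split the result afterwards.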
the information that I want. Good, and then the final question, of course: the SNPs, so find the single nucleotide polymorphisms located in these exons. They are unfortunately in another mart: they are not in the Ensembl genes mart, they are in the SNP mart, so here we have to connect to a different database to retrieve the SNPs. We have to create a new connection to a different data provider, and in this case we would use the Ensembl SNP mart, ENSEMBL_MART_SNP, with the mmusculus_snp dataset; Ensembl provides the interface to EVA, the European Variation Archive, here, because dbSNP nowadays only covers human. But anyway, this is the way that you would build up these kinds of queries and download the data that you need and look into it. Good, so when I connect to the SNP mart I can again list the attributes and list the filters, and there is no filter for the Ensembl gene ID in this case, so I need to use the exon start and end positions: I need to query all of the SNPs by chromosomal region, using the biomaRt filters, and supply the exon start and end positions. Then we could do more filtering based on the consequence or the SIFT score, but that is the exercise for today, for the assignments. Good, so I went a little bit over time, well, not over time, because I only took two hours of the whole four-hour lecture. I told you about databases, right: why we use databases, because they are more efficient, they provide checks on our data, they can warn us when we're trying to input data which we shouldn't input. I told you about database organization, that you can have databases in different normal forms, and that this depends on how fine-grained your data is pulled apart, the granularity of the data. I told you about different features, different types of databases, and some of the important databases that are out there, with some examples, like we quickly looked again at NCBI and these kinds of things.
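As a starting point for that SNP assignment, the variation-mart connection described above might start like this. A sketch: the attribute names are the usual ones in the Ensembl SNP mart, but do check listAttributes() yourself, and the exon coordinates below are hypothetical placeholders, not real values from the analysis:

```r
library(biomaRt)

# SNPs live in a different mart: Ensembl's variation mart
# (EVA-backed for mouse, since dbSNP itself only covers human)
snp_mart <- useMart("ENSEMBL_MART_SNP", dataset = "mmusculus_snp")

# Check what this mart understands; there is no ensembl_gene_id filter here
listFilters(snp_mart)[1:10, ]

# So query by region instead, e.g. one exon's start/end coordinates
# (the coordinates below are hypothetical placeholders)
snps <- getBM(attributes = c("refsnp_id", "chrom_start", "allele",
                             "consequence_type_tv"),
              filters    = "chromosomal_region",
              values     = "3:15100000:15101000",
              mart       = snp_mart)
```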
Furthermore, I told you about biomaRt, and I gave you a very small and short example, which we will continue after the break. After the break I will work on my abstract, well, not my abstract, but one from a colleague of mine who is very interested in African goats, so I will just show you guys what I've been doing in the last two to three days since I'm back from holiday, to figure out, or get an idea, whether we can use goats from Africa to learn something about the milk yield and the meat production of these goats. That's going to be a little bit freehand, but I think it's a nice example, because I'm at the point where I have to use biomaRt myself, for retrieving genes from regions that we defined, or that we found to be interesting, for meat and milk production, so I just want to show you how I use biomaRt on a day-to-day basis. So there will be a part three of the lecture, but for information relevant to the exam, this is more or less the end of the lecture. If you think you don't want to watch me work, or if you're thinking, screw that, I can figure out the biomaRt thing myself, I'm just going to sit down behind the computer and do some biomaRt hacking, then that's fine. I will upload the assignments to Moodle again, and also to my website, so you can get them there. But for now: are there any questions? If not, then we do a short break. And for the people watching this later on YouTube, I will say goodbye, and see you in the next video.