Recording. So welcome back everyone. I hope you enjoyed the GIFs; they were a little bit misaligned, unfortunately.

So why do you want to create a database in R? We'll have a slide about that, but in R you can use databases, and sometimes it's a good idea to do so. The RSQLite package gives you SQLite, which is a limited version of SQL. It supports a limited number of SQL commands, so not the full language, but the language support is good enough to be usable and very handy sometimes in R. If you want to know all the supported commands, there's a link where you can read the language definition. And if you want to use the package then of course you have to install it first: you use the install.packages command in R and specify that you want to install the RSQLite package.

The way that this works in R is that you first have to create a database driver. This DB driver provides a way not just to connect to an SQLite database; you can also connect to other database sources like MySQL, Oracle or PostgreSQL. The driver gives you a kind of connection object in R: you get a variable, and using this variable you can then query the database. So how do you create one of these connections?
Well, the package comes with the dbConnect function. For example, you can work directly with a data frame which you have loaded in R. I do a read.table and read a table into a data frame, and then I use dbConnect with the driver, so I say the driver is SQLite, or I say the driver is MySQL. You get a connection, the data frame goes into that database as a table (dbWriteTable does that), and then on this connection to the database you can issue SQL queries. That means that if you are very good in SQL this will help you a lot, because then you don't have to write R: you don't have to do for loops and these kinds of things to select your data or to make subsets, you can use the SQL language directly.

If you want to connect to an existing SQLite database which is on your hard drive, then you call dbConnect, use the SQLite driver or another driver, and say the dbname is system.file, "data", and then the database name. Here I assume that you have a folder called data, and in this folder you have your database file. You just specify that and you get a connection, and this connection allows you to query directly through this database file.

This has certain advantages, and one of them is that you can use the SQL language directly from R. So you can say create a new table or create an object, insert into, select from, update in this database or delete from this database. There's a whole bunch of things and a whole bunch of examples; follow this link and you get an overview of what you can do and how you can do it. But if you know SQL then of course it's much easier to use SQL directly from R than to learn R extensively to do things like subsetting your data. So what are the advantages?
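Before going into the advantages, here is a minimal sketch of the connect, query and modify cycle just described. The lecture does this in R with RSQLite; as a stand-in, this uses Python's built-in sqlite3 module, which talks to the same SQLite engine. The table name, column names and values are made up for illustration, and the in-memory database is an assumption to keep the sketch free of files.

```python
import sqlite3

# ":memory:" keeps the sketch self-contained; pass a file path instead
# to open an SQLite database that lives on disk.
con = sqlite3.connect(":memory:")

# Hypothetical table standing in for a data frame loaded with read.table.
con.execute(
    "CREATE TABLE animals (id INTEGER PRIMARY KEY, species TEXT, weight REAL)"
)
con.executemany(
    "INSERT INTO animals (species, weight) VALUES (?, ?)",
    [("cow", 650.0), ("mouse", 0.025), ("chicken", 2.1), ("cow", 710.5)],
)
con.commit()

# SELECT with a WHERE clause: only the matching rows come back to the program.
heavy = con.execute(
    "SELECT species, weight FROM animals WHERE weight > ? ORDER BY weight",
    (100,),
).fetchall()
print(heavy)

# UPDATE and DELETE go through the same connection.
con.execute("UPDATE animals SET weight = weight + 1 WHERE species = 'mouse'")
con.execute("DELETE FROM animals WHERE species = 'chicken'")
remaining = con.execute("SELECT COUNT(*) FROM animals").fetchone()[0]
print(remaining)

con.close()
```

The subsetting happens inside the SELECT statement, so no for loop over the rows is needed on the program side.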
Well, if you know SQL you avoid some of the complexities of the R commands for selecting and merging different tables. And one of the main advantages of doing that is that the memory management is much better. When we had the R lecture I told you guys that everything in R is in the memory of your computer. So if you have a table or a matrix with a thousand columns and five million rows, then when you load this table into R it will all be loaded into the R memory. That is five billion values, so stored as numbers it will take up around 40 gigabytes of memory, and of course if your PC doesn't have this amount of memory then it will just fail with an out-of-memory error.

The nice thing about using this SQL package in R is that nothing gets loaded into memory up front. Everything just stays on your disk. So if your disk is one terabyte big, then you can have one terabyte of data on your hard drive, and you can use SQL to query the stuff that you want directly from this massive table or this massive database. There's no need to load the full data before subsetting: you select subsets of your data using SQL, and then only the subset gets transferred from the file into the R memory. So the database driver is much smarter in handling the available random access memory than R is.

The same holds when you are querying multiple tables. Imagine that you have five or six different tables which are all linked using primary and foreign key constraints. When you merge these tables into one, it is much faster to do that using the database driver, because the database driver is written by very smart people who optimize much more than what
R can do, because R is really there as a general purpose language, while SQL is focused on performance: doing queries as quickly as possible and being able to merge things together much more easily. So there are many advantages of using the RSQLite package in R. One of them is memory management, and the other one is that you can merge tables very easily and quickly using the database driver, which is optimized for these kinds of queries and these kinds of strategies.

All right. So when you search a database, then of course, based on which kind of database you are working with, you can have different ways of searching. The standard way of searching is more or less like a Google search, where you just input a search text and you get back things which match your search text or are very close to it. But when you are looking at biological databases there are many different ways that you can search.

One of the ways that you might want to search is based on sequence. I might have a piece of DNA sequence and I might want to say: give me all the sequences in the database which are very similar to the DNA sequence that I have here.

Some databases also allow you to search motif based, which means that you can define a motif. You say that at the first position there should be an A, but there can also be a C; at the second position I require that there is always a T; at the third position I don't really care, so it can be an A, a C, a T or a G. So a motif based search allows you to specify what kind of sequence you want, and many motif searches allow you to search through the database using either a DNA motif or a protein motif.

Some of the databases out there which allow you to search through proteins also allow you to search structure based. So it allows you to say: well, I want to have a protein which has two alpha
helices followed by a beta sheet, followed by, for example, a cysteine residue which can make disulfide links to other proteins. So protein databases often allow you to search based on primary and secondary protein structure: give me a protein which, like I said, has two alpha helices, a beta sheet and then a cysteine.

One of the databases that we already saw is the one where you can use mass spectrometry data. The mass spectrometry databases are of course searchable using a mass based system, so mass over charge. You can say: I've measured this peak, and this peak is at 19.337 mass over charge units. What is this?

And of course when searching a database: do they provide a bulk data download, and do they provide an API? These are questions that you always have to keep in the back of your mind, because you don't want to have to click and type and click and type to download all the data and then put it in an Excel sheet while you are doing that. That is very error-prone. Using bulk data retrieval or using an API will be much better, because those don't really have an error rate, while a human copy-pasting things online will have a relatively high error rate.

All right, so some important databases that are out there.
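The motif-based search described a moment ago (an A or a C, then always a T, then any base) maps quite directly onto a regular expression. This is only a sketch of the idea using Python's standard re module; real databases use their own motif syntax and far faster indexes, and the example sequence is made up.

```python
import re

# Motif from the lecture: position 1 is A or C, position 2 is always T,
# position 3 can be any base. The lookahead lets overlapping hits be reported.
MOTIF = re.compile(r"(?=([AC]T[ACGT]))")

def find_motif(sequence):
    """Return (0-based start, matched text) for every motif hit in the sequence."""
    return [(m.start(), m.group(1)) for m in MOTIF.finditer(sequence.upper())]

hits = find_motif("GATTCTGACTA")
print(hits)  # [(1, 'ATT'), (4, 'CTG'), (8, 'CTA')]
```

A protein motif would work the same way, with amino acid letters in the character classes instead of bases.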
I just listed them here. One of the main databases that I think is very important for biology, and for biological research in general, so not just biology but also things like medicine, is of course PubMed. I think everyone knows the database or should know it, and if you don't then I would definitely, well, not force you, but advise you to look up PubMed. PubMed is the entry point for scientific literature. So if you want to know what people have published about the BRCA2 gene in the last 10 years, then PubMed is your entry point. You can get an overview of peer-reviewed scientific publications which have been published in the last 25 years or so, and that's what's stored in PubMed.

The EMBL Ensembl website stores DNA. We've already used it a lot of times, and it is built up on DNA sequence: every animal has its own reference DNA sequence, and all of the data in the Ensembl database is linked to this DNA sequence. A similar database is GenBank. GenBank is also a DNA database and is structured in the same way. And you have the DDBJ, which is another DNA database. All of these contain a lot of data which is overlapping. So the mouse genome sequence is in Ensembl.
It's in GenBank, but it's also in the DDBJ. But of course each of these databases has its own specialty, its own reasons for why you want to use it, and its own unique selling feature.

Protein databases are also many, many, many. The most important ones are UniProt, TrEMBL, PIR and PDB. PDB is more or less the oldest database, well, not in the world, but the oldest biological database. UniProt and TrEMBL are focused more on protein sequences, while the PDB is very focused on protein structure, and you have PIR, which is a very good protein database as well. So just a small overview. Then you have the NDB, which is focused on nucleic acids and changes to nucleic acids, so it's an interesting database for when you are studying things like RNA or mRNA and modifications to these.

Of course there are many more interesting and important databases, like PROSITE and Pfam. These are protein family databases, where they try to summarize proteins into different families based on features that they share. You have the DSSP and the HSSP, which are databases for protein secondary structure. We have dbEST and dbSNP. dbSNP is a variation database where single nucleotide polymorphisms are stored, together with their frequencies within different populations. So if you want to know whether this single nucleotide polymorphism occurs more often in African Americans compared to Asians or Hispanics, then dbSNP can help you answer those questions. KEGG and Reactome.
We already saw them in previous lectures, and these of course are databases which are structured in a way where you have proteins or enzymes working on metabolites, and they hold these things together. But we had a whole lecture about KEGG and Reactome, so I think that most people are familiar with them now.

You have the STRING database, which is for, I don't know, let me search that. STRING DB: functional protein association networks. So it's a database which contains around five to six thousand organisms and around 25 million different known proteins, and it has interactions between these proteins. So if you're interested in protein-protein interaction, then you can use STRING. If you want to know if this protein is in a protein complex together with other proteins, you can use STRING.

OMIM is the Online Mendelian Inheritance in Man database. OMIM contains all known Mendelian diseases and an overview of these. I think we used the OMIM database; if we didn't then just let me know and we can get back to that. But if you're interested in Mendelian diseases and which genes are causing these in humans, then you can use OMIM. OMIM is focused only on humans. Then you have OMIA, and that is Mendelian diseases in animals. It has more or less the same database structure, but it focuses on animals: Mendelian diseases in animals that are also found in humans, or not found in humans. And of course this is a very good database when your question is: does my phenotype have a single gene causing it? If so, then you can use these databases to get all the literature attached. Not only that, but you can also get the genes involved and their interaction partners and these kinds of things.

Then there's the Animal QTL database.
We gave a lecture about QTL. The Animal QTL database is there for animals and plants. It's mostly focused on animals, but they also have some model plants in there. This is a database which stores all known associations that have previously been found. So if you find that a certain region of the genome is controlling your phenotype, for example the milk yield of a cow, then you can use the Animal QTL database to see if this same region has been implicated in other breeds, and you can use it to see if there are other phenotypes, like mastitis or protein content in milk, that map to the exact same position.

Oh, I just noticed that I have a new follower. Jungling, thank you for following me, and welcome to the stream as well.

So the Animal QTL database contains all, or more or less all, known associations. Of course they provide links to literature, and they also provide things like how significant this association was, when it was found, and these kinds of things.

That's a nice animated GIF, Jungling, just for you. If you want to put your mood in the mood box, then you can use capitalized words to change your status, like if you're angry or if you're confused, and then it will update your emoticon in the mood box above me. But thanks for subscribing. All right, not subscribing, but thanks for the follow.
All right, so let's just quickly run through some of these. I think most of you will know what PubMed is: PubMed contains scientific literature, so you can search PubMed. PubMed has a very advanced searching system as well. You can say: give me all the publications written by Danny Arends in the last five years and sort them by the number of citations that they have, and these kinds of things.

So, NCBI. NCBI is generally the starting point for all biological knowledge and information. It has built-in tools like BLAST and clustering and automated data retrieval. It's very similar to Ensembl. Ensembl and NCBI are not the same database, because there are different groups behind them that maintain the data, but they are very similar in what they do. So it's a DNA database and a starting point. If you go to NCBI then you can see that it stores data on bioassays, data and software. You can look at genes and the expression of genes, and you can look at things like homology: does this protein occur in other species as well? You can do sequence analysis, you can look at variation, and of course all of these things are often linkouts to other databases where the data is stored. But they provide this starting point.
It's like the Google for biological data. The nice thing about the NCBI is that they have something called GQuery. If you don't know exactly what you are looking for you can use GQuery. The link is here, and it just allows you to do a free text search across many different databases. So if you want to know what is known about the PPAR gamma gene, then you can just fill in PPARG and click search, and it searches dozens and dozens of databases and gives you linkouts to all of the databases where your search term is found.

The NCBI search can be a little bit tricky to start off with, but there are basic search options like Boolean operators. So you can say: give me all publications by the author Danny AND Arends. If you would just say Danny Arends, then both terms are used independently, so if you want to merge things together you use Boolean operators. You can use things like AND, OR and NOT, and you can use parentheses to change the priority. So for example, I want everything known about the DGAT1 gene, and the paper should mention Bos taurus or Bos indicus. The AND applies to the whole bracketed group, and based on the brackets you can build up your queries.

The default is to search in all fields, which is generally not what you want. You can click on the advanced search option and then you have a search builder, which allows you to very specifically query several columns in their database. So you can say the author needs to be Danny Arends, the abstract of the paper should contain the word DGAT1, and the conclusions or results should contain this sentence or this section. You can also search by date range, which is really handy, because often when you are writing a publication you want to know what is recent and what has been known for a long time.
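To make the Boolean part concrete, here is a small sketch that assembles such a query string. The helper names (field, any_of, all_of) are made up for this example and are not part of any NCBI tool; the only thing taken from NCBI's search syntax is the use of AND, OR, parentheses, quoted phrases and square-bracket field tags such as [Author].

```python
def field(term, tag=None):
    """Quote multi-word terms and attach an optional field tag such as [Author]."""
    text = f'"{term}"' if " " in term else term
    return f"{text}[{tag}]" if tag else text

def any_of(*terms):
    """Combine terms with OR and wrap them in parentheses so they group together."""
    return "(" + " OR ".join(terms) + ")"

def all_of(*terms):
    """Combine terms with AND."""
    return " AND ".join(terms)

# Everything about DGAT1 where the record mentions Bos taurus or Bos indicus.
query = all_of(
    field("DGAT1"),
    any_of(field("Bos taurus"), field("Bos indicus")),
)
print(query)  # DGAT1 AND ("Bos taurus" OR "Bos indicus")
```

Without the parentheses the operators would be applied in their default order, which is why the bracketed OR group is built first and then joined with AND.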
So for example, if I want to know what was published about the DGAT1 gene in cows, so in Bos taurus, and it was published between the 1st of January 2008 and the 10th month of 2011, then you can just specify the date range, and you can use these square brackets to specify in which column of the database you want to search. So this only searches in publication date, and this third term is not searched in the abstract. If the abstract mentions 2011 then it will not find it, because it will only show you publications which have been published in that date range.

You can have pre-selected searches, and there are popular search limits, but there are also special cases like the author names or the database IDs that you can search in. And you can use wildcards and query truncation, which is also really handy, because you can for example say: show me everything which is known about DGAT1 and Bos*, where the Bos* needs to be searched in the organism field. So this matches things like Bos taurus, Bos indicus and all of the other more or less cow species which are out there. It will also match Bos musculus or something. But that's the thing that you can do.

So NCBI has a very, very detailed query engine, and there's actually a very nice help file online which describes exactly how you are able to find the exact things that you are looking for, which is really good. It has indexed fields, so it provides faster, more comprehensive and flexible access: you can search by publication date, by organism, by gene name. And of course this is very useful, because if you would just Google something, then Google doesn't really care if the name that you are searching for is an organism or a publication date. It will just search everywhere.
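The date-range search can also be sent to NCBI programmatically. The sketch below only builds the request URL for the E-utilities esearch endpoint and does not contact the server; it assumes PubMed as the target database and uses the [dp] tag for publication date. The function name and the retmax default are choices made for this example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_search_url(term, start, end, retmax=20):
    """Build an esearch URL limited to a publication-date range (start:end[dp])."""
    # The [dp] tag restricts the range to the publication-date field only,
    # so a year that merely appears in an abstract does not count as a hit.
    full_term = f"({term}) AND {start}:{end}[dp]"
    params = {"db": "pubmed", "term": full_term, "retmax": retmax, "retmode": "json"}
    return ESEARCH + "?" + urlencode(params)

url = pubmed_search_url('DGAT1 AND "Bos taurus"', "2008/01/01", "2011/10")
print(url)

# Decode the URL again to show the search term exactly as the server would see it.
decoded_term = parse_qs(urlsplit(url).query)["term"][0]
print(decoded_term)
```

Fetching that URL, for instance with urllib.request.urlopen, should return the matching PubMed identifiers as JSON. NCBI asks automated clients to limit their request rate, so a real script would pause between calls.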
But you can build very, very specific queries using NCBI, especially when you're doing literature research. If you want to write a literature review, then of course the literature review is made much easier by just being able to query NCBI in a good way. So you query, you get an overview, and then you say: well, I actually only want papers which are based on cows or plants or fish, and this gene name needs to be in the database record. And of course you can follow the help: if you go to NCBI, they have the training and tutorial link on their home page, and this will bring you to how to properly use the query builder, and also how you can do these queries not on the website but directly from things like R. A really useful, really good database to start your search for information.

NCBI is also really nice because it provides a lot of downloads. They have this download site, so you can go to "download comprehensive datasets via FTP". They have an FTP server where you can download stuff, and you can download things like whole genome annotations, so you can say: give me all the genes in mouse. So it has a bulk query retrieval system: if you don't want to go via an API you can also just do a batch query retrieval via FTP, which is really nice and really useful.

All right, so Ensembl is very similar to NCBI. It's a genome browser, or at least it started as a genome browser, but it supports a lot of things. You can do comparative genomics, and you can look at the evolutionary tree of proteins or of things like mRNAs or lncRNAs. You can also look at sequence variation, like we did today by looking at SNPs in a certain gene. And it also has linkouts to things like transcriptional regulation.
So if you want to know which promoters, enhancers and suppressors are regulating the expression of this gene, then you can find all of this in Ensembl on the gene page. It has linkouts to other databases, but it also has its own database. Ensembl annotates genes. You can use it to compute multiple sequence alignments. You can predict the regulatory function of certain parts of the DNA: does this DNA sequence have the ability to bind a certain protein? It collects disease data, so once there is an association in the database saying that the BRCA2 gene causes breast cancer, then you can also display this information on top of the genome and see which regions of the genome are associated with which phenotypes.

It provides things like BLAST and BLAT and BioMart, and the thing that I find really useful in Ensembl is the Variant Effect Predictor. Often you are dealing with single mutations or larger mutations in the genome, and you want to know what the effect of this mutation is on the protein structure or on the protein expression. The Variant Effect Predictor allows you to fill in the mutation or the variant that you found, and then it will predict the effect of this variant on protein structure. It will also tell you if this variant is deleterious or if it's tolerated, or if it will change something like a binding site. So it is a very good starting point to look at specific variants that you have.

And of course the main selling point for Ensembl is that it's connected to all kinds of different databases. If something is not in the Ensembl database, then they provide a link where you can find your data or where you can read more. It's also integrated with PubMed, so if you find a variant which you think is very interesting, then you can read all about this variant in the literature by just clicking a single link, and
then it will bring you to PubMed and show you the literature which is associated with these variants.

One of the things which is really nice, and it's not so important for this course because it's bioinformatics of plant and animal sciences, is that Ensembl is also the home of the ENCODE project. The ENCODE project is one of these massively expensive projects which is on par with the Human Genome Project. The Human Genome Project is considered one of the great achievements of our time: we sequenced the whole human genome. The ENCODE project is even better, because it is a project which was done by multiple international collaborators, and their goal is to build a comprehensive list of all functional parts in the human genome.

When people talked about DNA five or ten years ago, they would talk about things like junk DNA: DNA which does not do anything and is just there. What the ENCODE project did is use all kinds of very expensive techniques which normal research groups don't have access to. They took samples from humans and applied all of these novel techniques, like DNase I assays and methylation studies, and they went through the human genome and tried to find out what every little part of it does. So they have a list of all functional elements in the genome, including elements that act at the protein and RNA level, and all regulatory elements that control cells and the circumstances under which a cell is active. So if you want to know, in humans, because it's only focused on humans: does this part of the DNA have an effect on, for example, the expression of another gene? Does it have an effect on metabolites? Then the ENCODE project is your go-to place.
Not only that, but the ENCODE project itself is in a way the starting point for all other groups in the world, like those working on mice or on plants or on fish. What they do is they link their data to the ENCODE project. And the ENCODE project gives you the ability to use the Variant Effect Predictor: most of the data or the predictions from the Variant Effect Predictor are based on data which was gathered by the ENCODE project. But they also look at methylation of DNA and all of these other things, to see for example what happens with a cell if this Hox gene is expressed. Does it divide, or does it differentiate? All of these questions can be answered.

It's one of these projects which is just as big as the Human Genome Project, but many people don't know about it, even in biology. It's relatively, well, not relatively unknown, because many people do know that it exists, but it is an encyclopedia which contains information on all DNA elements. That's what the name stands for.

So the ENCODE project has produced genome-wide data for over a hundred different cell types, for investigating different aspects of genome regulation. Like I told you, they did chromatin structure, so 5C or Hi-C. They looked at open and closed chromatin: where is the DNA accessible within a certain cell type? So take a certain cell type like red blood cells, well, red blood cells are a bad example because they don't have DNA. But if you're interested in neutrophils or in granulocytes, then you can see what part of the DNA granulocytes can in theory use to express proteins from, right?
They look at histone modifications, and they have DNA binding of over a hundred transcription factors, done by ChIP-seq experiments in over a hundred different cell types. So there are many, many data points in there. And they looked at RNA transcription using RNA-seq and CAGE. So there's a lot of information there which is very, very expensive for an individual group to obtain, but which they provide freely, in the hope that this encyclopedia will be useful in figuring out how to deal with cancer or other diseases, and will help people that are working on these diseases.

dbSNP. We've been talking about SNPs a lot, and SNPs are the main variants which we can observe and easily measure. dbSNP provides information on single nucleotide polymorphisms. So if you find a new SNP, for example in mice, then you can go, well, not to dbSNP, because you found it in mouse. dbSNP is the original database: up until 2019 they used to store any species, but then at a certain point they decided that the data was becoming too big, so they would focus only on humans. So dbSNP contains human single nucleotide variants. It contains microsatellites and small-scale insertions and deletions in the genome, and they have links to the publications in which these were found. Not only that, but they look at population frequencies, right?
So you can see that this SNP is very commonly found in African Americans, but it's rarely found in Asians or in people of European descent. They look at the molecular consequences of these SNPs, and they provide genomic and RefSeq mapping information for both common variants and clinical mutations.

When you read about SNPs in publications, they usually have an rs ID, which means that this is a reference SNP. If you are working in other species, then since 2018 you can submit your data to, or retrieve your data from, the European Variation Archive. The animal data was more or less moved from dbSNP to the EVA, so the EVA holds the information on SNPs in cows and plants and fish.

SNPs in publications are usually referred to by their rs ID. And of course when you do things like sequencing and you find new SNPs in your species of interest, then journals require you to submit your data to the EVA if you're working on plants or animals, or to dbSNP if you're working on humans. They won't accept a publication where you just mention: well, we found a SNP on chromosome 6 at the 53 million base pair position, and this is the SNP. The journal will say no, before you can publish we require you to put your data either in dbSNP or in the EVA. Then you will get an rs ID, and when the genome build gets updated, like in one or two months or in a year, these reference SNPs will be remapped, so that it's clear exactly where they are located, what their effect is and how this SNP is distributed in different populations.

The PDB, the Protein Data Bank, contains information about protein structure and function, and the nice thing about the PDB is that they provide several visualizations. The three main visualizations that they provide to look at the 3D structure of proteins are NGL, which is a fast and interactive
web-based tool; they provide JSmol, which is a JavaScript version; and they use PV, which is the newer kind of viewer which uses WebGL to allow hardware-accelerated graphics in modern web and mobile browsers.

The PDB contains a lot of information about proteins, but their main focus is on the structure of proteins. So if you want to look at what the SARS-CoV-2 spike protein looks like, then you can use the PDB. The PDB also allows you to do predictions, right? You can ask what happens if these five molecules in the spike protein are deleted, and what that does to the structure, and they provide the tools to visualize that.

The PDB also provides a lot of additional tools like sequence and structure alignment. You can do pairwise sequence alignments using BLAST, or the Needleman-Wunsch or Smith-Waterman algorithms. But you can also do structural alignments, where you say: align this structure and search, don't look at the amino acids and don't match the amino acids one by one, but search for structure. So find proteins which look very similar to mine, for instance they have an alpha helix and a beta sheet.

They also allow you to look at protein symmetry, which is a relatively new field of research, where it turns out that symmetry in proteins is very important. So if a protein is not symmetrical, or in many cases if there's no symmetry in, for example, the cell wall pore that you have, then of course this has an influence.

And they also have a good database on protein structure quality. I think we talked about the new Google deep learning algorithm, and that of course uses the PDB to see if their predictions are matching the quality of the protein structure. So protein structure quality: how well have we been able to look at these
proteins, and how much uncertainty is still there? That is also something that you can get from the PDB.
So the protein sequence database: this is not the PDB, which is the protein structure database. You also have the protein sequence database, which is called UniProt. So UniProt is one of these databases where it's all about sequence, right, not about structure. So the PDB is really about structure, but UniProt is more about sequence. And UniProt actually is just the entry point, like it's the web server, and it contains two databases in the back end. One of them is called Swiss-Prot and the other one is called TrEMBL.
So when we look at Swiss-Prot, it looks a little bit like this. It is a manually annotated, manually reviewed database, so it is kind of of the highest quality. Like we said, manual curation is the thing that is most important, because that just increases the quality of the data that you get. And so here you see the number of entries in the database. So you see that between 2000 and 2010 there was a massive increase in protein sequences which have been produced, and the 2019 release actually had 561 thousand protein entries. It is updated every month. But you see that after like 2010 the number of new proteins coming in is relatively low. Because here, of course, they are looking at sequence: once it was worked out how DNA encodes proteins, there was a big discovery jump, but from 2010 the database is relatively stable and it doesn't add that many new proteins every time. It again has many tools integrated, like BLAST and pairwise search and batch download.
So this is one of the sides of the UniProt database. The other side is the TrEMBL database, and the TrEMBL database is based on DNA sequence.
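To make "a computer translates a coding sequence into a protein sequence" a bit more concrete, here is a minimal sketch in base R. It is only an illustration of the idea, not TrEMBL's actual pipeline: the names `codon_table` and `translate_cds` are made up for this example, and it assumes the standard genetic code and a sequence that already starts in the correct reading frame.

```r
# Standard genetic code, with bases ordered T, C, A, G at every codon position.
bases <- c("T", "C", "A", "G")
# expand.grid varies its first column fastest, so list the third base first.
grid <- expand.grid(third = bases, second = bases, first = bases,
                    stringsAsFactors = FALSE)
codons <- paste0(grid$first, grid$second, grid$third)
amino_acids <- strsplit(
  "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", ""
)[[1]]
codon_table <- setNames(amino_acids, codons)  # hypothetical helper table

# Translate a coding DNA sequence into a protein sequence (hypothetical helper).
translate_cds <- function(dna) {
  dna <- toupper(dna)
  n <- nchar(dna) %/% 3                  # number of complete codons
  if (n == 0) return("")
  starts <- seq(1, by = 3, length.out = n)
  triplets <- substring(dna, starts, starts + 2)
  aa <- codon_table[triplets]
  aa[is.na(aa)] <- "X"                   # unknown codon, e.g. containing N
  stop_pos <- which(aa == "*")
  if (length(stop_pos) > 0) {
    aa <- aa[seq_len(stop_pos[1] - 1)]   # cut at the first stop codon
  }
  paste(aa, collapse = "")
}

translate_cds("ATGGCCAAGTAA")  # "MAK"
```

Note that the function happily translates any string of bases you give it. It has no way of knowing whether that stretch of DNA really is a gene, and that is exactly where a human curator adds value.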
It's a protein database which is based on DNA sequence. And what they did is they use a computer to translate coding sequences. So they look at a gene, they take the coding sequence from the gene, and then they use computer annotation to translate these coding sequences into protein sequences. The problem here is that this is one of these kind of low-quality databases in a way, right, because it is computer annotated. There's not a person looking at it. It also has monthly updates.
And in 2019 you can see that there's a massively larger amount of protein sequences stored in this database. But you can also see here this big kind of fall in the number of sequences. And this is because they figured out that their computer annotation actually contained a bug. So in 2015 they actually had to remove, more or less manually, around 40 million sequences from the database, which the computer had predicted to be protein sequences but which actually turned out not to be protein sequences.
So this is one of these risks of using computer-annotated data: when a computer annotates the data and the computer makes a mistake, then this mistake is very significant, right, because a lot of people were trusting the predictions done by TrEMBL. And because they trusted the predictions done by TrEMBL, they thought: oh, so this part of the genome codes for a certain protein. And then like a month later they cleaned up the database and said: well, we made a horrible mistake, and there were a lot of sequences in there which were not real. So that's one of these risks.
So if you find your protein in Swiss-Prot, then always use the Swiss-Prot data and never use the TrEMBL data, unless you are working on a protein which is not found in the Swiss-Prot database but is predicted by the TrEMBL computer annotation.
All right. I think we should take a very short break.
I've been talking now again for around 40 minutes. Let me stop the recording.