 All right, so welcome everyone. Welcome to lecture number nine of our lecture series. And today we will have databases. So I guess most people know what databases are, but I just want to focus on kind of the theory behind databases and how to set up a database. We'll talk a little bit. I made an overview slide. Well, I'm not just showing the overview, I don't know. So we will first have the evaluation, which I will put a link in chat, actually, because I didn't do that. And then we will talk about why we want to have databases. We will talk about the organization of databases, some features, things that you should look out for, types of databases, a list of important databases or biological databases, which many of them have already kind of come through in the last lectures. But I just want to give you guys an overview. We'll go to some of them. And then I want to talk about BioMart. So BioMart is a package for R, which allows you to automatically query databases, which is really important when you're doing bioinformatics, because you don't want to go to every database and type in stuff and then click on links, because if you want to do bioinformatics on a large scale, and for example, you want to look at 200 genes, and then, of course, looking at them one by one is difficult. I think I fixed my overlay, so it shouldn't crash, and it shouldn't do other things, I hope. I was testing it out yesterday quite quickly. It has to be capitalized, Commando. So small words, it doesn't recognize. You have to shout it. See? That works. So I hope it will remember it between the different, between going through the break and coming back. I actually had a little issue yesterday when I was testing it, and I think it crashed the entire OBS, so the recording software that I was using. So if that happens, then the stream will go dark. So I'm hoping that we won't have any of these issues. But this is the list for today, so let's switch to the evaluation slide. 
You guys, of course, can't click on the link, so let me get the link from my email, and then I will just paste it in the chat, and of course, for the people watching on Moodle, there will be a link on Moodle as well. All right, so let me log in, and where's the link... it's here. So I think you can click on the link, right? Yeah, you can click on the link. So the university says that I should give you like 10 minutes to fill it in, so we'll just spend like 10 minutes watching this slide and me talking a little bit. Oh, Commando, thank you. Thank you for the Prime subscription. You actually get a custom emote. I made one yesterday, and I uploaded it. It's not approved yet, because it has to be manually approved, but you will have, probably next lecture, a really nice custom emote that you can use because you subscribed with Prime. That's so awesome. I was really hoping that someone would subscribe and be able to use it. Otherwise, I'm the only one that can use it, so I hope you like it. I made it yesterday, and I'm not very good at doing graphic things. How many days will the link be active? Oh, the link will be active starting the 24th of January. Oh, so it's not active yet? I should have tested it beforehand. No, it is actually active. So it's active from now until the 7th of February. At least that's what the email said. So it will be active for some time, but we'll just hang out on this slide for a couple of minutes if people want to fill it in right now. I always like getting feedback, and it really helps me to improve things like the slides. So if there's one of these lectures where you think, oh, this lecture was way too easy, or the slides were unclear, then definitely mention that in the evaluation, because I just get an overview. Don't put in weird comments. I had one of the evaluations before where people were complaining about the amount of pizza.
And of course, that looks a little bit weird to the people putting the responses together for the evaluation. But yeah, your voice will be heard. And next year, the lecture will be a little bit different, and that will be based on your feedback. So yeah, if you are interested in giving some feedback, then please do. And if you've done it, just mention in chat that you're done, like "I did the evaluation". You got all fives. Wait, no, you got all ones, because five is, of course, the worst grade in the German system. How do you mean the amount of pizza is low? You even have a pizza emote, right? Shouldn't there be a pizza emote? I think there's a pizza emote. Let me check the chat if there's a... No, there's no pizza emote. Oh, I should have made a pizza emote then. Yeah, no, it's not even in the emote box. So I agree that the amount of pizza is really low. But in previous years, we would do the lecture from 2 to 6, so a lot of people would be kind of hungry at like 5:30. And so no pizza. No pizza for you guys. Yeah, the amount of pizza is really low. So you can fill that in on the evaluation if you really want to; it will just confuse people. And I actually think that the evaluations, or the summary of the evaluations, will be made available online. I think they talked about it in the Institutsrat. So for the Albrecht Daniel Thaer Institute here, they talked about making the evaluations openly available for students, so that students can look at what the previous years' evaluations were and base their decision on that. So don't write too many crazy things, because that might hurt next year's number of viewers, the number of people that follow the course. But yeah, it's important. And definitely be as specific as possible, because then I can just take your comments and work on them. All right, so, evaluation. I'm just going to give you guys a little bit more time.
If anyone has a question, like just a random question, just shoot and I can answer; I can explain what's on the board if you want. I was doing some teaching yesterday: some PhD students had some issues with their own data analysis, so they were interested in having a little bit of a private lecture. But yeah, good. I think that should be enough time. I've been streaming now for 12 minutes; we've been on this slide for like four or five, and I should give you guys a little bit more time. I never did the evaluation myself. In theory, everyone can do it, even people that don't follow the course. But I'm hoping that we won't get trolled too much by random viewers, people that just joined the stream and say, oh, that's interesting, an evaluation link, let's fill it in without having done any of the course. But that's one of these risks, right? If you do it publicly online, people can do that. All right, so let's continue, because it's not a very long lecture, but I do need some time. All right, so next topic, which I want to point out to everyone, and especially the people who are watching on Moodle: the exam. So the exam date got approved. It's the 4th of March at 2 PM. And you should be able to register for the exam through the Agnes system. You can register up to two weeks before the exam, but I would definitely advise you to do it earlier. There have been some issues in the past where people tried to register like one day before registration closed, and then they found out it was already closed. So that's not good. So you should definitely register now. And if you have any issues with registration, then let me know. Then I can manually register you by contacting the Prüfungsbüro. There have been some issues in the past, especially with people coming from different faculties or from external universities. So yeah, just try and register in Agnes.
If you, for example, don't see the exam, then I can start mailing people and asking why that is the case. All right. Again, here I can't really help you directly. I've never registered for an exam myself; I can't even log into Agnes, because I'm an employee and I don't have rights to log into the exam system, which is a little bit strange, but it's just the way that it is. All right, so don't forget: register for the exam. Do it quickly before all the spots run out. No, we don't have a limited number of spots. All right. Oh yeah, the previous assignments. So last week, we did primer design. I hope everyone did the assignments. I was a little bit slow putting the stuff on Moodle, so I think I uploaded the assignments on Tuesday. And I need to remind myself to do that on Friday. So if on Friday evening you log into Moodle and there are no assignments, then just drop me an email and say, hey, you forgot to do that. And if that's the case, I will do it as quickly as possible, just so that you have enough time to do the assignments. I was wondering, by the way, do you like me going through the assignments? Or is your preference that I just give you like two minutes to ask questions? Because I think it's important to show you guys how to work through the assignments. So let's do it like this, and if anyone really objects and says, well, it's nicer to handle the assignments differently, then we can discuss them another way. We do take a lot of time discussing the assignments every time, which I think is OK. But if you're like, no, it's way too boring to go through the assignments like this, then let me know. OK, good. So that's some feedback, much appreciated. I think doing it here and doing it live for you guys is better as well. All right, let's open up the assignments. So the assignments were about primer design; designing primers is a common task. So let me show you guys the Firefox window. Where is it?
Where is the Firefox window? It's here. Good. And this is the evaluation. And then we can just close all of this. All right, so the first question is: go to Ensembl. All right, that's not too hard. I like to watch the videos after the lecture. Yeah, OK, that's fine; it's nice to have the explanation online. OK, so go to Ensembl, question number one: how many transcripts does this gene have? So go to Ensembl and look for a gene called PPARGC1A. This PPARGC1A is actually a very well-known bovine milk gene; it increases milk yield. Are you recording? Yes, I'm recording. I hope so. But otherwise, we can just pull it off the stream later on. But no, I think I'm recording. All right, so let's go and search for this gene. We can actually restrict the search to cow. So we just have to scroll down a little bit. It's not cat, it's cow. And we just press Go. And then the first hit is the gene that we are looking for. So we can just click on the gene. And then... it is really slow, actually. Ensembl has been really slow the last couple of days. And I've also been having some issues with BioMart. So I put all of the stuff in the slides instead of doing it live, because I was a little bit worried that BioMart would go down, or would be down, and I wouldn't be able to show you. All right, so question number one is: how many transcripts does this gene have? Of course, we can look here at the transcript table. We see that this gene comes in four different transcripts. So this single gene on the genome is transcribed into four different mRNAs, which encode four different proteins. And you can see that each of these encodes a slightly different number of amino acids. No, it's still loading. But if you go down to the transcript table, you can see how the transcripts are encoded.
But question number one, the answer is: there are four different known transcripts of this gene. All right, so question number two: go to the variation table and look at the SNPs found in this gene. How many SNPs are found when excluding the SNPs in introns? All right, so we go to the variant table. The variant table is located here. And again, it's loading very, very slowly. So we'll just click the link again and wait. I hate this. You should never do a live demo, especially with websites and stuff. All right, so we just have to wait until the variant table loads. And now we can remove the filter that I had, so we set consequence to all. And then I could say, well, I don't want SNPs in the introns, because normally when you look at genes, you're mostly interested in SNPs which, for example, can affect the amino acid sequence. So every individual will have SNPs, and they can affect the different transcripts that there are; there are four different transcripts, and each of these can carry amino acid changes. So we go to consequences, all, and in this case we just want to disable the SNPs which are in the intron, and they should be called intron variants. So when we click on that, they are off, and then we can just press apply. And now we see that they are gone, because we have selected 26 out of 27 possible consequence types. And then of course we can go down to the table, all the way down, down, down, down, and it doesn't mention how many there are. So again, here you can see that when you really want to do bioinformatics, it's of course better to load it into R, because then you can just load in the table and ask how many rows there are. But if you want to figure out how many there are by hand, then you can start counting them manually, or you just download the table. So you can just download the table here.
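That "load it into a script and just ask how many rows there are" idea is a one-liner in practice. Here is a minimal sketch in Python rather than R (same idea); the column layout of the example string is only an assumption about what an Ensembl variant export roughly looks like:

```python
def count_variant_rows(table_text: str) -> int:
    """Count variants in an exported variant table:
    non-empty lines, minus one for the header line."""
    lines = [ln for ln in table_text.strip().splitlines() if ln.strip()]
    return len(lines) - 1

# Tiny stand-in for a downloaded table (real exports have many more columns):
example = (
    "Variant ID\tChr\tConsequence\n"
    "rs123\t6\tmissense_variant\n"
    "rs456\t6\tsynonymous_variant\n"
)
print(count_variant_rows(example))  # a file with 3,214 lines would give 3,213
```

This is exactly the "subtract one for the header" step done below, just automated, so it still works when the next gene has 30,000 variants.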
It will start working, it will give you an Excel file and then based on the Excel file, you can just open it up and then you can see how many variants there are. So let me do that, file has downloaded. So in total, there are 3,213 SNPs which are not intronic SNPs. So let me show you that because I can show you my Notepad++ window and then I have to disable Firefox. And so I'm just downloading all the different variants, right, which are there. And then you can see that the first line is the header so that doesn't count. So we start counting from two. So in total, there are 3,214. So we just subtract one for the header. So that means that there are 3,213 SNPs in this gene which is a lot. But of course, like SNPs can occur in many different breeds of cows. So every cow will have slightly different SNPs or every breed will have different SNPs. So in theory, that makes a difference. All right, so there was question number two. All right, so let's just go back to Firefox. All right, question number three is download the genomic sequence plus or minus 600 base pairs around a certain SNP. And the SNP should be annotated as being a splice site SNP and save it in FASTA format to your hard drive. So this is what we did last time, of course, because if we want to design primers, we have to have a target sequence which we are looking at. So let's go and find this SNP. Let me copy paste the name so that I don't have to type it in. So we just search for this SNP and then you see here that this SNP is located at this position. So there's two ways of going in because you can just click on the variant. Then you go to the variant page. So we have to wait a little bit. But here you can see that it's on the primary assembly. It's a chromosome six. It's like 43.4 megabases. And then we can just get the data from here because we can just go here, upstream and downstream sequences, right? 
So we can just click on the flanking sequences, and using the flanking sequences we should be able to specify what we want. So here we see the flanking sequence. The SNP should be here; the focus variant is here, the W. And now the question is, does it really give you 600? Well, probably not. So we just remember where it is and then we say download sequence, right? So we can say, well, we want to download this sequence, but here the length of the flanking sequence can only be set to 400, 500 or 1,000 bases up- and downstream. So we can't use this interface, which I think is one of the big annoyances in Ensembl. So what we have to do is really go to the primary assembly. We go to this location and now we can use the standard export data option. So when we export data, we get the location. So we are looking at the SNP location. Let me check if that is really the correct location, because we want to be at chromosome six, position 4, 3, 4, 0, 3, 2, 2. It already selected 50 base pairs in front and 50 base pairs in the back, but we don't want that, we want 600. So we just say, give me this location, right, the exact location of the SNP, and then I want to have 600 upstream and also 600 downstream. All right, and then I say, no, I don't want the gene information, I just want the FASTA sequence. And in this case, you can skip the next question, right? The next question is to go to RepeatMasker and repeat-mask it there, but here we can directly use the repeat masking which is built into Ensembl. All right, so we click Next, we click Text, and then we get the sequence. So you see now that it has repeat-masked some of the areas which are repeated in the genome multiple times or which are found in possible contamination sources, although I don't think that Ensembl does the possible contamination sources.
So this is just repeat-masking the repeats where they are. All right, and you can see that every line in a FASTA file normally has 60 bases on it. So we can just count: one, two, three, four, five, six, seven, eight, nine, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 lines. So that means we have 1,200 bases here, and there's one more base because we are looking at the SNP itself plus 600 in front and 600 in the back. So we would expect 1,201 base pairs back from the export. All right, so we can just save this to the hard drive. I'm just going to copy it; I'm not going to save it. And then we move on to the next step. The next step would be RepeatMasker, but we don't have to do that since we already repeat-masked using Ensembl. And then the next question is to use Primer3 to design primers in such a way that the SNP is amplified in the PCR product. You have to use the target option to specify your SNP location. So in this case we go to Primer3. Let me really make sure that I have the sequence. So we go to Primer3 and copy-paste in the sequence. We have to remove the header, because otherwise it will complain that a non-DNA sequence was found. And then the target that we want is of course at position 601, because there are 600 base pairs before the SNP, so the SNP that we want is at position 601. And we're going to say, well, we don't want the primers to be right on top of this SNP, right? So we take like 20 base pairs and say, well, I want 601 to be in the middle of these 20 base pairs. So then the target starts at 590 and runs all the way up to 610, right? So 590 plus 20. This is a little bit annoying. And in this case, we want primers on the forward and reverse strand. So we just press pick primers, and it will pick the primers for us.
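Both bits of arithmetic in this step are easy to get wrong by one, so here is a small sketch of them: checking that the exported FASTA really contains 600 + 1 + 600 = 1,201 bases, and building a Primer3-style "start,length" target that puts the SNP near the middle of a 20 bp window (whether you start at 590 or 591 barely matters; the demo sequence below is just a stand-in, not the real export):

```python
def fasta_length(fasta_text: str) -> int:
    """Number of bases in a single-record FASTA string (header excluded)."""
    return sum(
        len(line.strip())
        for line in fasta_text.splitlines()
        if line.strip() and not line.startswith(">")
    )

def primer3_target(snp_pos: int, window: int = 20) -> str:
    """Primer3-style 'start,length' target with snp_pos near the window's middle."""
    start = snp_pos - window // 2  # 601 - 10 = 591, so the window runs 591..610
    return f"{start},{window}"

# 20 full lines of 60 bases plus the SNP base itself -> 1,201 bp expected:
demo = ">chr6 region\n" + "\n".join(["A" * 60] * 20) + "\nG\n"
print(fasta_length(demo))   # 1201
print(primer3_target(601))  # 591,20
```

Doing the check in code instead of counting lines on screen pays off as soon as you design primers for more than one SNP.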
And of course here at 601, we see our SNP of interest. And you see the little stars; the stars indicate the region which I specified as the target region. And it comes up with a primer here for the forward strand and a primer for the reverse strand. It gives you an overview at the top, and you can see that the melting temperature of the forward primer is 58.5 degrees, and for the reverse primer it's 58.71 degrees. So they are near each other, or near enough, right? I told you that the difference should not be more than two degrees Celsius. So these primers would work. Of course, there are alternative primers that you could select as well. And the product size of the primer pair that it automatically picked is 241 base pairs. So if you would use these primers on cow genomic DNA, they should give you a band at 241 base pairs compared to the ladder. All right, so that was it: just design a single primer pair. In the previous assignments, like two years ago, we were still designing five or six different primers. But the strategy is always the same, right? And if you do gasmers and these kinds of things, then you have to start using tools like Clustal Omega to align multiple sequences. If you want to make universal primers, or if you start from the protein sequence, then you use the codon table to translate back to DNA, and then you design your primers based on that. But this is more or less the way that it works. And primers are important because they're used everywhere in biology. So having a little bit of knowledge about how to design primers is of course useful. Okay, so let me go back to the presentation. And today we will be starting to talk about databases. So again, the overview, but we already did that, so we can skip to the next one.
So first some terminology, because sometimes people get a little bit confused when you talk about databases: they think the database is the website, but there are several very specific components involved. The first thing that you see when you contact a database like Ensembl is the web server. So the web server is a computer system that processes requests via HTTP or HTTPS, the secure variant, and HTTP is of course the basic network protocol used to distribute information on the World Wide Web. A web server is nothing more than a software program which makes certain files on your hard drive accessible to users outside of your computer. And then the database itself: a database is nothing more than an organized collection of data. So the file system that you have, if you're running Windows, your C drive and your D drive, is a database in a way. File systems don't usually get called databases, but in theory a file system fits the definition, because a database is defined as nothing more than an organized collection of data. What a real database brings on top of that is that the data is typically organized to model aspects of reality in a way that supports processes requiring this information. And so a database generally comes with a schema. When I think about databases, I think about a software program where there are multiple Excel-like tables inside, right? So databases are structured in tables, and these tables have relationships with each other. For a bioinformatician, the most important thing is that a database not only has a web interface, where the web server communicates with the database, but that there is also an API. An API is an application programming interface, which means that you can use a programming language to talk to the database directly. We will show an example today of BioMart, and BioMart is one of these APIs.
So it allows you to directly talk to the database from R or from PHP or Perl. So there's this API, and you can use it from many different programming languages. And since we're doing bioinformatics in R, we will use R, but in theory you could do the same thing from PHP or Perl or Python or whatever your favorite programming language is. So why do we use databases? Databases are important because they hold an entity-relationship model of the real world. That means that inside of a database, things are connected together in a way which models reality, which allows us to store data so that we can ask questions about it and create the subsets of the data that we need. So for example, think of a database which contains cow data or plant data. This database has of course the different plants in there: you have a table which holds the different individuals or samples that you measured, and these samples have an entity relationship to, for example, a table holding measurements, for example measurements based on their genome sequence. And there can also be a relationship from the same entity not just to the genome but also to the protein level. So all of this information is stored in a structured way, and the design of the database has to reflect the required functionality and features. The database provides a way of data abstraction: you don't have separate files on a file system, but data abstracted into relationships, so element A has a relationship with element B. Databases also simplify data management, because when you add a new individual, you only have to add that individual in one table; you don't have to directly add the genome sequence and all of the other things. Databases are there to structure data, to organize data, and of course to make data searchable.
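That individuals-in-one-table, measurements-in-another idea can be made concrete with a toy relational schema. A minimal sketch using Python's built-in sqlite3; the table and column names are invented for illustration, and a real animal or plant database would of course be much richer:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# One table per entity; measurement rows point back to a sample
# via a foreign key -- this is the entity relationship in action.
cur.execute("CREATE TABLE sample (sample_id INTEGER PRIMARY KEY, breed TEXT)")
cur.execute("""CREATE TABLE measurement (
                   measurement_id INTEGER PRIMARY KEY,
                   sample_id INTEGER REFERENCES sample(sample_id),
                   trait TEXT,
                   value REAL)""")

# Adding a new individual touches exactly one table ...
cur.execute("INSERT INTO sample VALUES (1, 'Holstein')")
# ... and its measurements just reference that one row.
cur.execute("INSERT INTO measurement VALUES (1, 1, 'milk_yield', 28.4)")
cur.execute("INSERT INTO measurement VALUES (2, 1, 'fat_pct', 4.1)")

# Asking a structured question across the relationship:
rows = cur.execute("""SELECT s.breed, m.trait, m.value
                      FROM measurement m
                      JOIN sample s ON m.sample_id = s.sample_id
                      ORDER BY m.measurement_id""").fetchall()
print(rows)  # [('Holstein', 'milk_yield', 28.4), ('Holstein', 'fat_pct', 4.1)]
```

A web interface like Ensembl's is, in the end, building queries like that SELECT for you; an API like BioMart lets your own script build them instead.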
And that's kind of the idea behind a database: you have structure in your data which allows you to easily search through it. One of the things a database provides as well is a way to do data visualization. Data visualization can mean many different things; you can visualize the data itself, but the nice thing about a database is that it also allows you to visualize the database schema: the relationships, the different tables that are in the database, and the links that exist, the primary keys and the foreign keys, so the relationships between the different tables. That's one of the things databases allow you to do. Of course, your hard drive stores data too, but your hard drive is not optimized for doing big queries and searching inside files, right? You can search for files, but if you want to ask "give me all the text documents on my hard drive which contain the word high", then Windows has a much harder time figuring that out for you, or Linux for that matter. All right, so a database has many advantages. Databases often allow you to store large amounts of data, and you can access the data in parallel, which is really, really useful if you have large data sets, right? My hard drive is one terabyte or something like that, but with a database you can shard your database across multiple servers. So if you're thinking about an Amazon instance: if you have sequencing data, terabytes and terabytes of sequencing data, then of course you can't hold that on a single machine, because if there were a fire in your data center, you would lose all your data.
But databases allow you to access data in parallel, so you can have multiple users access the same data from different locations at the same time, and they also allow you to store data at many different locations across the world, providing some data redundancy. So if one data center happens to be in an area which gets flooded, you don't lose your data, and the data is also still accessible, because you can store your data at different positions in the world. A database also provides data protection. Generally a database system will allow you to restrict access to certain parts of the database, which is of course very useful, especially when you're dealing with data on, for example, human genetics, where it has to be tracked exactly who is looking at which data. You don't want the person who registers you at the hospital to be able to look into your personal records, find out what your home address is, and start stalking you. So data protection: a database allows layered protection, and not only that, it also allows access logging. And it provides a way of data independence, and we will get back to that; I have a slide about it. So again, data organization. I think we already used this slide, but of course you have your client computer over here, so this is you sitting behind your desk looking at a database. The client uses a web browser and HTML input forms to talk to the web server, right? So you talk to the web server, and the web server serves you an HTML page. It can also serve things like JavaScript, and you can use PHP or Perl or other technologies on the web server to build a web page which looks really nice and is interactive. But for communication with a database, a web server does nothing more than forward data: it queries data for you from the database and then presents it in a fashion that makes it easy for the user.
For example, if you do a query to a database, and the query is "give me all the genes on chromosome one in mouse", then of course thousands of genes would come back, and the web server here makes the decision: I'm only going to show you the first 50, right? And if you then click the next button, the web server is responsible for getting the next 50 from the database. So there are many, many different database software packages, and the same goes for web server software. I actually wrote my own web server, so you can use that if you want, but you can also use standard web servers like Apache, or Internet Information Services from Microsoft; there are many different software tools which can play the web server role. For databases in general, the language that databases use is SQL, the Structured Query Language, and many database software packages have that in their name. So you have things like MySQL or MSSQL or PostgreSQL, and these are all different implementations of the same kind of language. So all of these databases you can talk to using the same language, but the difference is that each database software is optimized for certain things. Some databases focus on being as good with security as possible, while others might focus on storing large amounts of data very efficiently, and other databases might focus on how entities are combined together. And of course the web server is not limited to using a single database. It can use many different databases and query them for you to give you an overview, like the Ensembl page does; the information on Ensembl does not come from just a single database, it's dozens of databases which are queried together, and then the data is combined and shown to the user. And so, on to the features of a database: biological databases in particular differ in many aspects. A database can be focused on storing DNA sequences or DNA sequence variants.
And you can have databases which focus entirely on proteins or protein structures, and you have databases which are there to store gene expression data, so the amount of expression of a gene in a certain tissue under certain circumstances. But what you always have to keep in mind is the quality of the database that you're using. Take for example a database like dbSNP, which holds single nucleotide polymorphism information. So this database stores nothing more than changes relative to the reference sequence. But the quality of this database is relatively low, because it consists of raw submitted entries. So anyone, like me or you, could go to the website and add information to the database. And of course this makes for a database which is sometimes unreliable, because we all know what happens on Wikipedia when there's a topic which is a little bit controversial: the same Wikipedia page gets edited over and over and over again, and everyone has their own opinion, so they do something to the page. And of course this hurts the quality of the data. Then you have databases which are manually curated, and this is kind of the highest level of quality that you can get. An example of this is SwissProt. When someone submits data to SwissProt, an expert in the field looks at the data and says, yes, this really makes sense with what we know and what we already have in the database. And only after they have curated it, and for example asked questions back, like "what do you mean by this protocol" or "what is this weird thing in your data", only then does it get into the database and get presented to the public. You also have computer-annotated databases like TrEMBL, and TrEMBL is more or less the same as SwissProt; the only difference is that here all of the curation is done by a computer. So there's no human looking at the data.
No, it's a computer algorithm which decides whether the data that is being presented to the user is okay or not, so it's a computer decision, which is of course an intermediate level. So the worst quality is raw submitted entries, because anyone can add and anyone can change; then you have the computer-annotated level, where an algorithm looks at it; and then you have the manually curated quality, which is more or less the highest quality you can get, because an expert in the field looked at it. Of course this all has implications for how to set up a database, but that's something we won't really go into. One of the things you always have to look at is the accessibility of the database. Some databases are computer readable: they provide an API, and the API is there to allow people to automatically query a data set from the database and then use it in a programming language. But there are also databases out there which do not provide an API and are only human readable, for example databases where, when you go to the webpage, there's just some JavaScript which shows you everything, and of course a computer cannot really do anything with that. A human can read what's on the screen, but a computer cannot easily read it. There are also different types of specialization: some databases, like I said, focus on single nucleotide polymorphisms, some focus only on repeat sequences in the genome, some focus on things like alternative splicing, and of course some specialize in protein structure; there are many different specializations. So how do you get to know all of these quality aspects? When you are looking for a certain database, say I'm looking for protein sequences, how do I decide which database I should use and which database is good?
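What "computer readable" means in practice can be sketched as building a query URL programmatically instead of clicking through a web page. The endpoint below is entirely hypothetical (no real database is implied), and only the URL construction is shown; a real script would then fetch that URL and parse a machine-readable format like JSON or TSV:

```python
from urllib.parse import urlencode

# Hypothetical REST-style endpoint, invented for illustration only.
BASE_URL = "https://example.org/api/genes"

def build_query_url(identifiers, fmt="json"):
    # A machine-readable output format (json, xml, tsv) is what makes a
    # database usable from a program rather than only from a browser.
    params = {"id": ",".join(identifiers), "format": fmt}
    return f"{BASE_URL}?{urlencode(params)}"

url = build_query_url(["BRCA1", "TP53"])
```

A program can generate such URLs for hundreds of identifiers in a loop, which is exactly what a JavaScript-only, human-readable page does not allow.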
So if you want to get an overview of the existing web and database resources out there, because it's a very dynamic landscape where every year new databases are published, then read Nucleic Acids Research. NAR is a journal, and every January they have a database issue. This issue covers databases which have been around for a long time, with their updates and new features, but it also gives you an overview of the databases which have been published more or less in the last year. There you can get a really, really good overview, because it's a single edition of the journal focused entirely on the databases that are out there, purely biological databases of course. So if you're looking for a new database and you're saying, I'm really interested in proteins, then you can look in the January edition of Nucleic Acids Research, the whole edition about the databases that are available. And of course, go to the database and look at it yourself. Look at the release notes, look at the database statistics: do they provide the animal that you're interested in? How often do they update the database? If the release notes say "last updated January 2005", then this is not a very useful database, because like 15 years of biological knowledge is more or less missing from it.
Find out how to retrieve and provide information. Especially if you are working in the field of bioinformatics, you often have to query databases, so you have to know: how do I query the database, do they provide identifiers, can I search by sequence, and what kind of formats do they provide? This is also important when you are submitting data to a database, which you generally have to do: nowadays, when you write a paper, all of the raw data underlying your conclusions should be deposited in publicly accessible databases, and often the journal has a list of databases that they allow. Then you have to go through it and see, well, which one do I like the most, and which one makes it easiest to get my data in? One thing which is really important is that a database should have an API, and if it doesn't have an API, it should at least have a batch retrieval system. That way you can say: I have all of these identifiers, and instead of searching them one by one, copy-pasting the results and building your own Excel table out of data that is already in the database, you send them as a single batch and get back one big Excel or text file with all the information. Of course, having an API is better, but if there's no API available, don't discard the database right away; look to see whether it has a batch retrieval system, so that you can mass-query the database without going through it one by one. And if you want to decide whether a database is good, there are two questions you have to ask yourself. Is it maintained on a regular basis? Again, look at the release notes: if they are from 2005, then probably not, but if it is updated nicely every three to six months, then it's a good resource to use. One of the things that
you can also look at is how often the database is used in other publications. Some databases get cited hundreds of thousands of times a year, databases like Ensembl, the Sequence Read Archive, or dbSNP; these are more or less authoritative databases which everyone in biology uses. So if you're interested in a database and you're saying, well, this is a really good database, and then when you search for publications that use it you figure out that there are only like five people in the world who ever published a paper with data from it, then it's probably not worth going through this database; it's better to find something which is more authoritative. All right, so there are two main types of databases. The database that we will be looking at today is the object-relational database, which is kind of what I envision as being a database: a tabular layout of data, so you have rows and columns, like in Excel or in other programs that do spreadsheets. These tables are coupled together using primary keys and foreign keys, and they can be queried using the Structured Query Language, so they provide a way to talk directly to the database using SQL and get your data back. Another type of database, which has become more and more popular in the last five years or so, is the NoSQL database: a database which you cannot talk to using SQL, which is a drawback, because you have to learn a different language. People usually call them triple stores. A triple store is a database which stores just entity, relationship, entity: it's like a big table with millions and millions of entries, but this table only has three columns, and the three columns are entity one, relationship, entity two. These are very similar to the Cytoscape format that we used in lecture six. There are massive advantages
to using these triple stores, because they allow you to do semantic queries. Because all the data is nothing more than a node, an edge, and a node, you can reason about the network much better than you can with SQL. In SQL you define the structure of the database beforehand, so you are stuck with the structure that you come up with, but in a triple store, if you find a new relationship, you can just add it to the database without having to change the database structure. In an SQL database, so an object-relational database, you would probably have to make a new table and couple this table to what is already in your database, and that complicates things much more, because some stuff will break if you start adding and removing tables. All right, so SQL is a database language, and it has four major kinds of commands. The commands are: create, which creates a new element, like a new table or a new entry in a table; read, which reads back the content of a single row or a whole table; update, which updates a certain entry in a table; and delete. That's more or less what it is, so it's a CRUD kind of system: create, read, update, delete. The SQL language itself has a very easy syntax, and it's very close to English. You can say "create table with this name, which has these columns" and then specify them, or, for reading, "select these columns from this table where the ID is smaller than 100". The language is supported by most major databases like MySQL and PostgreSQL. There is a slight drawback, because there are some very database-specific dialects out there, so queries written in SQL for MySQL generally don't translate one-to-one to PostgreSQL or another database engine. The basic language is the same between all of them, but some allow you to do more fancy things, and
these fancy things generally get described in a very database-specific dialect, so you have to be careful about that. All right, so one of the things I've already mentioned is the primary key. A primary key, also called the primary key constraint, is a single column, or multiple columns that together contain values that uniquely identify each row in the table, and a table by definition can only have a single primary key. If you look at the little example table below, we see that in this case the product ID combined with the vendor ID makes up the primary key. The product ID itself does not have to be unique: there are two different vendors here, vendor number 7 and vendor number 100, which both have a product with the ID 609, so using just the product ID you cannot uniquely identify the product. But combine the two: if you have the vendor identification and the product identification, then there is exactly one product that belongs to this combination. So a primary key is generally a single column, but it can be multiple columns as well, and if two columns together are able to uniquely identify a row in the table, then you can set these two columns to be the primary key. The primary key is the index that specifies one single row in the data set. We also have foreign keys. A foreign key is a column, or again a combination of columns, that is used to establish and enforce a link between the data in two tables, to control the data that can be stored in the foreign key table. The example here is a little visualization where we have artists, recordings, and genres; if you're thinking about music, those are three main things. A recording has an album ID, which is the primary key, designated here with this little key, but of course it also has foreign keys, so you
can only add a recording to this table when you know who the artist is and the artist is actually in the artists table, and when the genre is known and is in the genres table. So it forbids you, it does not allow you, to input data unless the foreign keys are satisfied: if the artist ID and the genre ID are specified as foreign keys, then you can only add a recording to the database when the artist is found in the artists table and the genre is found in the genres table. This forces you to have data which is consistent, so data consistency is enforced by foreign key constraints, and this is really useful. It's sometimes very annoying, because sometimes you want to add a recording where the artist is not yet in the artists table, but it forces the data to be consistent: for a recording to be storable, the artist has to be known and the genre has to be known. One of the things that you see here in the database description is that the relationship is given as one-to-infinite, because one artist ID can have multiple recordings associated with it, and the same thing holds for a genre: a single genre can have many recordings. This is called a one-to-many relationship, but you can also set up one-to-one or one-to-N relationships, so you can say, well, an artist can only have one recording in the database. Sometimes you want to limit the number of links between different tables, so you can say, no, if I have a primary key in this table, then this primary key can only appear as a foreign key once, which again allows you to enforce data consistency and to make a database more structured. Databases work because of data normalization, and data normalization is a difficult topic, but the idea is that a database can have different levels of normalization. Database normalization is the process of organizing the columns, the attributes, and the tables,
so the relations, of a relational database in order to reduce data redundancy and improve data integrity. So say you have a database which looks like this: a single table which has a name, and then an address field one and an address field two. Then you will see that all of a sudden there are multiple entries in address field one which are the same. We can normalize this data (note that the picture on the slide goes the other way, from the normalized form to the flat form, so it shows denormalization): we make a table which holds names, and then we have a one-to-many relationship to the addresses. For the address we no longer use two columns; every address line just gets its own entry, so a name can be associated with multiple address rows in that table. Some of the address text may still appear more than once, but the data integrity and the flexibility are now better: if someone needs three address fields, that would be impossible in the flat layout, they could not add a third address field, while in the normalized structure they can add as many address lines as they want. So normalizing your database allows you to be more flexible. Normalization comes in different forms. The first normal form is that you want to remove duplicate data and break the data down to the most granular level. The second normal form is when, on top of that, all column data fully depends on the whole primary key and not on just a part of it: when the primary key consists of multiple columns in a certain table, you don't want to be able to identify a row uniquely using only part of the primary key. And then
the third normal form of a database is when no non-key column depends on another non-key column in the same table. If you want to know more about this, then search for the "11 important database design rules", but I just want you to know that normalization is the process of deduplicating data in a database and making sure that the relationships in the database are maintained and enforced. Another example concerns the granularity of your data: you can break your data down into logical pieces. Take for example the name of a student; that's difficult for me because these are all very foreign names, but you have for example Sham Sike, or Raju Harshing Kaurala. If you have a database with only a single column which holds the full name of the student, then it's better to break this down into pieces, because the way that human names are structured is that you have a first name, a middle name, and a last name. Of course, you can go too far in this as well, and that is called over-decomposition. Over-decomposition is when you take something like a column which holds phone numbers and start breaking it down into things like region code, area code, and then the phone extension. It's very logical to do a query saying "give me all the students who have the same last name", or "give me all the students who have the same first name", or "give me all the students who are called Raju"; those are queries that you might do on a database. But you don't often run into a situation where you want to say, well, give me all the phone extensions that live in a certain area code. It might happen, it depends on your use case and what you want to get from the database, but in general, things like phone numbers can be stored in a single column, while things like names you usually store in two, three, or multiple columns, just to have
more flexibility in querying the data in the database. Another thing which is even worse is duplicate data. Duplicate data can be very obvious, but the worst thing for a database is a free field: a field where people can put in data which is duplicate but written down in a different way. Here you see that one teacher decided to record "fifth standard" using the number 5, while another teacher filled in their data and wrote "fifth standard" out in full. Now it's impossible to do a query where you say "select from this table all the individuals where the standard is fifth standard", because you have two ways of writing the same thing, and the database engine cannot help you to query this automatically. To normalize this data, what you would do is create a standards table, which has a primary key called ID and a description column which just says what the standard is, and in your student table you now have a foreign key towards this standards table. This enforces that no teacher can fill in their own version of "fifth standard" or make a spelling error, because the standard is a primary key: the database engine itself enforces that a user putting data into the database cannot choose a standard which is not in the standards table, so duplicate data is avoided. Of course there's still a little bit of duplication, because the standard IDs still appear multiple times, but since this is a foreign key constraint, that is perfectly fine. One of the things that you should never do when you design a database is to store derived data; there's only one exception, namely when the computation of this derived data is very, very heavy. I've seen databases where people would store the average, but of course the average depends on the total marks and the total subjects, so how many points did you get and how many subjects did you do, and the average is just
calculated as total marks divided by the number of subjects. So you don't want to add a column to your database called "average", because that completely defeats the purpose of the database: it is very easy to compute and it does not really help you. So when you design databases, or get to design databases in the future, make sure that you never store any derived data. This also holds for programming languages: in R you would also not do that. When you have a data frame with data in R, you would not add a column with the averages; no, you would just write a function which, given a certain data frame, returns a list of averages based on that data frame. So again, the way that databases are structured has a logic behind it, and this logic is also in programming. All right, because we're going over time, I will take a quick break, and then I will tell you some more about RSQLite. So let me stop the recording.
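The derived-data rule just discussed can be sketched in a single query: instead of storing an average column, compute it on the fly from total marks and number of subjects. This sketch uses Python's built-in sqlite3 rather than RSQLite, but the SQL is the same; the table and column names are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE results (student TEXT, total_marks INTEGER, total_subjects INTEGER)"
)
con.execute("INSERT INTO results VALUES ('Alice', 450, 5)")

# No stored 'average' column: the derived value is computed when needed.
# CAST to REAL avoids integer division.
average = con.execute(
    "SELECT CAST(total_marks AS REAL) / total_subjects FROM results "
    "WHERE student = 'Alice'"
).fetchone()[0]
```

If total_marks is ever updated, the computed average is automatically correct, whereas a stored average column could silently go stale.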