Hi everyone, if you're watching this on Twitch live or later on YouTube, welcome to the lecture. Today we will be talking about databases and biomaRt. It's going to be a two, two-and-a-half-hour lecture, and after that I will be using biomaRt on one of the datasets that I've been analyzing this week, because we are preparing to submit an abstract to the basic gulp conference, a conference about cows and goats and these kinds of things. It is hosted once every four years, and you have to submit an abstract on what you'll be talking about. So I did a bit of preliminary analysis on data that we generated ourselves, data that we got from collaborators, and some open data. I think it would be good to just go through that and show you what I do when I work, and that the stuff I'm teaching you is something I actually use, well, not every day, but on a weekly basis. All right, first off, happy new year to everyone. This is the new year's card that we send out, so if you're important enough to be on our mailing list, you get one of these really, really beautiful cards. Did you get one, Misha? I don't know, are you on our mailing list? If not, tell me. Oh yes? Cool, cool. They're nice, right? We try to do a theme every year, and I have to give credit: my moderator designs these, I'm not skilled enough to do this myself. So happy new year everyone, and I hope 2022 will be a much, much better year than 2021.
Alright, so for today, like I told you, we will be talking about databases. Mostly I will be talking about things like: how is a database organized, what type of features does it have, and what different types of databases do we have. Then I want to talk about how to put data into a database, and some dos and don'ts: things like data normalization and some other tips and tricks for working with databases. There will be a list of important biological databases, which will be a little bit repetitive, because in the last nine lectures we've gone through a whole bunch of databases already. I just want to highlight the most important ones and give an overview, so that you don't have to go through all of the old lectures, and when I then ask a question on the exam like "name three databases which are important for protein research", you have one slide with all of the databases on it. And then I want to talk about biomaRt. biomaRt is one of these tools that is really, really useful: it lets you retrieve your data directly in R, so it allows you to query Ensembl and other big biological databases. It's just a really useful tool when you're working with biological data and you want to do some annotation, or when you're interested in how many genes are located in a certain region in a certain species. The nice thing is you don't have to point and click; biomaRt allows you to write code in R which downloads the data, and that's much easier than going through a database and making an Excel file yourself. Of course, like always, we start with the answers to the previous assignments. It's been a while; the last lecture was about primer design. I hope people did the assignments, because it was very short: just designing a single primer for a single gene. Let me pull up the assignments. So, the assignments... oh,
I didn't make a window for that, so we'll just read them; let me go to Firefox then. Of course, the first step in all primer design is getting a sequence to design a primer on, and getting a sequence means going to Ensembl, where we can select which species we want to work with. Why does it not go to the secure site? I don't like that it always defaults to the HTTP site. Here we can just select our species. For the assignment we were looking at PPARGC1A, a bovine milk gene; it's a gene which is very important in bovine milk production, so mutations in this gene have a direct influence on how productive a cow is. So we go to Ensembl and we have to look for the gene. I'm going to specify cow, let me find that in the list for you. Cow, cow, cow, where are you... there are so many different species of animals... there is the cow. All right, and then we just search for the gene. So: cattle gene, located on chromosome 6, the location is around 43 megabases, and it says minus one, which means that it's on the negative strand. We will look further into this annotation, and we will also be using it when we start downloading genes from Ensembl using biomaRt, but for the primer design I just wanted to focus on the gene first. So the first question was: when we look at this PPARGC1A gene, how many transcripts does it have? Ensembl is being a little bit slow today, but at least it shows me... you can see here that there are four different transcripts of this gene. That means there's a single gene on the genome, and this gene is able to produce four different proteins. So that's the first answer: there are four. Okay, then the next step is to go to the variation table, the so-called variant table, so let's go there as well. Again it's a little bit slow, which is always annoying. Let me zoom out a little
bit as well, so that you guys can see the whole window. That's it, now the bar is gone. So the question here is: go to the variation table and look at the SNPs found. How many SNPs are found when excluding SNPs in the introns? Right, so here we have the variation table. What we can do is first clear all of the filters, and now we see... does it give me a total? Normally at the top it should tell you how many there are in total, but I just can't see it; it might be out of frame, though. Hello Xanax, and welcome to the lecture, welcome, welcome, hope you had a good Christmas holiday. So of course we want to look at the SNPs which are inside of the transcript, inside of one of these variants. Hey, welcome moderator. We want to exclude SNPs which are in the introns, so we just look here: there are in-frame deletions, we have coding sequence variants, we have 3-prime UTR and intron variants, and the intron ones we just turn off. Here we can also see how many there are of each type: this gene has no known variants causing an in-frame deletion, but there are 302 missense variants, which means the mutation causes an amino acid change. Of course the intron variants are the most numerous, and that's generally the case: if we look at single nucleotide polymorphisms, little mutations in the DNA, what we most often see is that these SNPs occur in regions which are non-coding, because non-coding regions do not have any selective pressure on them, but genes themselves do. If you are a SNP inside of a coding region, then you have an influence on the protein, which means you come under selective pressure, because having a high or a low amount of the protein matters. So, interestingly enough, it doesn't
tell me how many there are in the table, but that's quickly solved by just clicking the download button. Just click the download button and it will give you an Excel file, and then we can look at the number of lines in the file we got. In total there are 14,952 variants. This is not entirely true, of course, because there are four transcripts, so a variant can be in one transcript, in two, or in three; we can already see from the table that there will be some duplication. In this case there isn't, because these are all downstream variants. So if we look at it slightly differently, instead of saying consequence "all"... we have to subtract the roughly 11,000 intron variants, so that means it's 14,953 minus 11,783. And hey, if we look at the stop gain variants, so I say turn all off and then only look at the stop gains, which are very important SNPs because they put a premature end to the protein, then we can see that these variants should in theory have multiple transcripts associated with them; in this case that doesn't seem to be so. Good, at least we can get an overview of all of the different SNPs. So the question was: how many SNPs are found when excluding SNPs in the introns? That's around three thousand something, 14,953 minus 11,783 is 3,170, you can do the math on that. The next question is about a single SNP called rs45045; this is a splice site SNP. So I'm going to go to consequence, turn all off, and then look at the splice sites: splice site, splice region, let me just select all of the splice site types. Then we search for the SNP that we are interested in... and we find nothing. So instead of doing this, turn all of them on and then search for this thing. All right, interestingly enough, after "turn all on" I have to click the apply button, I forgot that. So we just search for this SNP, and this one should be in
there again. So here we see what I was trying to show you earlier: this is one SNP, a single mutation in the genome, but it is actually affecting all four of the transcripts of this gene. That's something that we want to know. We can now see where this SNP is located, so we know its exact position. What we can do is take the position and go to export data... no, we don't want to export the whole gene. We want to go to the summary and export data for just a position, so go to region in detail, then we can zoom in and out, and when we click we can actually select the region we want. I saved the location of our SNP, so I'm just going to take this location and say: select this location. And then, how much do I want? In the assignment we said: get 600 base pairs in front of and 600 base pairs behind the SNP, and in this case we're not going to repeat mask it, because we want to use an external tool for that. So we have selected our SNP, we have 600 base pairs in front and 600 base pairs in the back, we click next, we just say text, and we get our FASTA sequence here. Every line in a FASTA sequence is generally 60 characters long, so this is base pair 1 to 60, base pair 61 to 120, and so on, and we can see that there's one leftover, because we select one position and then 600 on either side: in total we are downloading 1,201. All right, let's copy this and go to the next question: how long is the exported sequence? The exported sequence is 1,201 base pairs. At which location is the SNP located? Well, exactly in the middle, because that's what I asked Ensembl to do. Normally you would want to verify with your own eyes that the SNP you are looking at, which is probably this one here, is actually the SNP that you're looking for. It's really easy to just make a
small typo, and that's why you generally copy-paste the location instead of writing it down and typing it in, just to prevent making any errors. All right, next: use RepeatMasker. So let's go to RepeatMasker... those are search results, I don't want that, I just want to go to repeatmasker.org, and I want to mask repeats. I give it my sequence, then I have to say which DNA source I have. Unfortunately they do not provide cow as a standard selectable option, so we have to say mammal (other than the listed ones). As the return format I want HTML, and then I just submit it; this shouldn't take too long. What RepeatMasker does is a very important step, because when you are designing primers you don't want to design primers on repeated regions, like we discussed last time, because then your primers will not be unique. Unfortunately we ended up in the queue, so it might take a little bit longer. Repeat masking is one of the most important steps because it makes sure that we are not designing primers which target repeated regions, but when you look here you can see that it also checks for other things, like retrovirus sequences, and of course we also don't want to design primers in retrovirus sequences. It gives us the overview here: there's one simple repeat of 24 base pairs, and a low-complexity region of around 30 base pairs. Then, all the way at the bottom, we have the output formats, and we are going to say: I want the masked file. When I look at the masked file, we see that very close to the SNP of interest it actually blocked out a part of the sequence using Ns. We can see here that there are around 60-ish base pairs that have been blanked out, so these 60 base pairs should not be used to design our primer. All right, let's take this, and then we go to Primer3. Primer3
is where we can design our primers. We just give it the sequence that we're interested in; I'm going to say pick a left primer, pick a right primer. And what do I want to target? I want to target position 601, of course, because that is the position our SNP is in, and then I say comma one, because I want to target this position with one base pair around it. I don't want to exclude regions or these kinds of things. That becomes useful when you start designing multiplex primers or other more complicated primers; then you do want to start excluding regions or having very specific target regions for your primers. The excluded region is "where do I not want a primer", and the included region is "where do I want a primer". In theory we could say: I want an included region so that my primer starts at least 30, 40, or 60 base pairs before position 601. This matters when you start sequencing, because the first 30 to 60 base pairs of a read are generally very low quality. So for the piece of DNA that you take out of the animal, or out of the human, if you are planning on sequencing it, you want to have your primer like a hundred base pairs in front, and you can use the included region to make sure that you get a primer there, or you use the excluded region and exclude everything from, say, 500 to 700 base pairs. Anyway, these are all the settings that we need, because we're only interested in making a primer quick and dirty, so we just say pick primers, and it very quickly picks a primer pair for us. If we zoom out a little bit, we can see the target SNP here at position 601, and then here we see the forward primer that was picked, and here the reverse primer. We can see that the Tm's of the primers are very close together, there's only 0.4 degrees Celsius difference, so these primers should work very well
together. You can see that there are no hairpins or other things which would influence the primers. Of course, the more difficult your primer design is, for example if you're trying to design a primer for all four transcripts at the same time while these transcripts have slight differences from each other, the harder it becomes for the program to pick proper primers. For some reason the length of the right primer is only 18 base pairs; no idea why it didn't pick a 20 base pair primer, but if we don't like the primers, it also suggests some backups. Let me see: "use Primer3, you have to use the target option", so that's what we did, and that was the last one. These are the two primers recommended by Primer3, and it gives us four backup alternatives in case we don't like the pair that was picked. Good, so those were the answers, I hope they were not too hard: going to a database, downloading your sequence, going to RepeatMasker and masking out the repeats, just to make sure. One thing that we still need to do: here we see it's a T, so we can go back to Ensembl and back to our SNP, just to make sure that we have the right one, because we want to be sure that this SNP is actually a T-to-something SNP. And when I search, indeed we can see that this SNP changes the T: some animals have a T and some animals have an A, but the reference is the T, which lines up with our Primer3 output. So we did a good job, we picked a proper primer, that's perfect. Good, those were the assignments, let me switch back. So for today: databases. What are they, how are they organized, what features do they have, what types of different databases are out there. I want to talk a little bit about data normalization, because that's an important skill when you're designing your own database, but it's also an important skill when you're collecting your own data.
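Before we leave the primer assignment behind, the two sanity checks we just did by eye, that the SNP sits exactly in the middle of the 1,201 bp export, and that the two primers have closely matched melting temperatures, can be sketched in a few lines of Python. The primer sequences below are invented placeholders (not the actual Primer3 output from the lecture), and the Wallace rule is only a crude stand-in for the nearest-neighbor thermodynamic model that Primer3 actually uses.

```python
def wallace_tm(primer: str) -> int:
    """Rough melting temperature via the Wallace rule: 2*(A+T) + 4*(G+C)."""
    p = primer.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2 * at + 4 * gc

# Sanity check on the exported window: 600 bp + the SNP itself + 600 bp = 1,201 bp,
# so with 0-based indexing the SNP sits at index 600 (i.e. position 601).
window_len = 600 + 1 + 600
snp_index = window_len // 2
print(window_len, snp_index)  # 1201 600

# Hypothetical primer pair, for illustration only.
forward = "ATGCGTACCTGAGGTTCAGT"  # 20-mer
reverse = "CAGGTTCACGTGACTGAC"    # 18-mer
delta = abs(wallace_tm(forward) - wallace_tm(reverse))
print(wallace_tm(forward), wallace_tm(reverse), delta)
```

With a real Primer3 pair you would want that Tm difference to stay well under a degree, like the 0.4 °C pair from the assignment.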
So the rules for data normalization don't just apply to databases; they also apply to the text files on your hard drive. Then there's the list of important biological databases, just so that you have two slides which mention all of the main ones; most of these have already been discussed in the previous lectures. And then biomaRt, because it's an amazing tool that you can use to download data directly from Ensembl into R, and you can search and query it; it's one of these tools that you use a lot as a bioinformatician. Like I said, we will have a small example on the slides, but if there's any time left I will just program some biomaRt stuff for you live, because I have to do it for the basic all conference anyway, so why not have you look in while I'm working. Good. So, a database. I first wanted to start off with some terminology. I think this is a bit of a duplicate slide, because we already had terminology on web servers, databases and APIs, but: a web server is a computer system that processes requests via HTTP, the protocol that allows computers to talk to each other, and this is the basic protocol for distributing information on the world wide web, although no one calls it the world wide web anymore. The web server is generally one part of the equation, because when we go to Ensembl, we send a request and we get a web page back, but of course Ensembl itself is nothing more than a big collection of databases behind these web servers. A database is an organized collection of data; the data is typically organized to model aspects of reality in a way that supports processes that require this information. So a database is a kind of generalized structure of reality: it tries to model reality, but in such a way that you can easily query it and easily find what you want to find. And then
an API is an application programming interface, which is generally nothing more than a web server that allows you to do direct queries to a database. The term API is used in many different fields, but APIs are things like biomaRt: biomaRt is an API which allows you to query the Ensembl database and download data at large scale. So why do we use databases? Databases are, like I told you, an entity-relationship model of real-world processes. A database allows us to say beforehand "the world is structured in this way", which is of course a very limited way of describing the world. The design reflects the functionality, and it has the following features. One feature is that data is abstracted: there is a level of abstraction in your data, because you're saying that, for example, we have a person, and a person has a home address, a person has a phone number, a person has things that they like. You put persons into one table, the phone numbers go into a different table, and you have a table for addresses. This abstraction gives you a structured overview of what you can put in. Databases also provide a way of data simplification, because data that is common can be deduplicated. You don't put duplicate data into a database; you just have a table and then you refer from one table to the other table, so you just have a link. You don't have to write down the same street name a hundred times in the same table; you have one table which is persons and a table which is addresses, and if two people live at the same address, they just have a pointer from the person table to the address table. That allows you to simplify data and remove duplication. It allows you to structure and organize
data in a predetermined plan, and it also makes data searchable. It also provides some level of data visualization, especially for the database schema. The database schema is the description which shows how the data is organized, and this schema can be visualized and can give you insight into how you have modeled the entity-relationship model. So there are a lot of advantages to using databases. A lot of people say: but your hard drive in your computer is also a database. That is true, but databases provide much more than a basic hard drive. One of the things that databases are really good at is storing large amounts of data and making subsets of that data. Instead of having to go through one big file on your hard drive trying to find the information you're interested in, a database can do that for you: it stores all of this data, but when you ask a question it only gives you back the data that you are interested in. You can also access it in parallel, and this is really useful, because something like Ensembl literally has millions and millions of users, and all of these people are asking questions at the same time. The nice thing is that because databases allow parallel access, all of these millions of questions can be answered more or less simultaneously, and you don't have to wait for someone else's query to finish before yours can run. That is different from hard drives: hard drives also have some parallel access, but databases can be large-scale parallel, even across multiple locations. Data redundancy is also something that a database gives you. If you look at the Amazon web servers, a website like Facebook is not located physically in a single data center; Facebook is more or less distributed around the whole world, and it has
hundreds of data centers, and each of these data centers stores the data redundantly. So if someone is crazy enough to blow up one of Facebook's data centers, Facebook doesn't go down, because another data center will just take over serving the requests. It also provides some data protection: many databases provide integrity checks for the data. A random muon flying through space and hitting your hard drive will sometimes flip one or two bits, and a database can detect that data has changed and can repair it. And it provides data independence, because it liberates the data from the format: you can query data without knowing the format in which it is stored. If the people at Facebook or some other company decide to change the way their data is structured, you as a user of the database are generally not bothered by it, because the question you ask still gives you back the same data, and the question generally also specifies the format that you want your data in. So it allows people to change the way they store data without the user being bothered by it. If we look at database features, since we're talking mostly about bioinformatics databases and not so much about Facebook databases, the databases we are interested in differ in many aspects. We have already seen DNA databases, databases which only contain data on the DNA level: for example, dbSNP is a database for single nucleotide polymorphisms, and it of course only stores DNA modifications. We have databases which focus on proteins, like PDB, but we also have gene expression databases, databases which are only focused on storing gene expression data. When you are looking at a database, you have to look at the quality of the data in the database, because there's a
big difference between databases. Some provide raw submitted entries, like dbSNP: these are not curated at all, anyone with a computer and an internet connection can add data, and of course that makes the data a bit shaky, because anyone can put anything in there. You end up in a situation like Facebook and news: anyone can post on Facebook, so you have to be really careful about the quality of what you read there. Then there are databases which use computer annotation, like TrEMBL. We will talk more about the TrEMBL database, but this is a database which uses computer algorithms to annotate the data that's in there, and this provides a medium level of quality. The computer is not a human, so it has no knowledge about current biology or new trends; computer annotation is nice, but it's not 100% reliable, because it's based on some algorithm that someone wrote, and there can be bugs in these algorithms. We will see an example where a computer-annotated database went completely wrong, which meant that a lot of people thought the data they got from the database was reliable, but it turned out not to be as reliable as they thought. The highest level of quality is manually curated databases. For example, Swiss-Prot is one of the databases with protein information, and the people at Swiss-Prot employ data curators: people who work in the field, look at the data, read the papers, and make sure that the data in the database is exactly the way it is described in the paper. Here there's a human looking at the submitted data, and that just gives you more reliability: it's an expert in the field curating the data, which is completely different from a raw submitted entry.
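To make the "computer annotation can go completely wrong" point concrete, here is a toy sketch, my own invented example, not the actual incident from the lecture, of how similarity-based annotation transfer (the basic idea behind automated pipelines like the one feeding TrEMBL) propagates a single wrong seed label to every sequence annotated from it.

```python
def similarity(a: str, b: str) -> int:
    """Crude stand-in for sequence similarity: count matching positions."""
    return sum(x == y for x, y in zip(a, b))

def transfer_annotation(query: str, annotated: dict) -> str:
    """Copy the label of the most similar already-annotated sequence."""
    best = max(annotated, key=lambda seq: similarity(query, seq))
    return annotated[best]

# One seed entry carries a wrong label; no human ever re-checks it.
annotated = {
    "ATGGCA": "kinase",        # wrong! a curator would have caught this
    "TTTCCC": "transporter",
}
for query in ["ATGGCT", "ATGACA"]:
    # Both queries resemble the bad seed, so both inherit the wrong label.
    print(query, "->", transfer_annotation(query, annotated))
```

This is why the lecture ranks computer annotation below manual curation: the error doesn't stay in one entry, it spreads.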
There are also two types of accessibility. Some databases are only human readable, which of course is a drawback for bioinformatics, because we want a database which is computer readable, or which at least presents us with an API, so we can directly query the database and get out the data that we want. But there are databases which are only human readable. And of course there are many types of specialization: a DNA database can specialize in single nucleotide polymorphisms, like dbSNP, or in repeated sequences, or there are databases which only look at alternative splicing of genes, so they look at the different transcripts that come from a single gene. Some general advice for when you write a paper in the future and you have to use databases as data sources: there is the journal Nucleic Acids Research, and every January it has something called the database issue. The database landscape is very dynamic. Some databases are there forever, like Ensembl; it's been around since the late 1990s, so it's a very reliable database and it won't just go away. But other databases might be discontinued because funding ran out or other issues were detected; generally it's the funding running out that stops people from continuing the database. And new databases also pop up. Think about things like microRNAs: we know now that microRNAs are very important, but 30 years ago no one would have made a database for microRNAs, because they hadn't even been discovered yet. The database issue, which is always in January, gives you a really good overview. So if you want to know what's hip and happening in the database world, in biology of course, take the January issue of Nucleic Acids Research. And for people who design and build databases, it's always a big push to get their database published in this January issue of NAR, because it has a very large audience.
All right, so: get an overview of the content. When you pick a database to do your analysis with, look at the release notes, the database statistics, and these kinds of things, to answer questions like: does it cover the animal that I'm currently working on, is the data easily accessible and easily queried, can I use something like an API to download my data. And about those release notes: if you see that the latest release was five years ago, this might not be the most up-to-date database to use. You also have to find out how you provide information to the database, because when you're searching for some data you have to make sure that you have the right identifiers. If I'm working with genes, every gene has an Ensembl gene ID, the name that Ensembl gave the gene, but other databases like Entrez use their own naming system. So if my data is annotated using Ensembl IDs, I cannot directly use a database which requires me to input Entrez IDs. There are a lot of tools that allow you to go from one type of identifier to another, but it's important for you to know how to query data from the database. You also want to know, for sequences, what kind of formats you need to provide, or can provide. And the main thing, as a bioinformatician, is that I am always interested in whether there is a batch retrieval system, because I as a bioinformatician am not going to go to a database, start clicking around, download data, and make my own Excel file.
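The identifier problem above is easy to sketch. The mapping table below is pure invention (placeholder IDs, not real Ensembl-to-Entrez pairs); in practice you would download such a table with a tool like biomaRt rather than typing it in by hand.

```python
# Hypothetical Ensembl-gene-ID -> Entrez-ID lookup (invented placeholder values).
ensembl_to_entrez = {
    "ENSBTAG00000000001": "100001",
    "ENSBTAG00000000002": "100002",
}

def translate(ids, mapping):
    """Split a batch of IDs into (translated, missing) against a mapping table."""
    translated = {i: mapping[i] for i in ids if i in mapping}
    missing = [i for i in ids if i not in mapping]
    return translated, missing

hits, misses = translate(["ENSBTAG00000000001", "ENSBTAG99999999999"],
                         ensembl_to_entrez)
print(hits)    # the ID the mapping knows
print(misses)  # the ID it doesn't; never silently drop these
```

The point of returning the misses explicitly is exactly the caveat from the lecture: not every identifier translates cleanly between naming systems, and you want to see which ones fell through.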
No, I just want to be able to talk directly to the database and retrieve half of the data in there. Maintenance is also one of these things: is it maintained on a regular basis? A lot of databases get published, become very popular, people use them very intensively for a short time, but then a new, better database comes up with better annotation algorithms, and everyone switches from one database to the other. So if you are thinking about using a database, make sure that you see other people using it in recent publications, because it might be that you spend a lot of time writing your code to download data from database A, but database A was the one that was hip and happening in 2018, and in 2022 we use a completely different database. If it is used by others in publications, that's generally an indication that the database is a good one to use. So, if we talk about types of databases, we have two different types. There is the (object-)relational database, which is the most common type: it is a tabular layout of the data, so you have rows and columns, you generally query it using SQL, structured query language, and it couples these tables together using primary key / foreign key relationships. We will talk about why this is, but relational databases are the most common ones: when people talk about things like MySQL or PostgreSQL, they are talking about relational databases.
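The primary key / foreign key coupling can be shown with Python's built-in sqlite3, using the person-and-address example from earlier in the lecture (the table and column names are my own choice for illustration). Two people share one address row, so the street name is stored only once and the person table just points at it:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# One table per entity; the street name lives only in `address`.
cur.execute("CREATE TABLE address (id INTEGER PRIMARY KEY, street TEXT)")
cur.execute("""CREATE TABLE person (
    id INTEGER PRIMARY KEY,
    name TEXT,
    address_id INTEGER REFERENCES address(id))""")

cur.execute("INSERT INTO address (id, street) VALUES (1, 'Dairy Lane 12')")
cur.executemany("INSERT INTO person (name, address_id) VALUES (?, ?)",
                [("Alice", 1), ("Bob", 1)])   # two pointers, one address row

# The JOIN follows the foreign key back to the shared address.
rows = cur.execute("""SELECT person.name, address.street
                      FROM person
                      JOIN address ON person.address_id = address.id""").fetchall()
print(rows)   # [('Alice', 'Dairy Lane 12'), ('Bob', 'Dairy Lane 12')]
```

If the street name ever changes, you update one row in `address` and every person pointing at it is automatically up to date; that is the deduplication argument from the slide in working form.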
For the past five or six years there has been another type of database which has made its entry, which is the NoSQL database, also called triple stores. These are databases which are not using tables; they just have triples in there, very similar to the Cytoscape SIF format: you have object, relation, object, and you have millions and millions of such entries, and that is how they store their data. The nice thing about this is that you can do semantic queries, which means that the database can better understand what is in the data, in a way. An SQL query is "select from this table everything where this column has a value lower than x", but a semantic query would be something like "show me all of the things related to a certain concept". A triple store can do that because it knows what the relationship is between two objects, so it can reason about objects as well: it knows that if A is connected to B via a certain relationship, and B is connected to C via the same relationship, then A is also connected to C via B. But I don't want to talk too much about NoSQL and triple stores, because they are not used that much yet, although they are becoming more and more important in bioinformatics research. That is because they can be blazingly fast compared to classical object-relational databases, and they allow you to do these semantic queries, so they allow you to ask things which you cannot easily ask using SQL, where you're just saying "select from here these things that start with an A". So SQL is the standard language for databases, and it's a very simple language. It's often called a CRUD language because it provides four main features (actually five, because SELECT also belongs in here): create, read, update and
delete. You can create a new entry in a table, you can read an entry in a table, you can update an entry in a table, or you can delete an entry from the table. It is a very easy syntax and it's very much related to English: when I told you guys "select star", so everything, "from this table where these things apply", that is the way that you write queries in SQL; the SQL language literally has a SELECT something FROM a table statement. So it is a really easy syntax, and it is related to English. Of course you can merge tables together and have foreign key constraints and such, and it is supported by most major databases; even some NoSQL databases provide an SQL interface to their database. One of the drawbacks is that in SQL there are a lot of database-specific dialects: MySQL has a slightly different SQL dialect than PostgreSQL, which means that queries that work for one database are not one-to-one translatable to another database engine, so to speak. These database-specific dialects are just something that you have to be aware of, but generally, within a team or within a university, you're only using a single database system anyway, so you're just learning that dialect; there are small, minor changes depending on the type of database that you're using. All right, so I talked a little bit about primary keys. A primary key is a constraint on a table, because if I have a table, then as a database I have to be able to identify each row in the table in a unique way, and that is called the primary key. A primary key can be a single column: for example, you can just say I have a primary key column which is 1, 2, 3, 4, 5, 6, 7, so every row just gets the row index as its primary key. But you can do more advanced things by using combinations of columns. For example, if I look at the table here, then the primary key in this table is made by using the product ID
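The four CRUD operations can be sketched with Python's built-in sqlite3 module; the table name `genes` and the values in it are invented for illustration, and a real setup would of course talk to MySQL or PostgreSQL instead:

```python
import sqlite3

# In-memory toy database; the 'genes' table and its rows are made up.
con = sqlite3.connect(":memory:")
cur = con.cursor()

# Create: a table and some entries
cur.execute("CREATE TABLE genes (id INTEGER PRIMARY KEY, symbol TEXT)")
cur.execute("INSERT INTO genes (symbol) VALUES (?)", ("BRCA1",))
cur.execute("INSERT INTO genes (symbol) VALUES (?)", ("TP53",))

# Read: SELECT ... FROM ... WHERE reads almost like English
rows = cur.execute("SELECT symbol FROM genes WHERE symbol LIKE 'B%'").fetchall()

# Update: change an existing entry
cur.execute("UPDATE genes SET symbol = 'BRCA2' WHERE symbol = 'BRCA1'")

# Delete: remove an entry
cur.execute("DELETE FROM genes WHERE symbol = 'TP53'")
remaining = cur.execute("SELECT symbol FROM genes").fetchall()
con.close()
```

After these four statements, `rows` holds the one symbol starting with B, and only the updated row is left in the table.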
and the vendor ID, and this is because two different vendors can have the same product ID; that's just the way it works. If I order something from MediaMarkt, then MediaMarkt uses codes for their products, but when I go to Saturn they use different codes, and these codes might overlap. So the nice thing about a primary key is that it's not limited to a single column; you can also use a combination of columns as the primary key of a table. The goal of a primary key is to uniquely identify each row in the table. This is also the part in red: because of this, a table can only contain one primary key constraint, and these constraints are enforced. So if I now try to add a new entry to this database, and this entry again has product ID 609 and vendor ID 7, then the database will not allow me to do that. It provides a data integrity check, and because of these primary keys it's very hard to input data into a database which is broken or inconsistent: the database can say, well, no, you cannot input this product ID for this vendor ID, because it already exists. Primary keys are used in foreign key constraints. When I build a database, for example a database which looks like this, it has artists (an artist is someone who makes music), then we have recordings, which are defined by album ID, artist, genre, year of release and these kinds of things, and then we have genres. In this database the artist ID is the primary key for the artists, and the genre has a genre ID, because there are many different artists and many different genres, but a single artist can make music in different genres, and in one genre there can be many different artists. The idea here is that for a recording, the primary key is the album ID: an album ID is the thing that uniquely identifies a row, but
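A minimal sketch of this composite-key idea in SQLite, reusing the 609/7 product-vendor pair from the slide (the prices are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# The pair (product_id, vendor_id) is the primary key: two vendors may
# reuse the same product code, but one vendor may not list it twice.
cur.execute("""
    CREATE TABLE products (
        product_id INTEGER NOT NULL,
        vendor_id  INTEGER NOT NULL,
        price      REAL,
        PRIMARY KEY (product_id, vendor_id)
    )
""")
cur.execute("INSERT INTO products VALUES (609, 7, 19.99)")
cur.execute("INSERT INTO products VALUES (609, 8, 24.99)")  # same product, other vendor: fine

try:
    cur.execute("INSERT INTO products VALUES (609, 7, 9.99)")  # duplicate key pair
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # the database enforces the constraint
con.close()
```

The second insert succeeds because only the combination of both columns has to be unique; the third is rejected by the database itself, not by application code.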
because of the foreign key constraints, which we can put on the artist ID and the genre ID in the recordings table, we can make it so that the database will reject recordings for which one of these two is not pointing to a valid source. So if I design my database like this, with primary keys and foreign key constraints, then when I try to insert a recording with an artist ID that does not occur in the artist table, the database will not allow me to do this, and the same thing holds for the genre ID. Foreign keys are a way to say: this column in the recordings table can only contain values which occur in the artist table, or values which occur in the genre table. This is the way that data consistency is enforced in an object-relational database. I hope that's clear; foreign key constraints take a little while to wrap your head around. Of course, a foreign key constraint can also point to a multi-column constraint in another table, but generally a foreign key constraint is based on a primary key in a different table. So if I have a database which contains three tables, then every table has its own primary key (album ID is also a primary key; I actually forgot to put a little red box on this), and the foreign key is just a way to enforce data consistency. All right, so I want to talk to you guys a little bit about database normalization. There is an official treatment of this: if you do computer science, people always talk about databases in normal form 5, normal form 3, normal form 8, but I don't want you guys to know it in this much detail; I want you guys to understand what normalization is. There are different levels of normalization, but I don't require for you guys that
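The artist/genre/recordings setup above can be sketched in SQLite as follows; the artist and genre names are invented, and note that SQLite only enforces foreign keys once you switch them on with a PRAGMA (MySQL/PostgreSQL enforce them by default):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("PRAGMA foreign_keys = ON")  # SQLite needs this switched on

cur.executescript("""
    CREATE TABLE artists (artist_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE genres  (genre_id  INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE recordings (
        album_id  INTEGER PRIMARY KEY,
        artist_id INTEGER REFERENCES artists(artist_id),
        genre_id  INTEGER REFERENCES genres(genre_id)
    );
    INSERT INTO artists VALUES (1, 'Miles Davis');
    INSERT INTO genres  VALUES (1, 'Jazz');
""")

cur.execute("INSERT INTO recordings VALUES (100, 1, 1)")  # both references valid

try:
    # artist_id 99 does not exist in the artists table, so this is rejected
    cur.execute("INSERT INTO recordings VALUES (101, 99, 1)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
con.close()
```

The recordings column can only hold artist IDs that actually occur in the artists table; the invalid insert never makes it into the database.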
you know all of the requirements for a first normal form database or a second normal form database. Data normalization is the process of organizing the columns (the attributes) and the tables (the relations) of a relational database to reduce data redundancy and improve data integrity. For example, we can have something like this, where we have a denormalization step, so this one is more normal than that one. What we have here is a table called names and a table called addresses, and then we have a constraint between the names and the addresses, which says that Shiv, some guy named Shiv, is living in Mumbai, India, and we have a guy called Raju who is also living in Mumbai, India. You can see, of course, that this addresses table will contain a massive amount of repetition. We can denormalize this data, because the fully normalized layout is the way that database theory tells you to do it, but it can be overkill. In this case you could denormalize the data and say: no, we just have a single table which has a name, an address 1 and an address 2. Of course this has a big drawback as well: if we now want to store a third part of the address, for example the continent, then we can't do that, while we can do that in the normalized form; there we could just add more rows saying Shiv, Asia and Raju, Asia. So the denormalized form limits us, and generally it's better to normalize your data and have a separation between people and addresses, because people are not addresses and addresses are not people. You don't want to put them into a single table; you want to split this information across these two tables. Of course this address table will become very, very long, and that is in this case okay; it depends on your use case. If you know for certain that there will be no one in your database
who requires three address fields, then you could do it like the denormalized version. But data normalization is also a way of getting a feeling for what the best way is to chop up your data into different tables, using these foreign key constraints and these primary keys. Here we see the three different normal forms: in the first normal form you want to remove duplicate data and break the data down to a very granular level, so that means going from a table like this into a table like this; the second normal form means that all column data should depend fully on the primary key and not on just a part of the primary key; and the third normal form says that no column should depend on other columns. This is the way that we would normalize, but I'm not here to drill you on the formal definitions. So the first thing, granularity, is to break your data down into logical pieces. If we have a table which holds student names, we can just have a single column called student name, which then contains the whole name of the student. However, it is better to break your data down to a more granular level, because then you can do different queries. If we break the name down into three parts, saying that every name of every student is composed of a first name, a middle name and a last name, then we can say: give me all the students whose middle name is Haringish, or give me every student whose last name is this. That allows our database to do sub-queries, because a database generally cannot look into the values inside of a single column, but it can use a selector on a column, so we can say: only show students for which the last name is Koirala, and then you only get the first two. So granularity is very important, also when you're storing your own data: when you do experiments, then with your experimental data you
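The student-name example can be sketched like this; the Haringish and Koirala names come from the slide, the rest are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Granular columns instead of one free-text 'student_name' column.
cur.execute("CREATE TABLE students (first TEXT, middle TEXT, last TEXT)")
cur.executemany("INSERT INTO students VALUES (?, ?, ?)", [
    ("Anita",  "Haringish", "Koirala"),
    ("Bikash", "Ram",       "Koirala"),
    ("Chitra", "Haringish", "Shrestha"),
])

# Because the name is decomposed, we can select on one part of it:
koiralas = cur.execute(
    "SELECT first FROM students WHERE last = 'Koirala' ORDER BY first"
).fetchall()
con.close()
```

With a single `student_name` column this query would need fragile string matching; with granular columns it is a plain WHERE clause and returns exactly the two Koirala students.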
generally want to find a logical granularity for your data. Of course, you can go too far in this as well: too much granularity can be very bad. Take things like phone numbers: it makes no sense to split your phone number into three different parts, like a region code, an area code and a phone extension, because you're never going to ask "show me all of the phone numbers where the extension is 7100"; that doesn't really make sense. Well, you might want to select by area code: you might want to see all of the phone numbers in your database which are located in a certain area code. But generally this is called over-decomposition. Taking a single column which has composite data and breaking it down across multiple columns is called decomposing your data, making it more granular, but you can of course over-decompose your data as well, which means that you take something which is a single unit and start breaking it up into smaller units which don't really make that much sense anymore. This is of course very much based on what you want to do with the data later on. One of the reasons why we use databases is to not duplicate data all the time, but a lot of times, and you see this a lot in bioinformatics, when you get data, the data is non-uniform. Here you see we have this column called standard (fifth standard or sixth standard), but because this column is not under a foreign key constraint or a primary key, it is just a free column, so you can put in this column whatever you want. Some people will write "fifth standard" like this and other people will write it like that, so now I am unable to select from this little table all of the individuals which are in the fifth standard, because of the way that it's written in different ways. So generally, if you have this kind of
non-uniform, duplicated data, and this occurs a lot with experimental data too. Let me give you an example: recently we've been doing research on BXD mice, and before these BXD mice are slaughtered and taken apart and we measure all kinds of things on them, we do a physical inspection of the mouse. But when I am visually inspecting the mouse, I will use different names for colors than one of my colleagues: I might call it brown, while they might call it light brown or dark brown; they might make a distinction between these two colors, but I just call it brown. So this is non-uniform data. The same thing happened when I was scoring mice and looking at their backs: sometimes a mouse has a little hump, and I scored the hump in three levels, so I say it has no hump, a medium hump or a very severe hump, but a colleague of mine just scored yes/no. Then you cannot use this data in the end for things like statistical modeling, because when you do statistical modeling you want to have a uniform description of something. So how do we solve this? The way to solve this is to use these foreign key constraints. We have a primary key here, but now we just make a new table: we have the table with students and then we have a table with standards, so we have the fifth standard and the sixth standard, this one is called ID 1 and this one ID 2, and when we input the standard at which a certain student was scored, we just use this 1 or 2. Now it's impossible for me to put something into the database which is different from fifth standard or sixth standard. So it helps the consistency of the data, and it means that in the end, when we want to do statistics on our data, we have a uniform description of the data and a more logical divide of our data. Another rule in database
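This lookup-table trick can be sketched in SQLite as follows (the student names are invented; as before, SQLite needs foreign keys switched on explicitly):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("PRAGMA foreign_keys = ON")

# The standards table fixes the vocabulary: only these two spellings exist.
cur.executescript("""
    CREATE TABLE standards (standard_id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE students (
        name        TEXT,
        standard_id INTEGER REFERENCES standards(standard_id)
    );
    INSERT INTO standards VALUES (1, 'Fifth standard');
    INSERT INTO standards VALUES (2, 'Sixth standard');
""")

cur.execute("INSERT INTO students VALUES ('Raju', 1)")  # ID 1 exists: accepted

# A join recovers the uniform label for statistics later on:
labels = cur.execute("""
    SELECT s.name, st.label
    FROM students s JOIN standards st ON s.standard_id = st.standard_id
""").fetchall()

try:
    cur.execute("INSERT INTO students VALUES ('Shiv', 9)")  # no standard 9
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
con.close()
```

Free-text spellings like "5th standard" simply cannot enter the students table anymore; every student points at one of the two controlled labels.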
design is that you never store derived data. If we have a student table where we have the total marks and the total subjects, we don't add a column called average: the average is just the total marks divided by the total subjects, so 100 divided by 10, and we never store that. Derived data, data that can be computed from other columns, should not occur in the table, and this also holds for data files that you make yourself, or that you have other people or other students make for you. Because this is easily computable, you do not want to store the result; you just want to compute it when you need it, instead of having an extra column in the database, making the database bigger, slower and all of these things. So never store any derived data. There is an exception for when the computation is extremely heavy: if calculating the average takes you two days of CPU time, then of course it is okay to store the average, but in this case the average is just dividing two numbers, so it's computed really quickly and you should never store it in the database. Good, so I've been talking for an hour; it's going really quickly, and I actually expected to be much further into the presentation already, but it doesn't matter. If there are no questions so far, I will take a little break, and you guys can take a 10-minute break as well, and then we will talk about RSQLite, which is a way that you can make a database in R and use this database. Good, I don't see any questions in chat, so I will be back in 10 minutes, that's at 2:10, so you guys enjoy the first break. Oh, I forgot what it was, it's either goats or... what's the other one? I think it's goats this time, because I've been working on goat data the whole week. So yeah, I'm going to take a 10-minute break. I'm also going to run an ad break, so that people joining do not get the pre-roll ads; I think those are
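Before the break, the no-derived-data rule in one small sketch, using the 100 marks over 10 subjects from the slide: store the two raw columns and compute the average in the query instead of in a column.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# No 'average' column: it is derived from the two stored columns.
cur.execute(
    "CREATE TABLE marks (student TEXT, total_marks INTEGER, total_subjects INTEGER)"
)
cur.execute("INSERT INTO marks VALUES ('Raju', 100, 10)")

# Compute the derived value at query time (the * 1.0 forces float division).
avg = cur.execute(
    "SELECT total_marks * 1.0 / total_subjects FROM marks WHERE student = 'Raju'"
).fetchone()[0]
con.close()
```

If the marks change, the average can never go stale, because it was never stored in the first place.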
enabled, so when people join the stream they have to watch 30 seconds of ads, and I can just run a one-minute ad for everyone who's watching currently, so that people coming in don't get it at all. All right, I will see you guys in 10 minutes, so have a coffee, have a short break, and I will be back very, very soon.