So today's lecture we're going to focus on normal forms, and again I sort of hyped this up last class saying that they're kind of important, and I think they're useful for you guys to know, even though it is a bit theory-esque. The dirty secret will be at the end I will show that although normal forms seem like a good idea, in practice people don't really follow them. Most people probably don't even know the normal forms exist. And as we go along you'll see it's sort of complicated, right? We'll show how to apply mathematical algorithms to decompose schemas into the proper normal form that we want, but in reality most people don't even do this. Most people just wing it, and you usually end up with third normal form. And the reason why I want to spend time discussing this is that there are always going to be new people coming out with new database systems, and they're going to make claims about how their new database system is much better than a relational database system, all the traditional database systems. And in particular the NoSQL guys are pretty guilty of this. By understanding the normal forms, you'll be able to cut through the marketing fluff and actually understand intrinsically what these systems are doing, and whether the trade-offs they're making in their data model, as well as other things, are actually good for your application. So I'm not saying that NoSQL systems are bad or good because they don't follow the relational model. I'm just saying that they make certain trade-offs, in particular in terms of the normal forms, that may not be appropriate for certain applications. So it's good to understand in a principled way what these things are actually doing. All right, so as I said in last class, the goal this week is really for us to try to understand what it means to have a good database design. So again, we're dealing with this at the application level, right? 
It's you as somebody who's designing a database to store data for some new application you're building at your company or organization. And we're not really concerning ourselves just yet with what it means on the inside of the system that we're actually going to build. What does that mean for us? What do we have to support? We'll get to that starting next week. But again, we need to understand what kind of applications people want to write on our database system. So the two goals we want to achieve for having a good database are the following. The first is that we want to ensure that we maintain the integrity of our data. Meaning if you put good data in, we want to get good data out, right? If we put garbage data in, the database system can't magically clean that up for us, but we obviously don't want to put stuff in and then get back different things later. And then we also want to get good performance. We're not really touching on that too much in last lecture and this lecture, although it sort of crops up in certain cases when we talk about how we actually enforce functional dependencies. So again, this trade-off between correctness or integrity versus performance, this is a recurring theme in databases, and it's going to come up multiple times throughout this semester. For our purposes here, we're really focused on the first one. We're caring about integrity. So today's agenda is we'll spend most of the time on the normal forms, walk through the major ones, and then we'll spend a little bit at the end talking about what it means for a NoSQL system to not follow third normal form and instead denormalize its database. And we'll show what the implications of that are, both from the application standpoint and from a performance standpoint. So last class we spent time talking about functional dependencies, right? 
We showed how, if you're given a schema and a bunch of functional dependencies, we can extrapolate additional functional dependencies using the Armstrong axioms to compute the closure. And then we can start reasoning about what functional dependencies will be held or covered by a particular schema. So now what we want to do is take our schema and our functional dependencies, and we want to search for what are called "bad" functional dependencies. And I'm putting bad in quotes because it doesn't necessarily mean that they are doing the wrong thing from the application standpoint. I really mean that they are difficult for us to enforce, or may end up getting violated based on how we do our schema decomposition. So the idea of what we're going to do for normalization is we're going to pick a table, and then, based on some rules defined by our reasoning about the functional dependencies, we want to split it up into sub-tables. And then we sort of want to keep doing this over and over again, until we end up with what is called a normalized database. And "normalized" is actually an ambiguous term: there's no mathematical definition for what it means to be a normalized database other than as defined by the normal forms. So the way to think about a normal form is that it's a characterization, or a way to reason about, a particular decomposition for a schema in terms of the functional dependency properties that we talked about last class. And when we put the relation back together using natural joins, we may or may not have certain anomalies; we may or may not satisfy the properties we talked about before. Now, in last class I didn't really mention what it means to do natural joins and put the decomposed relations back together. But what you end up doing is generating what's called the universal relation. 
So think about, in your database, if you just combined all your tables together and made one giant table. In Codd's original paper, he called this the universal relation, right? And we sort of showed a few examples of what this looks like, and you'll see why it's a bad thing. And so for performance reasons, and for just us reasoning about the data we have in our database, we want to do our decomposition into a different normalization level. And for that normalization level, what properties it will guarantee are represented by what is called its normal form level. So the three properties we talked about last class were lossless joins, dependency preservation, and redundancy avoidance. I want to go through these at a high level quickly again as a refresher, because it was at the end of last lecture. And again these will be important when we talk about each different normal form: which of these three properties do they guarantee? Right, so the first decomposition property is lossless joins. And the basic idea here is that if we decompose a single relation into two separate relations or subrelations, then when we join them back together using a natural join, we won't generate any bad data, dirty data, noisy data, or incorrect data. We showed an example where if you joined a table back together, you started having rows that didn't actually exist in the original or universal relation. And so if we decompose our tables, we want to make sure that we don't have this problem. And we said this was mandatory, because we don't want to have data that shouldn't actually be there. And the test for this is fairly simple. If you take the intersection of the attributes of R1 and R2 for a decomposition, that intersection has to functionally determine either all of R1 or all of R2. In other words, the shared attributes have to be a super key of at least one of the two subrelations. And if you have that, then you're guaranteed that you're not going to have any lossy joins. 
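The lossless-join test described above can be sketched in a few lines of Python. This is a minimal illustration, not from the lecture: the function names, the FD representation as pairs of attribute sets, and the example attributes are all invented. It computes the attribute closure of the shared attributes and checks whether they determine all of R1 or all of R2.

```python
# Sketch of the lossless-join test for a binary decomposition: R1, R2 is
# lossless iff the attributes they share functionally determine all of R1
# or all of R2. Helper names and FDs here are invented for illustration.

def attribute_closure(attrs, fds):
    """Expand attrs with everything it determines under the FDs."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= closure and not rhs <= closure:
                closure |= rhs
                changed = True
    return closure

def is_lossless(r1, r2, fds):
    shared = r1 & r2
    cl = attribute_closure(shared, fds)
    return r1 <= cl or r2 <= cl

# Example: ssn -> {name, city}, decomposed on ssn.
fds = [(frozenset({"ssn"}), frozenset({"name", "city"}))]
r1 = frozenset({"ssn", "name", "city"})
r2 = frozenset({"ssn", "phone"})
print(is_lossless(r1, r2, fds))  # True: shared {ssn} determines all of R1

# A bad split that shares nothing cannot be lossless:
print(is_lossless(frozenset({"name", "city"}), frozenset({"ssn", "phone"}), fds))  # False
```

The same closure helper shows up again below, because almost every one of these tests reduces to "does this attribute set determine that one?"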
The next thing is dependency preservation, and this one is pretty straightforward. The basic idea is that if you decompose your relation into different subrelations, there is no functional dependency defined in your provided set of functional dependencies that will span multiple relations. So if I have attribute x implies y and z, all three of those attributes x, y, and z will have to appear in a single relation. If they're split across multiple relations, then it's not dependency preserving. Again, we said that this was a nice thing to have, ideally, but it was expensive to enforce, because at runtime you'd have to actually join the two tables to see whether your functional dependency is still valid, still holds. So the way that we can test for this is that we just compute the closure, which again was the explosion of all possible functional dependencies given the provided ones; again we're using Armstrong's axioms to expand this set. And then we just check that every functional dependency in that closure is covered by some single subrelation. The last one, which we were sort of ambiguous about, is redundancy avoidance. And the basic idea here is that you don't want to have repeated values across your tuples in a single relation. We want to do this, obviously, because it's a waste of space. And then we have the problem where we may have an update anomaly: if we have a particular entity replicated multiple times across different tuples, and we have to change something about it, then we need to make sure that we update all of our tuples that have that repeated value. And so the test for this is pretty straightforward, right? For a given functional dependency, x implies y, if x implies y is covered by a particular subrelation Rn, then x just has to be a super key of Rn. Meaning for a given value of x, you'll get a specific or unique value of y. If you don't have that, that means you'll have duplicates. 
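The super-key test at the heart of that redundancy check can also be sketched in code. Again this is an illustration with invented names, assuming FDs are represented as pairs of attribute sets: x is a super key of a relation exactly when the closure of x covers every attribute of the relation.

```python
# Sketch of the redundancy test just described: for an FD x -> y covered by a
# subrelation Rn, x must be a super key of Rn. Names are invented for
# illustration; attribute_closure expands x using the given FDs.

def attribute_closure(attrs, fds):
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= closure and not rhs <= closure:
                closure |= rhs
                changed = True
    return closure

def is_superkey(x, relation, fds):
    """x is a super key if its closure covers every attribute of the relation."""
    return relation <= attribute_closure(x, fds)

# R(A, B, C) with A -> B and B -> C:
fds = [(frozenset("A"), frozenset("B")), (frozenset("B"), frozenset("C"))]
R = frozenset("ABC")
print(is_superkey(frozenset("A"), R, fds))  # True: A+ = {A, B, C}
print(is_superkey(frozenset("B"), R, fds))  # False: B+ = {B, C} is missing A
```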
So again, as we go through examples of the different normal forms, these different issues will come up. So the history of the normal forms goes back to the very first paper from Ted Codd in 1970. In that first paper, he reasons about the universal schema, which is just the giant single table of everything that's in your database. But he also talked about the first normal form. And then later, in 1971, he defined the second normal form and the third normal form, which again are more restrictive, more refined versions of the first normal form. And then in 1974 he teamed up with this guy Ray Boyce, who was at IBM Research, and they defined what's called the Boyce-Codd normal form, which is a more restrictive version of the third normal form that sits in between the third and fourth normal forms. So they came out with a paper in 1974 that defined this. Ray Boyce was actually one of the co-creators of SQL, along with Don Chamberlin at IBM. And then he died of an aneurysm in 1974, so his last paper was the Boyce-Codd normal form paper. If you google his name, you see all these old dudes that are clearly not him, because he died when he was 27. Okay, so that's the history. So again, the normal forms are part of the original relational model work. There's a lot of work in this area in information theory that we're not really going to cover. But the main ones that we're going to focus on here, the ones that are in the textbook, are the seven normal forms that you have here. The way to understand this list is that as you go from the top to the bottom, you end up with more restrictive schemas. Meaning there are more properties that have to be enforced or guaranteed by your schema in order to reach a certain normal form level. So the two that we actually really only care about in the real world are the two guys in the middle here: the third normal form and the Boyce-Codd normal form. 
These are the most common that you actually see. This is what people normally try to achieve when they do database design. And again, the dirty secret is that if you just take an ER diagram and convert it to SQL, you usually end up in third normal form. So you don't have to do anything special to get there; it's just that there is a mathematical definition for what it means for your schema to be in third normal form. The first normal form is just when you have a single table, all your attributes are atomic, and you don't have any repeating groups, which I'll show you what that looks like in a second. The second normal form, depending on who you talk to or what textbook you read, they say is obsolete. They say that people shouldn't do this. But in practice, I think there was a survey maybe 15 years ago where a scientist went and looked at actual real-world database schemas, and a lot of them end up being in second normal form, right? Because it's just sort of good enough. We'll see what can happen in second normal form, but a lot of times people just do this and it is what it is. The fourth and fifth normal forms are basically about multi-valued dependencies. We don't need to really worry about this; if you're curious, it's in the textbook. I can't actually give you a formal definition because I don't care, because I've never had to do fourth and fifth normal form. But I just want to say that they do exist if you want to learn more in the textbook. Then later on there's this thing called the sixth normal form, and now you start to get into weird stuff, right? Now you start to get into hardcore information theory that has no real practical use in the real world. I don't think there's a seventh normal form, but there are a bunch of additional normal forms beyond this that have come out over the years. So you have domain key normal form, elementary key normal form, and so forth. 
So you can sort of see the years that these things were all proposed. And again, there's no database system that I'm aware of that can actually enforce all of these things, all these esoteric normal forms. This is sort of star gazing done by theoreticians: oh, the database could do these things, right? These things have certain properties that maybe people actually care about, but in practice nobody does these things, and everybody cares about mostly the middle guys here. And again, typically by default you end up in the third normal form, all right? So another way to think about this is that it's a hierarchy, where if you're in a higher-level normal form, then you're automatically in the lower normal forms. I don't know about the more theoretical ones, but at least from 1NF up to 5NF, whatever level you're in, you're also in the levels below it. So if your schema is in 3NF, it's also in 2NF and 1NF, right? And so forth for the other ones. And again, the ones we care about are the ones in here, because these are actually the most practical. And from a human standpoint, from a common-sense standpoint, to me they make the most sense, right? Whereas the other ones are a bit more mathy than I would like. All right, so let's start with the first normal form. This is pretty easy to understand. The first normal form has two requirements. The first is that all of the types, so all the attribute values, have to be atomic, or scalars. And the second is that you can't have repeating groups, right? So for this particular example here, what's the first violation we see? What's that? What's wrong with c_name? It's an array, right? So it's not a scalar, right? So this would violate the first normal form. And so instead, what you want to do is maybe have it be a single attribute. All right, so you have an atomic value. 
But now, if I did this, let's say that my loan had two customers, right, me and DJ Snake. So now I have two columns for the customer name. What's the problem with this? What's that? Right, repeating groups, exactly. So from a practical standpoint, this seems like a bad way to actually implement this, right? Because the bottom two loans only have one customer name, so they're wasting space for these other fields that are not actually being used. And in this example here, there's one loan with two customers, but what happens if there's somebody who has three customers, or four, or five, or six, right? Every single time I need to accommodate more people, I have to go back and update my schema. So the first normal form basically says that you have a single attribute for repeated values, and then you just end up making another tuple, right? So this particular schema here is considered valid first normal form. This is pretty straightforward to understand. So what's an obvious problem with this, though? Mm-hm. The question is, how do you define that there are some groups that are repeating? Right, so again, I need to store multiple values for a particular attribute, and I'm representing that with multiple attributes, right? I have multiple customers that are assigned to this loan, so what I do is make an extra attribute to represent each one of them. The first customer name is me, the second name is DJ Snake, and so forth for all the others. His next question is, if the third c_name was not there, would this still be considered repeating groups? Yes. And then: is this something we have to see from our perspective, or can the computer determine it? 
Right, so his question is: is this something that we as humans, as the actual people designing the database, have to reason about, something the computer can't do for us? Absolutely. I can easily make this table in Postgres or any relational database and it's fine. The database system will only enforce the things that it can enforce. This is sort of high-level guidance, theoretical guidance, for how we actually want to design a database. But yeah, the database system can store this without any problems. Now, I will also say, again, it's from a theoretical, purist point of view that this is a bad thing you don't want to do. In practice, many database systems actually now support array types. I think at this point it might be in the SQL standard, it may or may not be. But in Postgres, you can easily define an array type. So this would actually be valid; you could put this in. From a theoretical standpoint it violates the first normal form, but in practice, maybe that's okay. Okay, so my question was, when we went to this schema here, I said this is valid first normal form, but what's an obvious problem? Redundancy, absolutely, right? So we have me and DJ Snake on a single loan, and we're repeating the loan amount. And then we have the branch name of the bank and the amount of assets that they have, and that's being stored redundantly as well. So we can now refine this even further in the second normal form. In the first normal form, I didn't care about any of the functional dependencies, right? I just cared about whether things were atomic and didn't have repeating groups. Now, to go to the second normal form, I actually have to reason about my functional dependencies. And for this, we're going to say that to be in the second normal form, you have to be in the first normal form, obviously. So everything has to be atomic, and you can't have any repeating groups. 
But then, additionally, any non-key attributes have to fully depend on the candidate key for the relation. So let's go through an example of what I mean by this. Take this functional dependency down here: we have loan ID implies the amount and the branch of the bank that made the loan. So what we'll do now is decompose our single relation into multiple relations, where the loan ID is being used as the join key to go from the first relation to the second relation. So now all of the data that the loan ID implies, based on the functional dependency, is stored in this single relation here, right? So now I actually don't repeat anything: before, over here, I was repeating the amount of the loan, but when I do this decomposition, the loan amount is only stored once, right? There's no more redundancy. Then we can apply this again for the second functional dependency, where the branch name implies the assets. And again, here we see in our assets field that we're repeating the total amount that the bank has multiple times. So now we can decompose this, split it up. So now we have the bank table that has the single branch name and the assets, and then we have a sort of cross-reference table that goes from the bank name and loan ID to the loan. So this is considered valid second normal form, because now we can say that the non-key attributes fully depend on the candidate key, right? So we have a non-key attribute here, assets, and that depends on the candidate key there. The middle one's a little more complicated, but same over here: the branch name and the amount depend on the loan ID, and they're stored separately there. So this is what I meant earlier when I said that the second normal form is good enough, right? This is good enough. It's pretty close, yes? Question: when you say non-key attributes, does that mean foreign keys? No, I mean candidate keys; we're not really talking about foreign keys. 
So take the first functional dependency, branch name implies assets, right? Branch name is the candidate key in this relation, because with a single branch name I can get all the information I need from that relation; there's only one other attribute, assets. So everything is determined there. Okay, and in the middle relation there is no candidate key? Correct, yes. Because there's nothing in the functional dependencies that really implies how we should maintain the customer name, the branch name, and the loan ID together. Yeah, well, you could, yes, but then you'd go into another normal form, right? At this point, again, the basic rule says we have to do this, and this is valid. Obviously, we could refine this thing further, because we're repeating the branch name here, right? But when we do that refinement, we're going to a lower level, right? And again, this is what I was saying about the second normal form often being just good enough: this is not perfect, this is not great, but it's not the end of the world if we have this, right? So rather than jumping to the third normal form, I want to jump one level lower and do the Boyce-Codd normal form, BCNF, because then we'll see what the issue is with BCNF and how, by going back up, we avoid that problem. All right, so just bear with me. Okay, so in BCNF we're going to guarantee that we have no redundancies like we saw under 2NF, and we are going to have lossless joins, but we're not going to be dependency preserving. And we can mathematically define Boyce-Codd normal form to say: if you have a relation R with a functional dependency set F, we say that it's in Boyce-Codd normal form if, for all non-trivial functional dependencies in the closure of F, the left-hand side is a super key of that relation. So for non-trivial functional dependencies: something like X implies X would be a trivial one; X implies another attribute, X implies Y, is considered non-trivial. 
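That definition can be sketched directly in code. This is an illustrative check with invented names, assuming FDs are pairs of attribute sets; as a simplification it walks the given FDs rather than the full closure, which is enough for small examples like the ones in this lecture.

```python
# Sketch of the BCNF definition above: R is in BCNF if, for every non-trivial
# FD X -> Y that applies to R, X is a super key of R. Names are invented for
# illustration; checking the given FDs directly is a simplification of
# checking every FD in the closure.

def attribute_closure(attrs, fds):
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= closure and not rhs <= closure:
                closure |= rhs
                changed = True
    return closure

def is_bcnf(relation, fds):
    for lhs, rhs in fds:
        if not (lhs | rhs) <= relation:  # FD doesn't apply to this relation
            continue
        if rhs <= lhs:                   # trivial (e.g., X -> X)
            continue
        if not relation <= attribute_closure(lhs, fds):
            return False                 # left-hand side is not a super key
    return True

# R(A, B, C) with A -> B and B -> C:
fds = [(frozenset("A"), frozenset("B")), (frozenset("B"), frozenset("C"))]
print(is_bcnf(frozenset("ABC"), fds))  # False: B -> C but B is not a super key
print(is_bcnf(frozenset("AB"), fds))   # True
print(is_bcnf(frozenset("BC"), fds))   # True
```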
So if we compute our closure, and then for all of these non-trivial functional dependencies the left-hand side is a super key for that relation, meaning I can get all the attributes for that relation just by having that one value, then it's considered to be in Boyce-Codd normal form. That's very dry, so let's unpack it and walk through an example. All right, let's keep it really, really simple. We have a single relation R with three attributes: A, B, and C. And then we have two functional dependencies, A implies B and B implies C. So the first thing we want to do is compute the closure of F, which again, we use Armstrong's axioms to explode this. And then for each of those, we can examine them to see whether the left-hand side is a super key. So the first one, A implies B. We know if we have A, we can get all the other attributes in our relation, right? Because if you have A, you get B, and if you have B, then you can get C. So with A, you can get the other two, right? So A is a super key. Same thing for A implies C, right? We already showed in the last step that A was a super key. But this last one here, B implies C: you can't get A from B, so B is not a super key. And therefore, this particular relation is not in BCNF, right? Because again, you look at the functional dependencies, and if anything on the left-hand side is not a super key in that relation, then it's not BCNF. So now, to put it in BCNF, you do decomposition. I take that same relation R, and now I'm going to make two subrelations, R1 and R2. And now I want to check whether the total schema, the one comprised of R1 and R2, is in BCNF. I'm going to check R1 and R2 individually, right? So for the first one, R1, I take the closure of F, and then for all the non-trivial functional dependencies, I check to see whether the left-hand side is a super key. So for R1, we just have A and B. 
And with the functional dependency A implies B, we can obviously get B, so we're covered here. So A is a super key of R1, and this is okay. Now we check R2: B implies C, so B is a super key of R2. Therefore, both relations R1 and R2 are valid BCNF schemas, and therefore the total database is in BCNF form. Yes? How did I decide how to split it up? His question is, how do you decide what to split on? Two slides. We'll get there. This is just showing you how to check whether a schema is in BCNF. So more formally, we can say that if we're given a schema R and a set of functional dependencies, we can always decompose it into subrelations, and those subrelations will always be in BCNF form. And we can guarantee that the decompositions are lossless, meaning if we take the natural join of all our subrelations and put everything back together, we'll always get the original table. But some BCNF decompositions, as we'll see in a second, may end up losing dependencies. Remember, we said that dependency preservation meant that, for a given functional dependency, all of the attributes involved in it can be found in a single relation. If it spans multiple relations, then you're considered to be losing the dependency. So the algorithm to actually compute this is pretty straightforward. I'm not going to go through it line by line here to say exactly what's going on, but I'll walk through an example to make this more clear. The basic idea is that you're going to first compute the closure of F, and then you're going to take all the relations you have in your database, which at the beginning is just a single relation, and you throw them in this set called result. And then you just keep iterating recursively over the result set. You'll find any relation that's not in BCNF, and then you'll decompose it into subrelations, and you'll split it on a functional dependency that is covered in just one of them. 
And where the left-hand side is not a super key. What you end up with at the end of this iteration is two subrelations from your original relation, where the first one will include the right-hand side of the functional dependency that you split on, and the other one will not include it. And that means when you take these two and put them back in the result set, you're now adding relations or schemas that are in, or closer to, BCNF form. And you just keep doing that for all your relations until you can't decompose anymore. So in this case here, when you decompose your relation, one of them will be in BCNF; the other one may not be, so you've got to come back and recursively apply the same decomposition algorithm on it again. And you keep doing this till you can't do it anymore, and now you're considered in BCNF. So let's walk through an example, because this is a bit dry. I have a single relation, and now we're going to have four attributes: the name, the social security number, the phone, and the city. And for this, we see obviously that we have redundant information, because in our universal schema my name is repeated multiple times because I have multiple phone numbers, which is not true. And then DJ Snake, he's repeated multiple times because he has multiple phone numbers as well. So we see a lot of redundant information here. So the first thing we're going to do is take our functional dependency, social security number implies name and city, and compute the closure on this, which is just generating the more basic forms: social security number implies name, social security number implies city, and social security number implies name and city together. So then, with the closure, we look at the relations in our result set. In this case, at the very beginning, it's just R, because we haven't done any decomposition yet. 
And then we're going to choose to split it up, decompose it into subrelations, based on the functional dependency whose left-hand side is not a super key. So in this case here, social security number implies name and city, and we know that social security number cannot get us the phone number. So what we're going to end up doing is split the relation based on the social security number, where one relation will have the things that can be taken care of by the super key: social security number, name, and city. And the other one will have the thing that can't be taken care of by social security number, so that's social security number and phone number. So again, the first one, R1, will have name, social security number, and city, and the second one will have social security number and phone. And then you end up with subrelations like this. And now, also, I can define what the primary key is for our relations, because I know that in the case of R1, a given social security number implies the name and the city, and therefore it's a super key. So that can be our primary key. And then in the second relation, social security number and phone number have to be unique for the entire relation, so the two together can be the primary key. We're not even defining foreign keys here, but it's sort of implicit, because the social security number in R1 has a foreign key relationship with the social security number in R2. So now in this case here, we have two subrelations, and we want to go and check whether they're in BCNF. And we said that we had to check that the left-hand side of the functional dependency in each relation can get us all the information that we need, and therefore is a super key. So again, in the first one, social security number gives us the other two attributes. And then the other one, social security number and phone number. 
If you have the two of them together, you get everything, because there are only two attributes. We can also check whether we're lossless: we join these two together, and we get back the same table. And then we can also check whether we have any anomalies here. Could we have any update, insert, or delete anomalies from this? Yes or no? So what's an update anomaly? An update anomaly is what? That if values are repeated multiple times, then when you need to update the value, you have to update multiple tuples. We don't really have that problem here, because each value is in a single relation. So if we update a social security number in the first table, we're fine there; in the second table, it'll implicitly update everything else. We can't have any delete anomalies, meaning I can delete all my phone numbers without deleting the people. And for insert anomalies, I can insert a person without requiring them to have a phone number. So we don't have any of those anomalies. So this is considered valid BCNF. And again, it seems sort of obvious, right? We said that we couldn't have an array type for an attribute, so we had to break it up into subrelations. And then we didn't want to have the redundant name information for all these different phone numbers, so we broke the phone number out into a separate subrelation. And then, lo and behold, we end up being in BCNF, for what seems like an obvious thing to do. Question: couldn't you update the social security number in one table and lose the link to the phone numbers? Yeah, but there's no functional dependency defining that. There's no functional dependency defining that the social security number between two tuples has to be exactly the same, right? We're not defining the foreign keys. The foreign key would say you can't have a social security number here that isn't in this one, right? But there's nothing within the relations to say you wouldn't have this problem. 
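The decomposition walked through above can be sketched as the textbook recursive algorithm: find a violating FD X implies Y where X is not a super key, split the relation into the part X determines and the part it doesn't, and recurse. This is an illustrative sketch with invented names, not the lecture's exact pseudocode.

```python
# Sketch of recursive BCNF decomposition: on a violating FD X -> Y, split R
# into (X+ intersect R) and (X union (R - X+)), then recurse on both halves.
# Names are invented for illustration.

def attribute_closure(attrs, fds):
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= closure and not rhs <= closure:
                closure |= rhs
                changed = True
    return frozenset(closure)

def decompose_bcnf(relation, fds):
    for lhs, rhs in fds:
        if not (lhs | rhs) <= relation or rhs <= lhs:
            continue                        # FD doesn't apply, or is trivial
        x_plus = attribute_closure(lhs, fds) & relation
        if not relation <= x_plus:          # lhs is not a super key: violation
            r1 = frozenset(x_plus)
            r2 = frozenset(lhs | (relation - x_plus))
            return decompose_bcnf(r1, fds) + decompose_bcnf(r2, fds)
    return [relation]                       # already in BCNF

# The name/ssn/phone/city example: ssn -> {name, city}.
fds = [(frozenset({"ssn"}), frozenset({"name", "city"}))]
R = frozenset({"name", "ssn", "phone", "city"})
result = decompose_bcnf(R, fds)
print(sorted(sorted(r) for r in result))
# [['city', 'name', 'ssn'], ['phone', 'ssn']]
```

Running it on the example reproduces the two subrelations from the walkthrough: one with name, social security number, and city, and one with social security number and phone.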
All right, so here's a contrived example where you can have a problem with BCNF. So let's say we have a really, really simple database of a company that sells products. And in our relation, we're going to have the item that they're selling, the company that produces it, and the category that the item exists in. And let's say we have two functional dependencies, where an item implies the company, meaning an item can only be sold by one company. But then I have another functional dependency that says the company and the category imply what item they're selling. We'll see when I actually show the schema, or rather the actual database instance, that this seems kind of weird. But from the relational model's standpoint, this second functional dependency is actually valid. It may not be what your application wants, but it'll produce the anomaly, or the problem, that we can have with BCNF. So for this, we're gonna find that the super key is gonna be the item and category. Because if you have the item, then you can get the company, and once you have the company and the category, the second dependency gets you the item. So that's our super key for this relation. So say now we wanna decompose it to put it into BCNF, where we split it up so that the item gives us the company in one relation and the item gives us the category in the other, like this. So now when I have my subrelations like this, this is valid for BCNF, because we'll be able to enforce all of the functional dependencies that are within a single relation. It's lossless, meaning if we put the two back together, we'll get the same data. But we actually can't enforce this functional dependency over here, because now it spans two relations, right? So BCNF is not always dependency preserving. So this is fine for us as far as the local functional dependencies go, the things that are covered within a single relation.
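The super-key check just done by hand, asking what a set of attributes can "get you", is the standard attribute-closure computation. Here's a minimal sketch: the fixpoint loop is the textbook algorithm, and the data is the lecture's item/company/category example.

```python
def closure(attrs, fds):
    """Compute the closure of a set of attributes under functional deps.

    fds is a list of (lhs, rhs) pairs, each a set of attribute names.
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If we already have the whole left side, we gain the right side.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

# The example's dependencies: item -> company, {company, category} -> item.
fds = [({"item"}, {"company"}),
       ({"company", "category"}, {"item"})]

# {item, category} determines every attribute, so it is a super key...
print(closure({"item", "category"}, fds))  # all three attributes
# ...but {item} alone is not: it never reaches category.
print(closure({"item"}, fds))
```

A set of attributes is a super key exactly when its closure is the whole relation, which is what the first call shows.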
But when we actually join it back together, this is still valid in the sense that we didn't get any information loss; we still have all the tuples we expect to get. But the difference is now that we can violate the functional dependency that was originally defined on the relation. So this is basically saying that you can break things up, and the individual relations are fine, and you'll be able to enforce their functional dependencies, but when you put things together, your universal schema may end up violating a functional dependency that shouldn't have been violated in the first place. This is sort of weird, right? The basic idea to understand is that when you join things back together, you don't lose information, but you may end up violating a functional dependency. And that's what it means to lose a functional dependency, which is what we saw with BCNF, right? You lose it because it spans multiple relations and can't be enforced within any single one. And the reason why that matters is because when you put it back together, you can end up with, technically, an incorrect database state. Right, so to say this again in a more formal way: we started with a relation R and its functional dependency set. And then we used that simple algorithm to decompose it into subrelations, and each of those subrelations is gonna have its own local functional dependency set. And within a single relation and its functional dependency set, we can guarantee that we have a valid database instance. And we can also reconstruct R by doing natural joins to put it back into the correct state it was in; it's a lossless join, we don't have any garbage data. But we actually can't reconstruct the functional dependency set, right? Our database instance after the join may end up violating it. So this is the problem with BCNF: it's too restrictive in how you're decomposing things based on the rules. And this is what the third normal form addresses.
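Here's a small sketch of that failure mode with made-up data. Each decomposed relation satisfies its local dependency, and the join is lossless, but the reconstructed instance violates the global {company, category} -> item dependency that neither piece could enforce on its own.

```python
# BCNF split the relation into (item, company) and (item, category).
# Sample data is invented; each piece is locally fine.
r1 = [("gizmo", "acme"), ("widget", "acme")]    # item -> company holds
r2 = [("gizmo", "tools"), ("widget", "tools")]  # just item/category pairs

# Natural join on item: no spurious tuples, so the join is lossless.
joined = [(i, c, g) for (i, c) in r1 for (i2, g) in r2 if i == i2]

# But now (company, category) maps to two different items,
# so the global FD {company, category} -> item is violated.
seen = {}
violations = []
for item, company, category in joined:
    key = (company, category)
    if key in seen and seen[key] != item:
        violations.append(key)
    seen.setdefault(key, item)
print(violations)  # the FD pair that no single subrelation could check
```

Nothing in either subrelation stopped us from inserting this data, which is exactly the dependency-preservation problem.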
So the third normal form will actually preserve our functional dependencies, both the local ones and the global ones, but we may end up having some anomalies, because we're gonna have redundant data. And we can define this more formally: say we have a relation R with a functional dependency set F. We say that it's in third normal form if, for every functional dependency X implies Y in our closure, that functional dependency is either trivial, right, meaning X implies X, or X is a super key, or Y is part of a candidate key. And again, your eyes are glazing over as I tell you this. Let's walk through an example, both on the slides and through Postgres, and hopefully this will become more obvious. So the algorithm we're gonna use to decompose a relation into third normal form is slightly different than what we saw with BCNF. With BCNF we took every relation and recursively applied our decomposition to keep breaking them up. The idea with third normal form is that we're gonna start with nothing, and then we're gonna build relations by figuring out which ones can be covered by the different functional dependencies that we have in our canonical cover. Again, the canonical cover is the minimal set of functional dependencies that will satisfy the provided set of functional dependencies. And if we end up with a result, as we build out our relations, that is not considered lossless, then we just add another relation with the appropriate keys, so that when we join things back together we end up with the correct result that we're looking for. All right, so let's look at an example. So again, in the first step you compute the canonical cover, and this is just applying Armstrong's axioms to reduce it down to this minimal form. For this example we're using ABC, where A implies B and B implies C. The canonical cover is just the same thing, right?
There's no reduction you can do for this. So now we're gonna split R based on those functional dependencies. So we'll select the first functional dependency, A implies B, and we'll build a relation that covers that. And then for B implies C, we'll build a second relation that covers that. So now our schema looks like this. This ensures that we enforce all our dependencies, and that when we join them together they'll also be enforced. But now the problem is that when you actually do the join to check to see whether it's lossless, you end up with garbage data: you have tuples that did not exist in the original universal schema. So the algorithm says the way to rectify this is that you now need to add another relation that contains the keys that you wanna join on that are not being covered by these functional dependencies. So in this case here, it'll be a third relation where we have A and C together, so that now when you join all three together, you end up with the correct state of the database. And then this is considered to be in valid third normal form. So we have redundant information, right? We now have an additional relation with the values of A and C, and technically this is duplicating the values of A in the first relation and the values of C in the second relation. But we need this so that when we combine things back together, we get back a lossless join. So again, the difference between 3NF and BCNF is that we will have redundant information in 3NF, but we'll be able to guarantee that all our dependencies are preserved. So this is trivial to test in something like Postgres. All right, can everyone see this? Right, so I have my first relation, right? It has three tuples with the A and B values, and then I have my second relation that has the B values and then all the values for C for them, right?
So now when I do a natural join between these two guys, I'm actually getting incorrect results, because I should have four tuples, and what I'm missing now is the second entry for A2, because it's not being captured in the other relation. So now if I add another natural join, or say select star from R3, adding that third relation that I talked about... actually, I think we're missing a tuple too, right? Insert into R1 the missing A2 tuple. So now when I do my natural join with R3, we get the four rows that we got from the original decomposition. So again, a natural join is just matching up the tuples based on the column names, right? In R1 there's a column called B, and in R2 there's a column called B, and it matches the names and does the join based on those. And I didn't have to actually define any foreign keys. So just to summarize what we've covered: with BCNF you're not gonna have any anomalies, but you may end up losing some functional dependencies, and in practice this is what you want. In 3NF you keep all your functional dependencies, but you may have some redundant information, so you may have some anomalies. Yes? Right, so the original table had this. From the schema standpoint it's still valid, because if I have a value of A then I get the same value of B. So again, the relational model, or at least SQL, works on unordered bags, so you can have duplicate tuples. If you want distinct tuples, you add the distinct clause in your SQL statement, right? So it has to do with the fact that when you put it back together, you end up getting more than you actually should, which is allowed because you're not joining on the C that you're missing. (Student: you'd just be losing the last row in your table.) Right, but from an applied standpoint, yes; from a theoretical standpoint, there should be two tuples. In reality, you're right.
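The same demo is easy to reproduce outside of Postgres. Here's a sketch in SQLite from Python, with made-up values standing in for the ones on the slides: joining just the two dependency-covering relations manufactures spurious tuples, and adding the third relation with the (A, C) key filters the join back down to the original instance.

```python
import sqlite3

# R1 covers A -> B, R2 covers B -> C; R3 holds (A, C) so the join is
# lossless. Values are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE r1 (a TEXT, b TEXT);
    CREATE TABLE r2 (b TEXT, c TEXT);
    CREATE TABLE r3 (a TEXT, c TEXT);
    INSERT INTO r1 VALUES ('a1','b1'), ('a2','b1');
    INSERT INTO r2 VALUES ('b1','c1'), ('b1','c2');
    -- the original universal relation only contained these combinations:
    INSERT INTO r3 VALUES ('a1','c1'), ('a2','c2');
""")

# Joining just R1 and R2 pairs every (a, b1) with every (b1, c):
# four rows, two of which never existed in the original relation.
two_way = conn.execute(
    "SELECT a, b, c FROM r1 NATURAL JOIN r2").fetchall()
print(len(two_way))

# Adding R3 filters the join back down to the true original state.
three_way = conn.execute(
    "SELECT a, b, c FROM r1 NATURAL JOIN r2 NATURAL JOIN r3").fetchall()
print(sorted(three_way))
```

The redundant (A, C) relation is exactly the price 3NF pays to keep the join lossless while preserving both dependencies.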
They're the same, so they're redundant. All right, so again, BCNF is what you probably want to strive for when you design your database application. 3NF is usually what you get, though, when you take the ER diagram or UML diagram that you used to design your database and just write it out to SQL; you sort of get this for free. So the dirty secret about the normal forms is that, as far as I know, this is not how people actually design databases. Right, nobody says, all right, here's the functional dependencies for our application, let's bust out the textbook and look at the algorithm to decompose these things into Boyce-Codd normal form. Nobody does this. There's been a ton of literature, and in every database class that you could ever take at any university, they will spend a lot of time on the normal forms and things like that. And I'm guilty of this too, because I just spent an hour teaching this, plus functional dependencies last class. So in academia, database professors think that normal forms are, you know, the greatest thing ever, and of course this is exactly what you want to do, right? But in practice, nobody actually does this. I've yet to have a student come back to me and say, like, oh yeah, thank God you taught me how to do decompositions, because we totally had to use this. And instead what you see is that people don't think in terms of functional dependencies or these normal forms; you actually think instead in terms of sort of object-oriented programming, because this is how you're actually going to end up implementing your application, right? So if you've ever done any development using any of these web frameworks: Django, Ruby on Rails, Node.js; Hibernate is not necessarily a web framework, but it's a library to talk to the database.
In these application frameworks, you end up defining objects in, you know, Python code, Ruby code, Java code, whatever, and the object-relational mapping library converts them into the underlying SQL statements to create your database. And so at no point do you ever go look and see, well, what normal form is my database in? Because you don't care. Because you're writing your things in terms of objects. So this is sort of one of the main motivations for the NoSQL systems. It's a well-known problem that mapping an object-oriented program to a relational database is non-trivial, right? Because, as I said in the beginning, you couldn't have array types, but it's very common in objects to have arrays. And so there was a big movement for object-oriented databases in the late 1980s and early 1990s. There were a bunch of startups that built these kinds of systems. You've never heard of any of them because they're all gone. But this basic idea came up again in the late 1990s and early 2000s, because there were all these XML databases, and then 10 years ago all these JSON databases came out, these document databases, and they're basically making the same argument: that people as programmers don't write programs thinking about normal forms. You write in terms of objects, and you want your database to store objects. So this is essentially what led to the NoSQL systems. So one of the key tenets of the NoSQL movement is that they argued that joins were slow, so therefore you want to denormalize all your tables. You don't want to do the decomposition that I just showed you to break things up into sub-tables, because every single time you've got to put an object back together, you have to join them and stitch everything back up in memory in your program, right?
And maybe your ORM, the object-relational mapping library you're using in your application framework, is smart enough to know that, oh, I can do this as a join versus having to do multiple queries to go collect all the data you need in your object. So when the web took off in the early 2000s, people actually started needing what I call a high-performance database system. They needed a database system that could sustain a lot of operations, a lot of users accessing the website all at the same time. And if you have a complex application and you're in Boyce-Codd normal form, then reconstructing an object for something in your application would take a lot of joins and could potentially be very expensive. So there's two things about the NoSQL movement. We'll cover the first one now, and we'll cover the next one later when we talk about transactions. But I'll say now that later on you'll see that not only do they drop protections for the integrity of your database by denormalizing, they also drop protections for doing updates and modifications by not providing you transactions. There are justifications for making both of these assumptions, but they sort of claim that, like, oh, of course you want to do this, you never want to use transactions, right? Which is not true. And the same thing: they would say, oh, normalizing your database is a bad idea, the relational model is a bad thing. That's not true. There are some cases where they are actually better, and some cases where the relational model would be better, for both of these. So, the document databases that came out in the last 10 years or so: the one that you're probably most familiar with is MongoDB, right? And so, again, as we said in the second lecture, the way to think about a document database is not in terms of Microsoft Word files or PDFs. Think of it as a JSON object or an XML file. And the first version of MongoDB didn't actually support joins.
More recently, in the last year or two, they added server-side joins. But it used to be that they would argue that the way you actually want to join data together is to just denormalize everything and nest everything together inside of a single JSON object, so that when you want to go fetch the thing you need for your application, it's one call to the database to go get everything you need. You don't have to do a join. If you end up needing to do a join, they would say you have to write your join in the application code, meaning you basically write a for loop to iterate over objects and go grab them in multiple requests. We'll cover this later on, but that's almost always a bad idea, because I guarantee you that your JavaScript application would not be able to do a join as efficiently as the one the Oracle guys wrote in their database system in C++. But that's beside the point. So again, they're going to argue that you want to do what are called pre-joins in your collections by embedding the objects inside of each other. And again, MongoDB is probably the most famous one. MarkLogic was an XML database that came out in the early 2000s; it's a precursor to all these things. RavenDB is a document database for Windows. Couchbase is the former Membase, which is the former CouchDB; it's gone through a bunch of different names. And then RethinkDB was another somewhat famous database system. If you read Hacker News, they would always go crazy about this one. They went bankrupt last year, but now it's actually open source. RethinkDB was sort of being pitched as a better-written version of MongoDB. At the time maybe that was true, but maybe not so much anymore. All right, so let's look at an example here. So say now, in Boyce-Codd normal form, I'm going to model an application where a customer has orders, and an order has order items. Think again of your Amazon account: you have an Amazon account, that's the customer account.
You make purchases, which are orders, and then within each order you buy different items. So in Boyce-Codd normal form, you would model it like this, where you would have separate relations to represent each of these entities, and then inside each relation you would have the keys you need from your decomposition to be able to join things back together, to get back to the original universal schema. So if you wanted to get a single customer record and their orders and all the items they bought, you would do a three-way join between these two relations, or sorry, three relations, and that would get you back the data that you need. In the MongoDB world, the document data model world, they say what you really want to do is have a single customer object, and inside that object you're going to have an array that has all your orders, and then for each of those orders you would have all your order items. What normal form is this? (Student answer.) Yeah, it's actually sort of worse than first normal form, because you have an array, right? So it's technically below first normal form. Again, if you think about this, when you actually look at the JSON object, here I'm now storing the item ID and the price inside of an object which is inside of an array, and that's inside of the orders array, right? So you have again this sort of nesting going on. So now let's say that I also record information like what item was bought and what sort of metadata it had; you need to either embed that inside of the inner object or store that in another relation, right? So they would argue that this is the right way to go. And I would actually agree that in some cases, maybe not for this particular example, but for some cases, storing your data this way physically is the right way to do this. What's the keyword I just said there? Physically, correct, yes.
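As a sketch of the shape being described (all field names and values invented for illustration), the denormalized document and its flattened, normalized equivalent might look like this in plain Python:

```python
# The single nested customer document, MongoDB-style: orders embedded
# in the customer, items embedded in each order.
customer = {
    "customer_id": "c1",
    "name": "Alice",
    "orders": [
        {
            "order_id": "o1",
            "order_items": [
                {"item_id": "i1", "price": 19.99},
                {"item_id": "i2", "price": 5.0},
            ],
        },
    ],
}

# Flattening it back into the three decomposed relations is mechanical:
# each level of nesting becomes a table carrying its parent's key.
customers = [(customer["customer_id"], customer["name"])]
orders = [(o["order_id"], customer["customer_id"])
          for o in customer["orders"]]
order_items = [(it["item_id"], o["order_id"], it["price"])
               for o in customer["orders"]
               for it in o["order_items"]]
print(order_items)
```

Reading the whole customer is one lookup in the nested form, but any query that starts from an item or an order has to dig through the nesting, which is the trade-off the lecture is getting at.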
So the document data model sort of blurs the line, the separation between the logical layer and the physical layer, right? Not using joins is not necessarily a byproduct of using the document model. Like, you can use a document database system and decompose your collections, which is what they call relations, into BCNF, and that's just fine. Whether the system supports joins or not is an engineering question, right? MongoDB now supports joins. But they would claim that denormalizing in the document data model actually makes everything more natural to query. And I would disagree on that point, because now what you end up doing is baking into your application code the physical layout of how your data is actually being stored, because no longer do you have a collection or relation for the customer and a separate collection or relation for the orders; they're now embedded inside of each other, right? And there's actually nothing in the relational model that prevents you from storing it physically in the way that they prescribe. And I agree, the way they propose here is actually better in that regard: if I want to get a single customer, all their orders, and their order items, it's a single fetch from disk or memory, whatever it is, and all this data is contiguous. If I had to do a join where these things are physically stored in different locations, say even in a distributed database on different machines, then I'm sending requests to different machines and then sort of putting it all back together. But if I can pack these things together on a single disk page, then it's one seek to go get the data I need, assuming an old hard drive, and then you fetch that one page and voila, everything's actually there.
But it's hard to make a decision about whether to do this or not if you're baking into your application code, at the logical layer, that this thing has to be embedded inside of this thing, and that thing has to be embedded inside of that thing. So if you go back to this JSON example that I had here, roughly the MongoDB query syntax you would use to execute this query, to do a lookup to find the item that somebody bought with item xxx, you would write it as a lookup on the customers collection and do a find on orders.orderitems (actually, they're missing the item id there) where you look up the item id. So now let's say I decide later on that I don't want to denormalize my relations, and I want to break out the order items and put them somewhere else separate. I could do this for software engineering reasons: it's easier to maintain the code if I can store different order items separately, and maybe I don't always want to store order items for every order, or for whatever reason I want something else to be able to create more order items. But I could also do this for hardware reasons: maybe I want to put the order items collection or relation on a machine with a really, really fast disk or a lot of RAM, and put the other relation on a slower machine, because maybe I don't update it that often. I can't do that easily, because I'm baking into my application code the actual logical schema, which dictates the physical schema.
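For contrast, here's a sketch of the relational version of that fetch, using SQLite with a made-up schema and data. The query states only the logical join; it says nothing about whether the three tables live on one page, one machine, or three.

```python
import sqlite3

# The three decomposed relations from the customer/orders/items example.
# Schema and data are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers   (customer_id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE orders      (order_id TEXT PRIMARY KEY, customer_id TEXT);
    CREATE TABLE order_items (order_id TEXT, item_id TEXT, price REAL);
    INSERT INTO customers   VALUES ('c1', 'Alice');
    INSERT INTO orders      VALUES ('o1', 'c1');
    INSERT INTO order_items VALUES ('o1', 'i1', 19.99), ('o1', 'i2', 5.0);
""")

# One declarative three-way join fetches a customer with all their
# orders and items, regardless of how the tables are physically stored.
rows = conn.execute("""
    SELECT c.name, o.order_id, i.item_id, i.price
      FROM customers c
      JOIN orders o      ON o.customer_id = c.customer_id
      JOIN order_items i ON i.order_id    = o.order_id
     WHERE c.customer_id = 'c1'
""").fetchall()
print(sorted(rows))
```

If the DBMS later decides to pack these tables together on disk, or split them across machines, this query does not change; that's the logical/physical separation the lecture is arguing for.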
So now let's say again that I normalize my database, and I move order items out. I have to go back into all my code and update everything to maybe not do .orderitems; I've got to change it to actually point at the collection I'm looking for. But under the relational model, with SQL, to do that three-way join I just define the answer I want, and at no point does this actually specify anything about how the data is being stored. So the customers table, the orders table, and the order items table could be on separate machines; I don't know, I don't care. Or what I could do is pack together, within a single page, a single record with the physical nesting, where for each record I have my orders, and then inside of them I have all the items that they purchased. So this is not a new idea, this idea of having logical normalization but physical denormalization to speed things up. As we'll see in a lot of cases throughout the semester, the IBM guys thought about this in the 1970s with System R. They actually did it the way I'm showing you here in the very first database system they built, in 1974. Turns out this is a huge pain in the ass to actually maintain, and they ended up abandoning it later on. But it shows up now in newer papers. So the Google Spanner paper, if you remember, the Google Spanner system is a relational database system, and they have a way to pack together related tables in the same page, so that when you want to do a join to get the nested information, it's one fetch to get the thing you need. So it's an old idea that's been around for a long time, but now it's actually showing up in newer systems. And it's difficult to do this in a NoSQL system if you don't have that abstraction between the logical layer and the physical layer. So again, the IBM System R project was very, very interesting. It's crazy how much of the stuff that newer systems are doing now they tried; if you go back and read the annals, or the
interviews with the early IBM researchers, they tried everything, and a lot of it didn't work because the hardware couldn't really support it back then; machines were really expensive, and it was hard to maintain. But now people can try these things in their newer systems. So what I want to say about normal forms: they exist, you should know about them, and you should be able to reason about them. They're important because if the performance of a particular database is not doing what you think it should, you can look at the design and make decisions about whether you should denormalize things or normalize things. Typically that doesn't happen that often, because if you break up a table into multiple tables, you may have to go back and change your application, but you can hide that with views, so maybe that's not a big deal. And then again, there's really no magic formula I can provide for you to walk through and say what's the right amount of normalization for your application; it typically depends. And again, as far as I know, when people design applications, nobody says, I want to reach BCNF, let's make sure we get that. You basically design it in Django or Ruby on Rails, and then whatever you end up with is good enough. Okay, all right, so that's it for theory for this course. Well, sort of: we'll talk about concurrency control theory when we talk about transactions, and that's actually, I don't want to say useful theory, but that's actually good stuff that's important to know. But from this point on, starting in class on Monday, we're really going to start focusing on what you actually need, from the ground up, to build a database system. So we're going to talk about storage management, and this is related to the first project that went out on Monday, and we're going to go through step by step and say here's all the things you need to build a classic disk-based database system. So the other thing I'll
point out too is that the first homework assignment, the SQL queries, is due tonight at midnight, and then after class we will be releasing the second homework assignment, which is a pencil-and-paper assignment on the normal forms, basically walking through the algorithms to understand everything we talked about. And for this you'll submit your PDF on Gradescope. So all of you, if you're enrolled in the course, you should have gotten an email last night saying that you're now part of our Gradescope course. If you did not get this email and you think you are enrolled in the course, then send me an email and we can add you. And the idea of Gradescope is basically that you upload your PDF, and then you'll get your grade and the assessment of your homework through the website. Okay, any questions? Did anybody have any problems getting project one to actually build on their machine? You, did you want to ask a question? Yes, this one. So the question is: I said that in the case of this example here, from an efficiency standpoint, it may actually be better to store the records packed together in a page. So that's true, but it depends, obviously, on certain things, right? Like, in this case here I only have one order, but what do I do if I have multiple orders? Maybe I have to pre-allocate space to make sure I can insert them; otherwise I have to copy the whole thing out and create a new entry. So I'm not saying this is magically always the right thing to do, but it does have the advantage that you don't have to do multiple page fetches. But what's the second part of your question? Your question is: is there a way to combine logical normalization with physical denormalization? So I think what you're trying to say is: there are benefits to denormalization versus normalization, so is there a way to do sort of a hybrid thing where you get the best of both? Is that what you're saying? Yes, okay. So I would say writing your queries in a
declarative way in SQL gives you the best of both worlds, because you don't care how the thing is actually physically stored. The database system can make that decision for you, and the database system is always going to be in the best position to figure out the most efficient way to store your data, because it knows what the data is and it knows what queries you're going to execute on it. So in this case here, I'm packing these things together, but that may actually be a bad idea, so the database system could decide that it doesn't want to denormalize like this and actually wants to normalize them. But then I don't have to change my SQL query; it still works the same. Whereas if I write it in a MongoDB query style, I do have to go change it. (Student: it looks like you have to normalize at the logical level.) Yes, the question is: can we denormalize at the logical level but then normalize at the physical level, can you sort of reverse the two? And the other question is whether it's possible for MongoDB to support SQL queries. Well, they would say no, actually. So a better example: Postgres supports JSON, and actually most major commercial databases support XML. XML databases were a big thing in the early 2000s, so all of them have support, and there's actually support in the SQL standard for XML. And with XML, again, you're essentially defining, in some ways, the physical layout of your data. So I think what you're asking is: can you have a document database run SQL queries on JSON? Is that what you're saying? The answer is yes. But I would argue that the less information about how the data is being denormalized that you expose in your application, the easier it is for you to actually maintain it. In some cases it's unavoidable: you have to split tables into two separate things, but again, you can hide these things with views. In some ways I'm being nebulous because there's no clear answer, and you can kind of do anything and everything. And I would say, in my opinion, this
is the better way to go, but most people don't write applications with raw SQL queries; you use an ORM that does it for you, so this all might just be a moot point in the end. As a relational purist, I'd say this is the better way, but we can talk more about it offline. Any other questions? Have a good weekend, and I'll see you on Monday.