So, today's lecture was originally going to be a single lecture, but when I was working over the slides this weekend, I realized it's hard to cram everything I wanted to talk about into a single class, so I ended up breaking it into two lectures. This is the last part of the course that will be about what I'll call "theory," in quotes. We're not talking about system implementation issues; we're talking about some high-level concepts. And the thing we really care about is what we'll cover on Wednesday, which is called the normal forms. The reason it's worth talking about the normal forms, even though this is supposed to be a systems course, is that this is one of the concepts the NoSQL guys use to argue why their approach is better than the relational model. There are two aspects to it: they're going to denormalize everything — you'll learn what that means on Wednesday — and they're going to give up some of the ACID guarantees of your transactions, which we'll learn about later in the semester. So, in order to understand and reason about whether their justification for denormalizing their databases is correct, you need to know what the normal forms are. But to understand the normal forms, you have to understand schema decomposition. And to understand schema decomposition, you have to understand functional dependencies — which is why we're going to spend today on functional dependencies, the first piece of background you need for the normal forms. It's a bit dry, a bit theory-esque, but again, I think it's important. So, the goal for this week is to understand and reason about what it means to have a good database schema, a good database design. And when I say good database design, I mean the high-level logical construct of our schema. I'm not talking about what data structures we use or what algorithms we're going to use to do joins and things like that — that's called physical database design, and we're not worried about that until later. You, as the application programmer, need to define the schema in order to store data in your database, so we're dealing with this at the logical level. There are two ways you can define the goodness of a schema. The first is that we don't lose any data — that we have sound database integrity, meaning we put data in and we get that same data out. The second is that we get good performance. We're not going to focus on performance in this class or on Wednesday; it'll come up in certain cases when we talk about assertions, but we're not worried yet about how fast our database can run based on our schema. We really care about the first issue, because it would be bad if you put data into a database and didn't get that same data out. Say you're a bank and you're storing everyone's account information: people get pissed when you start losing their money.
This idea of correctness versus performance will come up again when we talk about transactions. The relational database purists — and I consider myself in that camp — would say, yeah, of course you want to run everything with full protections and full data integrity; we'll worry about performance later. Whereas the NoSQL guys, for example, would say, oh no, no, we care about scalability, we care about performance, and we're okay if some of your data gets lost or some of your data gets dirty. So for today we're going to focus on the integrity side of things, but this will be a recurring theme throughout the semester: the tradeoff between the integrity of your database and the performance of the transactions you want to run on it. So let's look at a toy example. Say we have a database keeping track of all the students taking classes at our university, and we put everything into a single relation, a single table. The single relation, called students, has the student ID, the course ID they took, the room the class was taught in, and the name and address of the student. We just took everything our application could have and threw it into one table. So what are some problems with this? Yes? Right. We have data duplication here. First, we're duplicating the room number for the course over and over again, and then, as he pointed out, we're duplicating the Obama record: Obama took two classes, so we have to duplicate his name and his address. So this is an example of a relation whose properties we can reason about based on whether we have duplicate data and other issues. And this boils down to the idea of redundancy in our schema, redundancy in our database. Because we have redundant information, certain anomalies can occur. The first is an update anomaly: if the room number changes for a course, we need to go through and update the records of all the students that took that course. The database system can't do this for you automatically — you can finagle it, but there's no declarative way to say, update one record and propagate that update to all its copies, because the system doesn't know that the value of that attribute should be linked to values in other tuples. The second problem is that it makes inserts tricky. In the giant single table I showed before, every student had to be enrolled in a course, because I couldn't insert a student without a course ID. That means if you're a new graduate student or a new freshman showing up at our university, you're not even on the registrar's rolls until you register for a course. But of course, how can you register for a course if you're not a student? That's the problem here. And the last one is the deletion anomaly: if we delete all the students enrolled in a course, that course's information essentially disappears — I lose the room number, because the only records carrying it were those students' records.
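To make this concrete, the single denormalized table from the example would look roughly like this in SQL. This is just a sketch — the attribute names are my guesses from the lecture, not necessarily the slide's exact ones:

    -- Sketch of the single denormalized relation (names are illustrative).
    -- Every fact about students, courses, and rooms is crammed together,
    -- which is what produces the update/insert/delete anomalies.
    CREATE TABLE students (
        sid     INT,          -- student ID
        cid     INT,          -- course ID the student took
        room    VARCHAR(16),  -- room the course is taught in (repeated per student)
        name    VARCHAR(32),  -- student name (repeated per enrollment)
        address VARCHAR(64)   -- student address (repeated per enrollment)
    );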
So ideally, what we actually want to do is something like this: we'll split up — decompose — that giant single relation into subrelations, and they'll be linked together based on the student ID. This now provides the independence we need to avoid all the anomalies I just talked about. I can insert students into the student table without them being registered for a course. I can add a course to my list of rooms and provide a room number even though no student is enrolled in it yet. And I can record a student's grade independently of the other tables. The student table and the rooms table are now independent of each other, and we avoid these anomalies. Furthermore, we no longer have duplicate information: we only have Obama once, and if we need to update his address, we only have to update a single tuple. So this is called decomposition. And this decomposition of the original relation I showed you before is better, right? It's sort of obvious to us as humans looking at it — yeah, this is the better way to do it. But the goal for this lecture is to understand at a more fundamental level why it's better, to reason in a more principled manner about why this is better than the first one. That's what we're trying to achieve today. To do this, the first thing we need to cover is functional dependencies. These are essentially constraints between attributes within a single tuple. From there, we can compute what's called the canonical cover, which is the minimum set of functional dependencies you need to reason about whether an instance of a schema is correct. And then — I'll keep track of time, because I want to spend some time at the end talking about the first project, which is going out today — we'll cover using functional dependencies to perform decompositions of schemas. If we cover everything, great. If not, it'll roll over into Wednesday's class, where it'll lead into the normal forms. All right. So a functional dependency is part of the relational model. It's essentially a form of constraint that says the value of one attribute is implied by the value of another attribute. You write it like this: X implies Y, X → Y. This says that the value of attribute X for a tuple functionally determines the value of attribute Y. More formally, you can write it like this: if two tuples have the same value for X, they have to have the same value for Y. Let's look at an example. Say we have a relation with just the student ID, the name, and the address, and the student ID is the primary key. So we know we have a functional dependency from student ID to name. This says that the value of the student ID attribute implies the value of the name attribute in the tuple. Right?
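The "more formally" version referenced there is the standard definition. For a relation instance r over schema R, with attribute sets X, Y ⊆ R:

    X \to Y \;\iff\; \forall\, t_1, t_2 \in r:\; t_1[X] = t_2[X] \implies t_1[Y] = t_2[Y]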
In this case it's sort of trivial, because the student ID is the primary key. There can be one and only one tuple with a given student ID in the entire relation, and each tuple with that student ID has a single name. So again, the student ID implies the name. But you may be thinking: all right, if I look at this particular example, I can see other possible functional dependencies. This one is valid, but maybe I could also have something like name implies address. Certainly for this instance of the relation — remember, the schema defines the logical conception of a relation, and an instance of a relation is the actual tuples stored in the table — this functional dependency holds, because there are only four students, they all have unique names, and they all have unique addresses. But we can't assume it will hold for every possible instance of the relation, because a new student might come along — say it's me, back in grad school. I insert my record with my name, Andy; I have a different student ID, so that's valid; and, for whatever reason, I have my old address back on the east coast. Now that functional dependency no longer holds, because there are two tuples with the name Andy, each with a different address. So in actuality, we can't make any claims about functional dependencies just by looking at a single instance of a relation. We have to be told that the schema has a given functional dependency. And we'll use some logic later on to extrapolate additional functional dependencies, but we want to follow the rules for how we do this so that we don't derive incorrect things. There are some tricks, which we'll see in a second, for deriving additional functional dependencies, but some are easy: if x implies y and x implies z, you can write that as x implies yz. So you can combine things on the right-hand side. But you can't do the reverse on the left-hand side: if x and y together imply z, it's not correct to say x implies z and y implies z, because you need the unique combination of x and y together in order to imply z. So you can split the right-hand side, but you can't split the left-hand side. All right, so these seem kind of interesting. Yes? His question is: if I say x implies y, does that implicitly mean x is the primary key? No. These are functional dependencies that exist outside of what you define as the primary key. We'll talk about indexes later, but you can define unique indexes that aren't the primary key. We're not even talking about primary keys at this point; it's just saying that this functional dependency has to hold. Now, often what you define as your functional dependencies will end up forming the primary key, but not always — and to be more exact, they'll form a candidate key or a super key, which we'll get to in a few slides. All right — so as I said, functional dependencies are part of the original relational model.
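Before we get to the SQL side, here are those combining and splitting rules in standard notation — the right-hand-side rule works in both directions; the left-hand-side one does not:

    X \to Y \text{ and } X \to Z \;\iff\; X \to YZ
    XY \to Z \;\not\Rightarrow\; X \to Z \text{ (or } Y \to Z\text{)}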
And in the SQL standard, you can actually define constraints or assertions that act the same way as a functional dependency. So take our simple functional dependency, student ID implies name. In the SQL standard, we call create assertion, and then we have a check clause where we specify the query to run any time we want to validate the instance of our database. As long as the check clause evaluates to true — which here means the inner query finds no violating rows — we know that the instance of our relation, our table, is valid. And you can extend this to do more complicated things. Say you want student ID implies name and student ID implies address: we just add an additional disjunctive clause to the where clause inside the check statement, to make sure that if there are two tuples with the same student ID, they don't differ in name or address. And again, every single time you insert, update, or delete a tuple, the system runs this check to see whether your instance is valid, and throws an error if it's not. Yes. His question is: is this notably different from using a unique constraint? The answer is yes, because this can be an arbitrary query. I'm showing a simple example that only queries one table, but it can be anything — multiple tables, joins, whatever you want. It always has to return true any time you modify the database. It's a good question. All right, so there are two problems with these assertions. Anyone want to guess what they are? In the back, yes. Excellent, yes. He said it has to run through the entire table every single time to check whether the instance is valid, and he's absolutely right. If I update a single tuple in the students table, I have to run this query again to check whether the constraint still holds — whether I'm even allowed to make that change. This one only accesses a single table, but if you were doing a join, like the example he brought up, now you've got to join between two or more tables. You could try to be a little smart about this and say, well, this is only a query on the students table, so I'll only run the assertion when somebody inserts into or updates the students table; if you update the other tables, I don't care. Or you could be smarter still and only run it for the tuples that actually got modified. Yes, you could try that, but you have to be pretty sophisticated to make it work, and it's hard to do for other things — if you have an aggregation, you have to run it over everything. Remember, I can put any arbitrary query I want in this check statement. Anyone want to guess the second problem? It has to be a select statement, and it has to return true — yes, UDFs could get around that, but we're not there yet. Anybody else? Right — nothing actually supports this. It's in the SQL standard; here's the standard from 1992, and it says, yeah, it's great, here's an assertion, here's what it can do. But as far as I know, no database system actually supports that create assertion syntax. If you Google this, it'll show up in a textbook written around 1999 about the SQL-99 standard, saying, oh, here's all the syntax to do this.
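That syntax looks roughly like this for student ID implies name and address — a sketch of the SQL-92 form which, again, no major system will actually accept:

    -- Hypothetical SQL-92 assertion enforcing sid -> name (and sid -> address).
    -- The check passes as long as no two tuples agree on sid but differ
    -- in name or address.
    CREATE ASSERTION sid_determines_name
    CHECK (NOT EXISTS (
        SELECT *
          FROM students AS s1, students AS s2
         WHERE s1.sid = s2.sid
           AND (s1.name <> s2.name
                OR s1.address <> s2.address)  -- drop this line for sid -> name alone
    ));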
There was an online poll run by Oracle on their DBA website in 2016 asking, hey, we're thinking about adding these global assertions — would you actually want them? Of course everyone said yes, but I don't think it ever got implemented. Again, as far as I know, no database system actually supports this. You can finagle it using unique constraints or triggers and other things, but the create assertion syntax I showed does not actually work anywhere. The closest thing I could find is in DB2, where in the create table syntax you can add a constraint specifying that the value of a single attribute is determined by another single attribute. So in this case, to define the functional dependency student ID implies name, I define my constraint and say check that the value of name is determined by the value of student ID. Now, you could also do this the way he talked about — make a unique constraint over student ID and name together, so you always have a unique combination — but it's not exactly the same. So again, these are nice to have, but you can't really use them widely. So now the question is: why do I care? If you can't use this in SQL, why am I wasting your time talking about it? Well, as I said at the beginning, we're going to need these functional dependencies to determine whether our database design is correct. By database design, I mean taking the relation we're given and deciding how to decompose it into subrelations. And this is a completely separate decision from whether we're actually going to get good performance; we care about the integrity of our data. So, given a set of provided functional dependencies, we want to derive what are called implied functional dependencies. I showed that example before where I said you could map the name to the address, and I said that was not valid, because it may look correct for that particular instance of the database, but it's not correct for all possible instances of that schema. But with implied dependencies, we can rely on our provided dependencies, extrapolate new ones, and condense them down to a simpler form. So we're given, again, student ID implies name and address — and by the splitting rule, that's the same as student ID implies name and student ID implies address; the combined form is just shorthand notation. And we have student ID and course ID together imply the student's grade in the course. As I'll show in a second, you can derive additional functional dependencies from these: if student ID and course ID imply grade, then you also have trivial things like student ID and course ID imply student ID, student ID and course ID imply course ID, and so on. And we're going to use these to reason about our decompositions later. So now, given a set of functional dependencies, we want to decide whether some new functional dependency is valid for that set, for that schema. For this, we compute what's called the closure. Think of this as an explosion of all possible implied dependencies given our provided functional dependencies. And we're going to use what are called Armstrong's axioms to do this.
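For reference, here are the three base axioms; the other three rules on the slide — union, decomposition, pseudotransitivity — can all be derived from these:

    \text{Reflexivity:}\quad  Y \subseteq X \implies X \to Y
    \text{Augmentation:}\quad X \to Y \implies XZ \to YZ
    \text{Transitivity:}\quad X \to Y,\; Y \to Z \implies X \to Z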
So I don't want to go through the nitty-gritty details of how all this works — all of you should have taken a basic logic or discrete math course at some point in your life, and these are pretty obvious, pretty straightforward. If you have x implies y and you throw z in there, then xz implies yz. Simple things like that. We're going to use these six rules to take our provided functional dependencies and compute the closure, the implied ones. So let's do this. Say we have a set of functional dependencies F, defined by f1 through fn. We define what's called the closure, written F+: we take our schema and our provided functional dependencies and derive all the implied ones using Armstrong's axioms, the rules I showed on the last slide. Again, pretty straightforward. The combined total of all the implied functional dependencies is the closure. Now you're asking: why do I care about the closure? Well, once you have the closure of the functional dependencies on a relation, you can compute what's called the attribute closure. The attribute closure of an attribute set x is the set of all attributes that can be inferred from x in that schema. So if my tuple has attributes a, b, and c, and based on the functional dependencies in the closure I can get a, b, and c from x, then all of those are in the attribute closure of x. To check whether x implies y holds, we compute the attribute closure of x and then check whether y is inside it. So why do you care about the attribute closure? As we said earlier, checking all the constraints or functional dependencies every single time we modify a relation or table is really expensive. Ideally, we want a minimal set of functional dependencies that we have to check and enforce to make sure our relation is correct. The idea is that you may be given a bunch of functional dependencies — think of your database designer or application developer saying, here's the database we want to set up to store data for our new application, and here are all its functional dependencies. But there may be redundancies, extra stuff you don't actually need to care about. So you compute the closure and then find the minimal set of functional dependencies you need to enforce, so you're not wasting time checking things you don't care about. Question in the back? His question is: can you have circular dependencies? Yes — we'll see that in a second. But when you compute the minimal set, which is called the canonical cover, you will remove those circular dependencies. So: the minimal set of functional dependencies derived from the closure is called the canonical cover, written Fc.
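One thing worth writing down: the attribute closure computation from a couple of minutes ago is just a fixpoint loop, and it gives you the membership test for the closure:

    X^{+} := X
    \textbf{while } \exists\,(A \to B) \in F \text{ with } A \subseteq X^{+},\; B \not\subseteq X^{+}: \quad X^{+} := X^{+} \cup B
    \text{then: } X \to Y \in F^{+} \iff Y \subseteq X^{+}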
So, given a set of functional dependencies F defined here — these four are what we're provided — we want to come up with the minimum we actually need, and using Armstrong's axioms, we can boil them down to just two. It's pretty simple to see why. Take the third one: given the student ID and the name, I get the name and the address. If you have the student ID, you already have the name, so the name on the left is extraneous; and splitting the right-hand side into single-attribute functional dependencies, it's just student ID implies name and student ID implies address — which is the same as what's up here. Yes. His question is: for a given relation with a set of functional dependencies, will the canonical cover always be the same? My hunch is yes, I think yes — you can keep repeating the process we'll show in the next couple of slides, refining it down to a smaller set, and I think you should always end up with the same thing. Is that true? Yeah, I think that's true. Okay. All right. So how do you compute the canonical cover? Well, the first thing is to understand what properties have to hold for our canonical cover. For shorthand, I'll say right-hand side and left-hand side, as in this example. The canonical cover is one where the right-hand side of every functional dependency in the set is a single attribute, and where the closure of the canonical cover is identical to the closure of the original provided set. Again, the closure is what you get when you use Armstrong's axioms to explode the number of functional dependencies; the canonical cover must be one where, if you take the closure of the canonical cover and the closure of the original set, they're equivalent. And we say the canonical cover is minimal because if we eliminate any attribute from the right-hand side or left-hand side of any functional dependency in it, its closure will no longer be equivalent to that of the original set. It's a bit dry, but we'll walk through an example. The way we compute this is: for every functional dependency in our set, we drop extraneous attributes on the left-hand side or the right-hand side, remove any redundant functional dependencies, and make sure every functional dependency has only a single attribute on the right-hand side — and we keep doing this over and over until we can't anymore. So let's walk through an example. Say we're given these four functional dependencies: ab implies c, a implies bc, b implies c, and a implies b. The first step is to split up any functional dependency with two attributes on the right-hand side. Again, we said we could do this by Armstrong's axioms: a implies bc splits into a implies b and a implies c; those are equivalent. So now we have five functional dependencies, and none of them has more than one attribute on the right-hand side. Now we apply our other rules to find redundancies. The first one is obvious: we have the functional dependency a implies b twice, so we can go ahead and remove one of them. That's pretty easy. Then we come back around, and now we see we have the circular dependency he was asking about before.
We have a implies c, and we also have a implies b and b implies c. If you have a, you can get b; and if you have b, you can get c. So it's trivially true that if you have a, you can get c, and we can remove the first one, a implies c. Now the same idea for the left-hand sides: we want to remove extraneous attributes from any functional dependency with more than one attribute on the left. Here we have ab implies c, and below it we already have b implies c. So the a in ab implies c is extraneous — b alone already implies c — and we can remove the a from the first one. And now we obviously have another pair of redundant functional dependencies, so we remove the first one. Can we go any further? No. So we can go through the checklist of properties we defined for the canonical cover. First, there's nothing extraneous — no redundant functional dependencies. Second, the right-hand side of every functional dependency is a single attribute; we don't have something like b implies cda — they're all single values. And last, if we explode our final set and take its closure, it will be equivalent to the closure of the original set of functional dependencies we were provided. We're not going to verify that here — take my word for it. Once you've satisfied these three properties, voilà, you have the canonical cover. In the back, yes — raise your hand higher so I can see it. So his question is: if you have a implies b and b implies a, can you remove one of them? Actually, you're right — you have to keep both of those. Does that make sense? In terms of what — the application? So there are really two questions. First: is having a implies b and b implies a allowed? Yes. Second: does it make sense to have? That depends on the application. We're dealing with this in an abstract sense; we don't know what the actual data in the database is, and we don't know what they're trying to model in their application. Technically it's correct. Think of it like this: I have one email address and one social security number. If you have my social security number, you can get my email address; if you have my email address, you can get my social security number. That's correct. Question? Okay. So again, in terms of whether these functional dependencies are the right thing to have, we don't care — the application developer, the designer of the database, told us this is what they wanted our database system to enforce. Then it's up to us, the people who actually build the database system, to enforce it for them in an efficient manner. Though as I said, no database system supports assertions, so nobody does. Okay. So I'm seeing a lot of blank faces — everyone's sort of going through the motions with me.
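For the record, here's that whole reduction in one place. (I've written the first provided dependency as ab → c, which is what the left-hand-side step operates on.)

    F = \{\, ab \to c,\; a \to bc,\; b \to c,\; a \to b \,\}
    \text{split } a \to bc: \quad \{\, ab \to c,\; a \to b,\; a \to c,\; b \to c,\; a \to b \,\}
    \text{drop duplicate } a \to b: \quad \{\, ab \to c,\; a \to b,\; a \to c,\; b \to c \,\}
    \text{drop } a \to c \text{ (transitivity via } a \to b,\; b \to c\text{)}: \quad \{\, ab \to c,\; a \to b,\; b \to c \,\}
    a \text{ extraneous in } ab \to c \text{ (since } b \to c\text{)}: \quad \{\, b \to c,\; a \to b,\; b \to c \,\}
    \text{drop duplicate}: \quad F_c = \{\, a \to b,\; b \to c \,\}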
So why do you actually care about the canonical cover? Well, as I said, the canonical cover is the minimum number of assertions we need to maintain and enforce in our database to make sure the data is correct. And as he sort of said, we don't care whether the data is truly correct in the grand scheme of things, in reality — it's correct in the context of the relation and the functional dependencies they defined. So now, if you have the canonical cover, you can find what's called the super key. This is not a term I made up; it's a real term, again defined with the relational model in the 1970s. I'm going to go through three different definitions of the super key to try to hammer this down — if you Google super key, you'll find a bunch of different things, and it's hard to distill exactly what it is. So I'll go through the mathematical version and finish up with the systems side of things, what it looks like in practice. A super key is any set of attributes in the relation that can be used to functionally determine all the attributes of a tuple in that relation. The most trivial super key is all of the attributes, because all the attributes together functionally determine the values of all the attributes. Related to this is the candidate key, which we sort of talked about before: a candidate key is a super key where, if you remove even one of the attributes, you can no longer functionally determine all the other attributes. So what's the difference between a super key and a candidate key? To put it in other terms: a super key is a set of attributes such that no two distinct tuples will have the exact same values for those attributes. A candidate key is a set of attributes that uniquely identifies a single tuple and is minimal according to the functional dependencies or key constraints. So a candidate key is a super key, but not all super keys are candidate keys. Again, more blank faces. Let's try again. The super key is a set of attributes that uniquely identifies a tuple — it can be one attribute, it can be all of them, anything. The candidate key is a minimal set of attributes that can uniquely identify a single tuple. So if I have 10 attributes, one super key is all 10 of them, but a candidate key could be just one — and that's enough for me to identify the tuple and get the values of all the others. And the primary key is essentially just a chosen candidate key. Yes. His question is: why do you need a super key? I don't know — this is what the math says. The relational model paper says you have super keys and candidate keys and they're distinct. His question is: is a candidate key a subset of a super key? Yes — it's the minimal one. If you take away one attribute from a candidate key, you can no longer determine everything else. Trivially, if you have a single-attribute candidate key and you remove that attribute, you can't determine anything — that's the extreme case.
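To pin the definitions down in one place, for a relation schema R with functional dependency set F:

    K \text{ is a super key of } R \iff K \to R \in F^{+}
    K \text{ is a candidate key} \iff K \to R \in F^{+} \text{ and } \forall A \in K:\ (K \setminus \{A\}) \to R \notin F^{+}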
In the example before, where I had the student ID and the course ID to find all of a student's information, if you remove the course ID, you can no longer uniquely identify an instance of a student taking a course. Yes. His question is: can you have multiple candidate keys? Yes. Okay. But why do you care about super keys? Again, they're going to be used to determine when it's okay to decompose — split up — a table into sub-tables, and they're going to allow us to ensure that we can always go back and get the original data we had in our relation. We need them to reason about whether our decomposition, or normalization as it's called — splitting up these tables — was the right thing to do. Yes. His claim is: if you have a primary key in the relation, you can't have multiple candidate keys. No — because, again, you can define unique keys that are independent of the primary key. We can pop open Postgres and do it right now; they're separate. The primary key — as I said at the beginning, I don't think it's actually defined in the original relational model, but in practice this is what everyone does. There's no super key or candidate key syntax in SQL, only primary keys and unique keys, and in practice the primary key is sort of one anointed candidate key that's more important than the others. Yes. You can have two unique — let me draw it on the chalkboard — you can have a candidate key with three attributes and another candidate key with another three attributes, and both can uniquely identify the tuple. Right, and then you pick one of those, and that's the primary key. Makes sense? And again, this is all defined by the application: we're given the functional dependencies, we're given the schema, and we don't care whether that's sound or correct from a common-sense standpoint. It's whatever the application told us it wanted. Okay, so we care about super keys because we're going to use them to decompose our tables into multiple sub-tables, and then to figure out whether we can put things back together correctly. Yes, in the back. His question is: can a candidate key span multiple tables? Yes — we'll get to that in a few slides. Correct, yes. His statement is that the unique keys you define in your table are candidate keys. So, does everyone know what a unique key is? All right, he's shaking his head no, so we can pop open Postgres again. Everyone see that? All right, so I can do things like create — actually, what tables do I have here? — all right, I'll do create table test, and it has an id that's an int and is the primary key. Then I'll have a value that's an int, and I'll say it has to be unique. That means no two tuples can have the same primary key or — sorry — the same value. I can also try to create a constraint saying the combination of id and value together has to be unique — I think it's this; my syntax might be wrong — yeah, that's not valid. So let's just do this. Now, if I insert into test the values (1, 2), that's allowed. If I do (1, 3), will that fail? Yes or no? It should fail, right? Yeah — duplicate key, violates the primary key constraint. Now if I do (2, 3), that's allowed. And if I do (3, 3), does that fail? Right — because it violates the unique constraint on value.
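Here's a cleaned-up version of what I was typing, including the multi-column constraint syntax I fumbled live:

    -- Cleaned-up version of the live demo (Postgres syntax).
    CREATE TABLE test (
        id  INT PRIMARY KEY,
        val INT UNIQUE
    );

    INSERT INTO test VALUES (1, 2);  -- ok
    INSERT INTO test VALUES (1, 3);  -- fails: duplicate id (primary key)
    INSERT INTO test VALUES (2, 3);  -- ok
    INSERT INTO test VALUES (3, 3);  -- fails: duplicate val (unique constraint)

    -- A unique constraint over a combination of columns:
    ALTER TABLE test ADD CONSTRAINT id_val_unique UNIQUE (id, val);

    -- Or make the combination the primary key itself:
    DROP TABLE test;
    CREATE TABLE test (
        id  INT,
        val INT,
        PRIMARY KEY (id, val)
    );
    INSERT INTO test VALUES (1, 3);  -- ok
    INSERT INTO test VALUES (1, 3);  -- fails: duplicates the composite primary key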
All right, so I can also define a unique constraint over multiple columns — I forget the exact syntax off the top of my head, but take my word for it (it's in the sketch above): there's a way to define a unique constraint that's a combination of multiple attributes. And actually, you can do this for the primary key too. So if I drop the table and redefine it so that the combination of id and value together is my primary key — primary key (id, val) — then if I go back to my example and insert, say, (1, 3), that's okay, but inserting (1, 3) again will fail, because it duplicates our combination primary key. So take my word for it: you can also define a non-primary-key unique constraint that's a combination of the two. Now, if they look the same and sort of act the same way, why is a primary key different from a regular unique key? It depends on how the database does things internally — we won't talk about this yet, but we will later. In the case of MySQL, secondary indexes actually store the primary key as the pointer to the tuple you want. So whatever you define as the primary key gets embedded as the value in every secondary index, and when you do a lookup in a secondary index, it takes the primary key and then does a lookup in the primary key index to find the actual tuple. So it's more than just semantics — it's not that they're doing the same thing and you shouldn't care; internally, the database system will do different things. But that's not defined in the SQL standard; that's all up to how the system was built. Okay, so now we can do schema decompositions. A decomposition is where we split a single relation into multiple subrelations, and we want to reason about the properties of that decomposition, because not all decompositions are a good thing or something we actually want to do. Remember, at the beginning of class I talked about the three anomalies — insert, update, and delete — that we want to avoid. But there's also this extra issue of wasted space. In that first example, I showed how we had duplicate Obama records: we were duplicating Obama's address every single time. Now we've split our relation into subrelations, and we need a way to know these things are linked together — we need to store the super key in each one in order to combine them and get back the original data. But that may mean we're storing a large super key over and over again in these different relations, which may end up wasting more space than we did with the duplicate records. So there are three goals we want for our decomposition. The first is lossless joins, which basically means we don't lose any data when we join the tables back together. If you think of lossless versus lossy compression algorithms, it's not exactly the same thing, though:
the problem isn't just that tuples go missing — it's that we can introduce incorrect data and lose the original form, the original content, of our database. Second, we want to preserve dependencies. This is related to the question he had earlier: we want to make sure no dependency spans multiple relations, because then, in order to enforce those constraints, we'd have to do a join, which is expensive. We want to minimize the cost of enforcing our global constraints based on our functional dependencies — though as I said before, nobody actually implements assertions, so it's a moot point in some ways. And the last goal is that we want to avoid redundant information, because that just makes our database larger: it takes more space in memory, more space on disk, and essentially costs more. So of these three, which one do you guys think is the most important? Lossless joins — why? Right: you don't want to lose data. So this one is mandatory. And when I say mandatory, I'm again speaking as a relational model purist — I care about this a lot, and most people would say, yeah, I probably care about that too. The other two are nice to have but not required. We'll see this more when we talk about the normal forms: you can come up with a decomposition that has lossless joins but doesn't do so well on the other two. And that last one, redundancy avoidance, is actually kind of hard to pin down — there's no exact metric that says, yes, I have zero redundancy, or low redundancy. It's left to your discretion as the database designer to decide what's the right amount of duplicate data to allow. Dependency preservation — again, nobody actually enforces these, so we don't care that much. But the first one can matter a lot depending on your application. And we'll see this again later in the semester when we talk about transactions, because they make exactly the same tradeoff. The ones at the bottom are nice to have and are really about performance; the one at the top can make your database run slower, but your data is correct. When we talk about transactions, it's the same tradeoff: we can run the queries in our application in a way that guarantees everything is correct no matter what happens — if we crash, or if two guys try to update the same thing at the same time, it will always be correct. But in a lot of applications, people maybe don't care that much about correctness, and actually, by default, most systems don't run with full protection — most of you don't know that, right? So in the context of schema design, the first property is really important; the other ones we'll talk about, and if you achieve them, great — if not, it's not a big deal. All right, so let's look at what it means to have a lossless decomposition. For this, I want to use another relation, another example.
This time we'll model a bank. We start with a single relation that has information about all the loans the bank has made: the branch name, the city the branch is located in, the current assets of that branch, and then the customer name, the loan ID, and the amount of the loan. And we're provided with the functional dependencies seen up here: the branch name implies the city and the assets, and the loan ID implies the loan amount and the branch name. Say we want to decompose this relation — it's inefficient to have all this redundant information; you can see we're duplicating the branch name, the city, and the assets over and over again. So we want to decompose it in a way that reduces this redundancy. For this, let's just pick the customer name as the example. We split our single relation into two separate relations that look like this. Now, if we take these two relations and do a natural join on the customer name to put them back together, we end up with a table that looks like this. And what's obviously wrong about it? Exactly right: there are two records that didn't exist before — a loan I took out in Compton for $1,000, and a loan in Pittsburgh for $500 — because we're doing the natural join on the customer name. So this is an example of a bad decomposition. We're not even talking about violating the functional dependencies at this point; we just know that if we do a natural join to put these back together, we get garbage. So this is bad. Now say we decompose on the branch name instead, so the branch name in the first table can be used to join with the second table. We have the same problem: when we do our natural join, we get three records that didn't exist before, which is garbage data. So that's an example of a decomposition that's not lossless. Here's one that is lossless: if we split them up based on the loan ID, then when we do our join and put it back together, we get exactly the same table we had before. This is what we want to achieve when we decide how to split things up, and it works because the loan ID is our super key. This is good; we want this. All right, the next thing we care about is dependency preservation. We say a schema preserves its dependencies if each functional dependency does not span multiple relations. And the reason this matters, as we said, is that if we actually had true assertions, we'd have to do a join between two tables every single time either of those tables is modified. And yes, we could implement this at the application level by running these queries every single time to check whether things are correct, but that would be even more inefficient. That would be bad to do. So let's go back to the lossless decomposition we had before, split on the loan ID. We see that the branch name implies the branch city, and to enforce that functional dependency, we'd have to do a join from the first table to the second table. Right?
And that's a bad thing. The same goes for loan ID implies the amount and the branch name: we'd have to go back in the other direction and do a comparison that way. So this decomposition is technically not dependency preserving, because the scope of a functional dependency exceeds one relation. To test this — and this is where the canonical cover and the closure come into play — we want to mathematically check whether a decomposition preserves the functional dependencies. We first compute the closure of the given set F. Then we compute a new set of functional dependencies, G, that's the union of all the dependencies covered by each subrelation. Then we compute the closure of G and check whether it's equivalent to the closure of F. If it is, we know we're preserving our dependencies. Let's walk through an example. We have two relations, R1 and R2: R1 has attributes A, B, and C, R2 has C and D, and we have three functional dependencies in our set. So the first thing we do is use Armstrong's axioms to compute the closure of F. Then we compute the new set of functional dependencies G, where the first part is the functional dependencies covered by R1 — and by covered, it just means the relation has all the attributes that the functional dependency needs. So for A implies B, R1 has both attributes A and B, so it's covered; same for C and D in R2. Then you compute the closure of G and check whether it equals the closure of F. In this case it doesn't, because there's a functional dependency, A implies D, that's in the closure of F but not in the closure of G. Therefore this decomposition is not dependency preserving. Take another example: now, instead of having C in relation R1, we put in D, and we do all the same steps as before — I'm not going to go through all the detail. At the end you'd find that the closure of F is equivalent to the closure of G, so this decomposition is dependency preserving. All right, the last one is redundancy avoidance. Basically, if you have a functional dependency X implies Y that sits within a subrelation, and X is not a super key, then you're considered to have some duplicate data you don't need, because you could derive it from another subrelation in your decomposition. I'm not going to go through an example — just take my word for it; this is what it means to have redundancy. All right, and because this will show up on the exam: there are some nice slides that, for each of those three properties we care about, give you why we want it, what you're trying to achieve, and the test or algorithm you can use to check it — lossless joins, dependency preservation, and redundancy avoidance. I'm going through this very briefly, but you can read more about it in the textbook, and it'll show up in the homework we give out on Wednesday.
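For reference, here's a compact version of two of those tests. For a binary decomposition of R into R1 and R2, the standard lossless-join test is that the shared attributes must functionally determine one of the two pieces — in the bank example, R1 ∩ R2 = {loan ID}, and the loan ID determines everything in its piece, which is why that split worked. For the dependency-preservation example, the three provided dependencies weren't named above; a set consistent with the walkthrough would be F = {A → B, C → D, A → D}:

    \textbf{Lossless join:}\quad (R_1 \cap R_2) \to R_1 \in F^{+} \;\text{ or }\; (R_1 \cap R_2) \to R_2 \in F^{+}
    \textbf{Assuming } F = \{A \to B,\; C \to D,\; A \to D\}:
    R_1(A,B,C),\, R_2(C,D):\quad G = \{A \to B,\; C \to D\},\quad A \to D \in F^{+} \setminus G^{+} \implies \text{not preserved}
    R_1(A,B,D),\, R_2(C,D):\quad G = \{A \to B,\; A \to D,\; C \to D\} = F \implies \text{preserved}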
Okay, any questions? Okay, so that's all I have to say about functional dependencies at this point. They'll show up again — these decompositions will show up again — on Wednesday when we talk about the normal forms. The normal forms are different decomposition levels you can have for your relational schema, defined based on the three properties I showed: lossless joins, dependency preservation, and redundancy. In practice, you try to achieve one of two particular normal forms; there are ones below them and ones above them, but nobody really cares about those too much. So what I talked about today will let us reason about our decompositions and check what normal form a particular relational schema is in. Any questions about functional dependencies? In the back, yes. His question is: what's the complexity of determining whether something preserves a functional dependency — like the asymptotic complexity? It depends. It's not exponential, but it's not necessarily linear either. It depends on the functional dependency you're trying to check, and sometimes on the access method you use and how your database is physically designed. If you have to scan every single tuple, it's linear in the number of tuples you have. If there's a B+tree, for example, that's log n, so sub-linear. If you're joining a bunch of things together, that has a different complexity. But again, I think the reason the major database vendors don't support assertions is that you could put any arbitrary query in there, and they'd have to run that query every single time anybody modifies the table — or possibly any table in your database — and that would be really expensive. This will come up when we talk about transactions: the longer your transaction runs, the more conflicts you'll have and the slower your database will go. That's typically why they don't want to do this. All right. We've got to get through this, and through normal forms on Wednesday, and then we can get to the good stuff — actually building the database system. But I think this is important, because you have to see normal forms at some point in your academic career: the question of the right way to design a database will probably arise at some point in your life, and then you can say, oh yeah, normal forms — let me go read the Wikipedia page, because we talked about this ten years ago. Okay. All right. So in the last 10 minutes, let's talk about Project One. Project One is going out today. The website is up; after class, we'll post the actual tarball you can download with the source code. All the projects in this course build toward one ultimate goal: you're going to build your own storage manager for your database. Guys, there's too much murmuring. Okay. Sorry. I don't know whether you're like, oh my gosh, this is terrible, or oh my god, this is so exciting, we're going to work on SQLite. You can complain afterwards. Okay. All right. So the goal of this semester is that you're going to build your own storage manager. The storage manager is the thing that actually stores the database and makes sure that if you pull the plug on the system, you can come back and everything's still there.
So the idea is that you're going to build four projects, each building on the last and expanding what your storage manager can actually do. For the first project, you're basically building the buffer pool manager. The way you build a buffer pool manager is that you need a hash table and you need some way to decide how to move pages in and out, so you're going to build those subcomponents. All the projects will be done in the context of SQLite, which is the most widely used database in the world. It's on everything — it's on your phone, it's in space, it's everywhere. You're not going to be hacking on the real SQLite code, because, beautiful as it is, it's very complicated. But SQLite provides a nice API that lets you use it as a front end and have it call down into the storage manager you're going to build, to run queries, get data, and update things. So the project is going out today. I bumped the due date to October 2nd — because I split the normal forms lecture into two classes, everything got shifted by one. Just be mindful of that. There are three components you need to build, and you roughly want to build them in this order. Again, please — hey. I don't know, is it excitement, or are you angry about this? Is it excitement? Okay, good. All right. The first thing you need to build is an extendible hash table. Extendible hashing, which we'll cover I think next week, is a type of hash table that can grow in size, so you don't have to define exactly how many elements you're going to store ahead of time — it has a method for expanding itself. You want your hash table to store unordered buckets of key-value pairs — that should say unordered; let me fix that real quick — and you'll follow the extendible hashing algorithm to grow the hash table as you insert new elements. You don't need to support shrinking, because that's actually nontrivial to do; you don't have to worry about compacting and giving back memory as you delete things — only about growing. And for this — yes, you could just use the STL's std::unordered_map in C++, but we want you guys to actually build your own. You don't have to implement your own hash function, though; you should use std::hash. You also need to make sure your hash table is thread safe, because you can have multiple threads accessing your buffer pool manager and asking for entries. For this, you can use std::mutex — for this course, that's fine. If you take my advanced course in the spring and use std::mutex, it will fail you, because they're actually really slow. I warned people about this for years and they didn't believe me, and we're still trying to remove them from the other database system we're building. The next thing you need to build is what we'll call the LRU replacer. Basically, this is the component in the system responsible for keeping track of how pages are used, and we're just going to use least recently used — LRU.
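To give you a feel for the idea — this is not the project's required interface, just a minimal sketch of an LRU structure using STL containers and std::mutex, with names I made up:

    // Minimal LRU sketch: a doubly linked list ordered by recency plus a
    // hash map from page id to its list node. Not the project's interface.
    #include <cstdint>
    #include <iterator>
    #include <list>
    #include <mutex>
    #include <optional>
    #include <unordered_map>

    class LRUReplacer {
     public:
      // Record that a page was just used (move it to the most-recent end).
      void RecordAccess(uint32_t page_id) {
        std::lock_guard<std::mutex> guard(mutex_);
        auto it = index_.find(page_id);
        if (it != index_.end()) {
          order_.erase(it->second);
        }
        order_.push_back(page_id);
        index_[page_id] = std::prev(order_.end());
      }

      // Pop the least recently used page, if there is one.
      std::optional<uint32_t> Victim() {
        std::lock_guard<std::mutex> guard(mutex_);
        if (order_.empty()) {
          return std::nullopt;
        }
        uint32_t victim = order_.front();
        order_.pop_front();
        index_.erase(victim);
        return victim;
      }

     private:
      std::mutex mutex_;
      std::list<uint32_t> order_;  // front = least recently used
      std::unordered_map<uint32_t, std::list<uint32_t>::iterator> index_;
    };

Note that pinning stays outside this structure, which matches what comes next: the replacer only learns that a page was used and hands back eviction candidates; the buffer pool manager decides what's actually evictable.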
The idea is that when you run out of memory and you need to take a page, evict it from memory, and write it back out to disk, which one should you pick? This component will tell you to evict the page that hasn't been used for the longest amount of time. We'll discuss the different buffer pool replacement policies you can use next week, but for this project we're using LRU because it's the simplest one. The one thing you need to be mindful about in your implementation is that the buffer pool manager, which I'll show on the next slide, needs to keep track of which pages are pinned. Pinned basically means that some thread says, I'm using this page, do not write it out to disk. The LRU replacer you're going to build doesn't need to know anything about pinning; it just needs to know, here's a page that was used, and update an internal data structure to keep track of that. And it's up to you to decide for this component whether you want to use your own extendible hash table, if you trust your own code enough to use it inside the LRU replacer, or whether you want to use STL containers. (There's a small sketch of one way to structure the replacer after this section.)

The last thing you need to build is the buffer pool manager. You're basically now taking the extendible hash table that you implemented before and the LRU replacer, and we'll provide you with the class file to drop them into. This is the part of the system that answers requests from other threads running in the database system, saying, give me page one-two-three, give me page four, or give me a new page, I need to write some data. This thing is responsible for actually managing the part of the database that's in memory. The buffer pool manager is what allows the database system to support databases that are larger than the amount of memory available to the system. Because otherwise, if you could just take the entire database and suck it into memory, you could run really, really fast; you wouldn't have to worry about anything getting written out to disk. That's what the advanced class will focus on in the spring, because there are a lot of tricks you can do to make those systems run really, really fast, whereas this course covers the classic disk-based database system. So the buffer pool manager basically uses the LRU replacer to figure out, all right, if I need to make new space, which page was least recently used, and then it writes that page out to disk and frees things up. We will provide you with the class files, already set up and hooked into our disk manager; the disk manager is the thing that actually does the reads and writes out to disk. You're just implementing the in-memory data structure and all the tracking information. The thing you do need to be careful about in your buffer pool manager is that when you make decisions about writing things out to disk, or freeing up space, you keep track of which pages are actually pinned, correctly. That means you need to make sure your buffer pool manager is thread safe.
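On that replacer point, here's a minimal sketch of the classic way to get O(1) LRU operations: a doubly linked list ordered from most- to least-recently-used, plus a hash map from page id to list position. The method names here are hypothetical, not the project's actual interface.

```cpp
#include <list>
#include <mutex>
#include <unordered_map>

class LRUReplacer {
 public:
  // Record that this page was just used (or unpinned): move it to the
  // most-recently-used end of the list.
  void RecordAccess(int page_id) {
    std::lock_guard<std::mutex> guard(latch_);
    auto it = pos_.find(page_id);
    if (it != pos_.end()) lru_.erase(it->second);
    lru_.push_front(page_id);
    pos_[page_id] = lru_.begin();
  }

  // The buffer pool manager calls this when it pins a page, so that the
  // replacer never hands out a page that is in use.
  void Remove(int page_id) {
    std::lock_guard<std::mutex> guard(latch_);
    auto it = pos_.find(page_id);
    if (it == pos_.end()) return;
    lru_.erase(it->second);
    pos_.erase(it);
  }

  // Pick the least recently used page as the eviction victim.
  // Returns false if there is nothing left to evict.
  bool Victim(int* page_id) {
    std::lock_guard<std::mutex> guard(latch_);
    if (lru_.empty()) return false;
    *page_id = lru_.back();  // back of the list = least recently used
    pos_.erase(lru_.back());
    lru_.pop_back();
    return true;
  }

 private:
  std::mutex latch_;
  std::list<int> lru_;                                     // MRU at the front
  std::unordered_map<int, std::list<int>::iterator> pos_;  // page id -> node
};
```

Okay, so back to the thread-safety issue.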
Here's the race to worry about: someone comes along and says, I want to pin this page, and in between, someone else says, I need to make space, evicts that page, and pins that frame for another thread. Now the block of memory the first thread is pointing to is garbage, or some other page got swapped in, so your database system will be corrupting itself. So make sure you do this correctly. And again, you can use std::mutex to make this go. (There's a sketch of the pin bookkeeping after this section.) I think on Autolab we'll try to do a little micro-benchmark to see how fast your implementation actually is. The cheapest, or easiest, thing to do is to put std::mutexes everywhere, but then you're essentially serializing everything down to a single thread at a time. If you're more sophisticated or nuanced about how you use your latches, then you can get better performance.

All right. To get started: the web page is up; the tarball for download is not up yet, we'll do that after class. We tested this on the Andrew machines, both the general pool for the university and the SCS ones in the labs on the fourth and fifth floors. It also runs on OS X; it runs on basic Unix. There really aren't any dependencies that we're bringing in; SQLite is designed to run everywhere, and the code we're providing you for your storage manager should be pretty portable. In theory it should also compile on Windows 10 using the Ubuntu subsystem. We haven't tried, but if someone tries that and it works, let me know, because we can post that it's available for everyone. If it does not work on your laptop or on some Andrew machine, please send us an email and we can figure out what's wrong. It should work; this should be portable; you should not have problems.

Other things to note real quickly: the project spec will tell you which six files you need to modify. You should only modify those six files. If you think you need to modify other things in order to get this to work, you're wrong. Trust me. If you modify files other than the six we tell you to modify, then when you get the update for the second project, those changes will get overwritten and it'll break your implementation. The other thing to point out is that all these projects are cumulative. That means you're going to take your buffer pool manager from this project and use it again for your second, third, and fourth projects. So if you bomb this one, you're in trouble, because we can't provide solutions; we want to be able to reuse these projects every single year. This is essentially what they do in the OS course as well. It's cruel, but it is what it is. This is CMU, so I think you guys are up for the challenge. And the last thing: any high-level questions about your implementation, or clarifications about the project specification, please post them on Canvas. If you have a question, somebody else might have the same question, and that way we don't have to answer it multiple times. You can also come to TA office hours and ask other things. But we're not going to help you debug your project. That means if you come to office hours and say, I'm hitting a segfault in my own code, we're not going to teach you how to use GDB and figure out what's going on. Okay? Again, this is an upper-level course. This is CMU. Okay, sorry, that's on video too; we'll bleep that out. Okay?
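On that pin-then-evict race, here's a stripped-down sketch of a buffer pool fetch path, just to show why holding one latch across the lookup, the eviction decision, and the pin update closes the window. Everything here (Frame, BufferPool, the method names) is invented for illustration, the disk I/O is left as comments, and the victim search is a placeholder where the real project would consult the LRU replacer.

```cpp
#include <cstddef>
#include <mutex>
#include <unordered_map>
#include <vector>

constexpr size_t kPageSize = 4096;

struct Frame {
  int page_id = -1;    // -1 means the frame is empty
  int pin_count = 0;   // how many threads are using this page right now
  bool dirty = false;  // page was modified and must be written back
  char data[kPageSize];
};

class BufferPool {
 public:
  explicit BufferPool(size_t num_frames) : frames_(num_frames) {}

  // Returns the frame holding page_id, already pinned, or nullptr if every
  // frame is pinned. The lookup, the eviction, and the pin update all
  // happen under one latch, so no other thread can evict this page in
  // between the lookup and the pin.
  Frame* FetchPage(int page_id) {
    std::lock_guard<std::mutex> guard(latch_);
    auto it = table_.find(page_id);
    if (it != table_.end()) {
      it->second->pin_count++;  // already in memory: just pin it
      return it->second;
    }
    Frame* victim = FindVictim();
    if (victim == nullptr) return nullptr;  // everything is pinned
    if (victim->page_id >= 0) {
      if (victim->dirty) {
        // ... write victim->data back out to disk here ...
      }
      table_.erase(victim->page_id);
    }
    // ... read page_id's contents from disk into victim->data here ...
    victim->page_id = page_id;
    victim->pin_count = 1;
    victim->dirty = false;
    table_[page_id] = victim;
    return victim;
  }

  // Callers must unpin when done, or the page can never be evicted.
  void UnpinPage(int page_id, bool dirty) {
    std::lock_guard<std::mutex> guard(latch_);
    auto it = table_.find(page_id);
    if (it == table_.end() || it->second->pin_count == 0) return;
    it->second->pin_count--;
    it->second->dirty = it->second->dirty || dirty;
  }

 private:
  // Placeholder policy: any unpinned frame will do. The real project asks
  // the LRU replacer for the least recently used unpinned page instead.
  Frame* FindVictim() {
    for (auto& frame : frames_) {
      if (frame.pin_count == 0) return &frame;
    }
    return nullptr;
  }

  std::mutex latch_;
  std::vector<Frame> frames_;                // fixed size, so pointers stay valid
  std::unordered_map<int, Frame*> table_;    // page id -> frame
};
```

The point is that the pin-count check inside FindVictim and the pin_count++ in FetchPage happen under the same latch, so no thread can evict a page between another thread's decision to use it and its pin actually landing.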
Is this clear? Again, we can't enforce the prerequisites for the graduate students, because you're coming from different universities, but for undergrads, 213 is the requirement, and I know how hard 213 is, so you should have the background for this. If you're a grad student, look at 513 and make sure you understand that material. Okay? We've provided some links on the website about how to debug and use GDB; look at them. Okay? And lastly, please, please, please, please do not plagiarize. I will destroy you. Right? That means you should not find random source code on the Internet, which you won't find for this exact project because it's never been done before, and just take wholesale copies of it. Now, extendible hashing is thirty years old; there are going to be a lot of implementations out there. So if you find reference ones you want to use as the basis of yours, I can't stop you. Just please don't copy code. The other thing that's really important: I ask you not to post your source code or your project implementation on GitHub, either during the semester or after the semester. I realize a lot of you want to do this because you want to put it on your CV to help you get jobs, but please don't, because then the people next year will end up finding it and copying it. Okay? So again, plagiarism will not be tolerated; we'll kick you out. Yeah, yeah, yeah. Okay? But I'm serious. Please don't plagiarize. Yes? The question is: doesn't a good hash table require you to have a good hash function? Absolutely true. But std::hash will be fine. Okay? We're not worried about performance so much for these projects. We will care about this in the graduate course if you take it in the spring, because there we have a real database system, and then you're absolutely right, we care about hash functions, so we use CityHash or MurmurHash. For this, std::hash is good enough. I can show you some charts where it does okay up to a certain key size and then CityHash and MurmurHash beat it. For this, it's fine. Actually, you need to use std::hash to make sure our tests work as well. Okay? Any questions? Trust me, normal forms aren't exactly thrilling, honestly, but they're worth knowing. We got through today; we'll get through Wednesday, and then we'll get to the good stuff. Right? Think of it this way: you're eating your vegetables now, and we'll get to the dirty tacos afterwards. Okay? All right, we're done. All right, guys. Thank you so much.