 Now, let us move to normalization, everybody's favorite topic, right? How many of you love normalization? Everyone, good. So again, what I am going to do is cover the principles and then we will do the tutorial examples. We have an example of a relation which is not normalized. So what is wrong with this relation? So what is this relation? It combines information from the instructor and department relation, so what all does it have? ID, name, salary, department name and then for the department, building and budget. So you can see a few problems in this table. You know that a given department according to our principles has a given budget and a given building and in particular, even if it has more than one budget or more than one building, its budget and building have no connection with the instructor. So what is happening here is if there are two people in computer science, let us pick CADs and Srinivasan, both are in computer science department, we will see that in this two places the same thing, Taylor and 100,000 appears and what are the problems with this? It is redundant, in the sense the same information appears twice unnecessarily. Worse still, when an update is done, you may end up updating one of them, not the other and then you have no idea which is the correct one, in fact this is a very common situation. In fact this keeps happening all the time, so I have a car which I drive and I deal with the servicing of that car but somehow my wife's phone number landed up in the dealer and they keep calling her, bugging her about my car and I have told them any number of times, please don't use that number, use this, no, it never gets corrected. Year after year, when the servicing is due, they will call her. So what has happened is they have copied this information in various places and now if I tell one person to update it, maybe it is just human issue, they have not given an update interface to the reception person but whatever it is, they are not propagating required updates to all the copies, if there was just one copy and it is updated, this kind of a thing will not happen and it is exactly the problem here. So if you did this on the other hand, so there we combine the instructor and department, here in the process of doing the ER model yesterday, we had a section entity and a section class relationship, we could have created a relation for section and a relation for section class using our principles. But we chose to combine the two schemas and actually this is, this is another example, this slide is another example of problems due to redundancy. Guess what the problem due to redundancy is, there is a, this schema here does not match the book schema. So I used this slide while ago, I created a subset of, a set of slides for this course in 2010, afterwards we realized there is a bug in this slide and several other, you know, we found a few other bugs, all of these bugs have been found and corrected but in another copy of the slide, not this one. The master copy which is on the book website, we have corrected all those but this one got left out. So we are not applying our own principles to our own slides but why, you know, the reason is that it is not easy for me to apply that principle because I am creating a copy of the slide accessible to you and it has a subset of slides from the main thing and not just a subset but I have also tuned some of the slides for this course. The moment I do that, I run into these consistency problems. Now I have to go and find out all the updates that happened there and apply them again here which is actually a big pain, it is very difficult to track that, it is possible but it is a lot of effort so I ended up not doing it clearly. Okay, so in this case section class, it would have not only section ID but also course ID, section ID, semester year, that was the bug. Section ID is not a primary key for section, therefore section class should have had all those extra attributes and the merge schema included all those plus building and room number. So this one was okay, this is what we are using and that's because why is that okay because the section has only or at most one building and class room, room number. So there will be no repetition, so if section had multiple of these then there would be repetition if I combined the schemas. So the key point here as all of you know is a notion of a functional dependency. Functional dependency is what lets me keep as combined schemas or keep a schema combined as opposed to breaking it up into parts. So the basic approach, so this is actually a meta point I want to make here. A lot of the theory on normalization assumes that you start from some big relation, so the formal theory initially assume what is called a universal relation which has all the attributes of all the things that you want to model and then functional dependencies and then you start breaking it up. You decompose it into the final schema you want. It turns out that is a very unrealistic model, nobody builds designs that way. The way people build designs is more bottom up. In fact, the recommended way is to go through the ER modeling exercise and then come up with a set of relations. But in the process of doing this, you no longer have that big universal relation with everything in the world, you have a smaller number of relations. What you need to do here is find out if those relations have, the resulting relations have some inconsistencies still. It turns out that most of the time, you don't have too much work to do. Thanks to, if you do the ER modeling properly, normalization is usually very easy, there's very little left to do. But still you should go through it in case you miss something in the ER modeling, this can catch problems in the earlier stage. Okay, so coming back here, we have a functional dependency, department name, functionality determines building and budget. And when we looked at this one, is department name going to be unique here? In this combined table, no, it can repeat. And therefore, the other information will get repeated. That's the key idea. So again, all of you are familiar with this. So what we want to do is decompose in order to fix this redundancy problem. But again, as you recall, you can't decompose just as you please. What happens if you decompose arbitrarily? Some information is lost. It's actually lost, but it's kind of funny. It looks like what happens is you may land up with extra tuples, but less information, it's counterintuitive. More data does not mean more information. So take this example, we have employee ID name street city salve. That's not the one we are using, it's just another new relation. So we have this information and if you decompose it into employee ID name, employee one ID name, employee two, name street city salve. What can go wrong here? Multiple people may have the same name. So if we decompose and then join, I think I have an example here. I have two Kims. One of whom lives in Main Street in Peririch. The other lives on North Street in Hampton. When I decompose these relations, I have these two tuples here and these two tuples here. Now, how do I get back the information when I decompose? To get back the original thing, I have to join it back. So when I join, I land up with four tuples here. This first thing here unfortunately now matches two tuples. It should have matched only the first one because that is the correct information. This employee ID lives on Main Street in Peririch. This person does not live on North Street in Hampton. But after decomposing, we have lost that particular association. And therefore, if we do a join, we get two extra tuples. But what we have actually lost is this information. Therefore, this decomposition is lossy. Our goal is to avoid lossy decompositions. We don't want to lose information. So we have to be careful. We have to decompose the avoid redundancy, but we should not overdo it and lose information. Again, you'll be familiar with this term. A lossless joined decomposition is one where if you take the original relation with a given set of functional dependencies, you decompose it and now if you join it back, you will get the original one as long as what? As long as the original data respects the functional dependencies. If it violated the functional dependencies, when you join it back, you're not going to get the same result. This should be clear. The decomposition is lossless only as long as the functional dependency held. If it doesn't hold, it will be lossy, potentially. There's another assumption here, which is that the relations which the attributes which we are using for functional dependencies, they are not null. If they are null, when you try to do a join on null, you're going to lose the tuples will not match. That's another assumption. So whenever you have a functional dependency, the left hand side of the dependency at least should be not null. So these are things which are often overlooked partly because when people initially did this theory, they didn't worry about null values. That was a distraction. Null values were introduced later to deal with a practical need. But people didn't explicitly mention these assumptions afterwards. Excuse me, sir. Yeah. The original relation that you have shown. So what are the chances that we can get the relations of that kind? If we strictly follow an algorithm to map an error diagram into a relational model, likely chances of getting such relations is in a dim. So that's a good question. As I said, it's unlikely it will happen if you did the error modeling properly. However, what often happens where the problem may arise is if you decide not to model certain things as entities. And you simply store the attributes in there. And this does happen when we did the IPL. We decided we didn't want to model something as entities. And then, if you have multiple attributes, there may still be functional dependencies there. And they're attributes of an entity with functional dependencies inside. And when you create a table out of it, you may violate some of the normal form. Do you mean to say that chances are only with respect to ad hoc approach? Like if you ad hocly, I mean, sub-tables may emerge. So if you go through ER modeling and you do a good job of it, there shouldn't normally be any issues left. But even if you do a good job of the ER modeling part, because you can't model everything in the world. At some point, you give up and say, this is a name. I'm not going to model it as an entity. So there is something there. But eventually, you will realize there are some functional dependencies across those. If you did it properly as a separate entity and a relationship, things should have worked out. Because the way we convert things back to the relational model, we are taking care of constraints and generating appropriate schema. We don't do merging of schema, for example, unless some dependency is satisfied. Otherwise, you'll keep it separate. So it's already decomposed to start with. So normally, it should not be required. It's an integrity check in some sense. Doing the analysis of functional dependencies. Make sure that you didn't make a mistake in the ER modeling. Of course, many people directly throw in relations, skipping ER modeling. Or they do an ER modeling and then add a few relations, which were not part of the model. Then you have to watch out for it. One more thing is that accessibility of relational model. Like unique constraint, then not null. Again, we have to refer back to requirements and then decide what is null, what is not null. Then is it not that the relational model lacks in expressibility? So let me take, I think there are actually two questions here. How do you decide what are the constraints? That's actually not an easy task. So if you talk to the end user, they will say that a department should have a building associated with it. And this was mostly true of IIT Bombay for a long time. Until some departments got a second building. And some have a third building now. So what seemed like an obvious integrity constraint, functional dependency, is now violating. But on the other hand, even if you know it's not a functional dependency, those who are familiar with multi-valued dependencies, will realize that there is a multi-valued dependency from department to building maybe. Because the connection from a department to a building has nothing to do with the connection from a department to its budget. So at least this part of the thing, if you have a dependency which you are not sure will always hold, you may say that, okay, right now it holds, but I'm not going to assume it for the purpose of the schema that I'm creating. But let me check if some other weaker form holds, like the multi-valued dependency. If I don't even assume that, then I can get into a schema with redundancy. So if I say that neither of these holds, then I might create something like what we saw, where the building name is repeated many times. So it is important to keep these in mind when you do the design. But it's also important to know what maybe holds today, but may not hold tomorrow. If it changes, you don't want to go and change the schema whole same. So you should design, at some point it's an art. Talking to users and understanding what constraints are fundamental and what just happened to hold now, but might change. That's more of an art, that there's no science to it. And that is something which has to be done before all the formal theory can be used. Because why am I asking you this question is that the implementer will look at the relational design and then start implementing. My point is that, looking at the mere design, it's not possible to conclude that this is null, this is not null. Again, he has to refer back to the requirements. I mean, other mechanisms, at a glance I can make out which is null, which is not null, which is unique, which are the candidate keys. So all of these ultimately go back to the requirement. So this is again where the ER model helps capture many of these requirements. If you go straight into relational model, again you have to go back to the requirements to understand what can be a null, what can be not null. But even if you do ER modeling, you have attributes. Which of these attributes can be null? Can you have a name of an employee as null? That's probably not something which anybody would allow. So that is part of requirements gathering. We are looking at the formal process that comes afterwards. But if you mess up the initial part, garbage in, garbage out. Things are driven by notion, this cannot be null, then more of being formal. Yes, at that point, you have to understand what is a requirement. You have to understand the final application, who is using it, how it will be used and make these decisions. So the theory cannot help you with that. So you have to apply your intelligence there to do it. So in that sense, this is an R. You have to apply your intelligence and do it. Of course, the theory might have a series of steps, check for all these. Maybe a checklist. But finally, should you do it or should you not? It's a judgment call which you have to make. Okay, any other questions? So coming back here, we have a relation R, A, B, C. We have decomposed to A, B, B, C. And the join of these two is, in this case, it is the same as the original. Now, I say that a decomposition of this instance of the relation may be lossless. But what if some other state of the database was such that the decomposition is lossy, how do I know? So all that is based on integrity constraints, functional dependencies that are guaranteed to hold on any valid instance of the relation. So we'll come to that. And what does all this lead to the functional dependencies and later multi-valued dependencies and other dependencies? It leads to the ability to create a schema which has certain good properties. So these are normal forms. The first normal form is actually a very simple thing. It says that attribute values are atomic. Now, if you have used relational databases, this seems obvious, right? Today, some of them do support set-valued attributes or arrays, but earlier they didn't. So why is this even talked about? The answer is older databases actually supported set-valued attributes. Relational, and that causes problems. And the relational model actually threw away those things to make it cleaner. And what we insist in the first normal form is that the domain is atomic. Now, what does it mean to say the domain is atomic? Values cannot be split up in any meaningful way. So for example, if we took a bank where each customer has several accounts, we could store the set of accounts with each customer. It's a set, an attribute which is a set. This causes problems with querying and so on. And the first normal form insist that you shouldn't do this. Create a separate relation which links the customer to the accounts. Each tuple is atomic value. In fact, there's another kind of issue which crops up even more frequently. In IoT Bombay, roll numbers have a certain format. Most places have roll numbers which start with a year. And then there are other fields. Now in IoT, the roll number encodes various things. It has two digits which stand for which department you are, one digit for department, two digits for which program, and so on and so forth. And this caused a lot of havoc because this was a primary key, roll number was a primary key. And the problem is what if a student changes the department? It happens. Now the roll number changes. What if a student, for our MTech students, the roll number would encode whether they are receiving a TA ship or not. This is madness. In the middle of a semester, a student changes from non-TA to TA. And the student list I get at the beginning has one roll number. When I submit my grades, there is another roll number. I have no idea what happened. All this happened because the roll number was not atomic. And it encoded stuff which it shouldn't have encoded. So after a long fight with the academic office, we told them this is a stupid design, get rid of it. Of course, they won't listen to us. So after a long fight, the maximum we could get is, we will no longer change roll numbers. Of course, now there's a roll number which says, department is electrical engineering and that person doesn't see us. Brand change, so be it. You should not have put it there, but now that you have put it there, don't interpret it. There's a lot of pushback from the users, from those people. They don't understand what are the implications. They just look at their part of the world. The academic office only cares about what they see. They don't care about what problems others have. So anyway, coming back here, atomicity can be because of sets. It can be because you interpret various digits in a key. You shouldn't do that. In fact, we had applications which would extract two characters from the roll number to decide which is the department of the student. Those applications had to be rewritten because they said, you can't do this anymore. And in fact, once you did that, it was a lot better, because now you can have a foreign key dependency which makes sure that those two digits are valid. You can have queries which do joins and so on, which you couldn't do before. So it's a lot better than before. This is the example of roll numbers here and so forth. In the last-less decomposition, that master table just normalized into two tables, right? Yeah. Last decomposition, just a previous slide. Yeah, that one. In master table, just you normalized into two tables. This particular one was a mistake in breaking it up. Yeah, in the second table, is it necessary to consider the reference key or not? Again, just you are joining the two tables, right? R1 and R2, you are joining, applying the natural join. Yeah. In the second right table, just see is it necessary to consider the foreign key or not? The problem is that, this is a bad example. Why are you creating a foreign key on a choice which is a mistake? Don't do this. The point is to say that don't do this. Now supposing the decomposition were actually lossless, we'll see the conditions for it coming up, but then there would be a functional dependency and corresponding to the decomposition, we can also create a foreign key dependency. We'll come to that later. Okay, thank you. So the whole normalization theory looks for bad designs which have problems of redundancy and so forth, or it could also have a problem of not being able to represent certain information. So yesterday we talked of binary versus ternary relationship, right? So I said here is an example which cannot be cleanly done as a binary relation. It is actually a ternary relation. Now of course we could encode it as a binary relation by creating a new fake entity which we did for the IPL example. So that workaround is always there. However, if I carelessly just created binary relations, I will not be able to represent the information that a particular student is working under a particular professor on a particular project. So that was the example, student, instructor, project. Now normalization theory, if you have already created that, it actually doesn't help you. If you started with the universal relation model, then it'll help you. If all of them were together, we had a ternary relationship and then say, is it okay to decompose it? The theory will tell you, no, it's not okay because certain functional dependencies don't hold. You cannot do this. But if you started with a design which was already decomposed and then you apply the theory, the theory will say it's fine. So if you start with the universal relation, you're okay, but if you start with something else, the theory may not be able to help you after the mistake has been made. Is this point clear? What it does is it takes a relation which you have and tells you whether it's okay to break it up. But it doesn't tell you, maybe you should have combined these. That's not part of what the theory says. So if you want to apply it right, maybe you should go back and try combining all the relations and then see if it's okay to break it up to get what you have. Then you will catch that. So what is a functional dependency? It's a constraint on the set of legal relations. A relation which violates the dependency is illegal. That's the key thing. You should not store such data in the database. And we will use the notation alpha and beta as usual of, if you're familiar, this is commonly used. I think all textbooks use similar notation. So alpha is a set of attributes. Beta is a set of attributes. And these are all attributes of some relation, r. So I'm going to say let r be a relation schema. What do we mean by relation schema? I mean the set of attributes of that relation. If you look at the SQL schema, it includes the types and so forth. We are not going to bother about the types here. It's not important. So the set of attributes of a relation is what matters. And we'll use capital R to denote a set of attributes. And relation instance, r to denote an actual relation, small r, that's a notation. The functional dependency alpha determines beta holds on r. If and only if any legal relations are on that set of attributes, whenever two tuples T1 and T2 agree on alpha, that is they have the same alpha, then they also agree on beta, that is they also have the same beta. So what is this saying? T1 of alpha means project the tuple T1 on the attributes in alpha. So if the projection of T1 and alpha equal to projection of T2 and alpha, then the projection of T1 and beta must equal to the projection of T1 on beta. Now, on this instance, RAB does A determine B hold and doesn't because these two tuples both have one, but B is different. How would B determine say, it holds on this instance. That is not the same as saying it must hold on the relation. Maybe just that I don't have enough data. This is the data I have. Maybe there's another legal thing which I would consider legal in the real world where there are two tuples with the same value of B but different values of B. So determining whether a functional dependency holds cannot be done by looking at instances. It's a property of the real world which you have to understand and use that to decide what holds in the first place. Given a set of functional dependency hold, the theory will help you figure out what else holds and how to go about normalizing the relations. So we'll come back after the break and go more on normalization and functional dependencies.