So, the slides here are organized in the following manner. We start with the intuition of what normalization is, and then we give a very brief definition of normal forms after introducing the notion of functional dependencies. After that we get into the details of functional dependencies, including how to do attribute closure, and then get into the definition of BCNF once we have seen how to do inferencing with functional dependencies. Then we will also look at canonical cover and third normal form, and then a little bit about the database design process and temporal modeling towards the end. So, as is traditional with normalization, we start off by showing a bad design, in this case where we have instructors who have been joined with their department information. This could be the result of a query, which is fine; but if instead of storing instructor and department separately your design somehow combined the two, you are going to get a table like this. Let us look for some repetition here: Katz is in Computer Science, as is Brandt, and the building and budget information is repeated for each of them. So, that is a simple example of redundancy; I am sure you are all familiar with that. Here is another example: we had section, and then we had the sec_class relationship. Supposing we combined these two, we have course ID, section ID, semester, year, building, room number. These could have been two relations which we would have got from the ER design if we created a separate relation for sec_class; we actually did not, we combined them to get this thing here. And there is no problem with repetition: although we have combined two tables, there is no problem with repetition.
So, what these two examples illustrate is that in some cases there is repetition and in some cases there is not, and functional dependencies basically help us understand in which cases there will be repetition and in which cases we can be sure there will not be. Why are we sure there would not be repetition here? Well, this is actually a special case, where this relation is really the same as that one: when we combine them there is no extra information, so in the merged version we are actually throwing out one relation. But in general this would not be the case, where one is subsumed by the other. So, the basic idea is that we start with a design which probably came from ER, but in many cases people directly come up with a database design, and then we need to figure out whether that design has redundancy. There is another thing which is required, which is not actually covered by functional dependency theory, which is: can this design actually represent all the information that we need to represent? We will come back to that issue in a little bit. So, the idea is that if there is redundancy we will decompose. And how do we detect redundancy? There is repetition of information of some form. So, what tells us that information is repeated? There are several kinds of constraints on the schema, on the data which is stored in it, which tell us that in these cases some information is going to get repeated. Now, what kind of constraints are there? Functional dependencies are the most widely used kind of constraint; they are the most common and the most useful for telling us that there is some redundancy, but they are not the only kind. There are others which theoreticians have come up with, including multivalued dependencies, which we will see briefly, and then there are other kinds of dependencies, which we will not cover here, which also show that there is redundancy. So, there is redundancy; what is the standard way of dealing with the problem of redundancy?
The standard way is to break up relations such that if you join them back you get back the original information, but no redundant information is stored. Which also brings up the question: why are we so worried about redundancy? Is it disk space? Certainly not. Today we have such large disks that, except for a few web-scale databases (the email and other information that Yahoo or Facebook keeps about you), for most organizations space is not really a constraint for relational data. For video data, yes; for relational data it is not space. So, redundancy is not about space; redundancy is about inconsistency. Why would you have inconsistency? If you took instructor and department and joined them, we can be sure there is no inconsistency between two tuples in there which both belong to Comp. Sci.: we know they will have the same building and the same budget. But if you stored the relation as a join and then you updated it, there is a risk that you updated one tuple and did not update the other. And what makes it worse is that afterwards, once this has happened, somebody who comes in does not know which is the correct one: is it the first one or the second one which is correct? They don't know. So, you want to make sure there is no scope for these kinds of errors, where an update can leave the database in an inconsistent state. If you ensure there is no redundancy, there is no problem with updates. Now, there are other ways of dealing with it too: if there is any redundancy, it is not at the level of something which the user would directly update, but it could be something which the database computes. An example of this is materialized views, which we discussed earlier. In a materialized view there is redundancy; it is repeating information which is in the base tables.
However, here update is not an issue, because you will not normally go in and directly update the materialized view; that is usually the job of the database system, and the database system will ensure that when you update the underlying relations, the view is kept up to date. So, there is no real problem with redundancy as far as consistency goes. That is, if you want performance, you make sure that there is no redundancy in the base design and then use materialized views to deal with performance or other issues; that is a general way in which systems are designed. Now, in this particular case, not surprisingly, the functional dependency which tells us that there is a problem when we join instructor and department is: department name functionally determines building and budget. Given a department name, the building is unique and the budget is unique; you cannot have two different budgets for the same department, or two different buildings. Now, of course, the computer science department here actually has two buildings, so this models a situation where each department is in one building. If we did not want to insist on that, we would have to have a separate department-building mapping table; to keep the schema small, we have modeled it this way. Again, a lot of the motivation for our schema design is not to make it realistic, but to keep it simple. Now, here is another example of a decomposition which is bad. Supposing we had this relation employee(ID, name, street, city, salary), and we decompose it into employee1(ID, name) and employee2(name, street, city, salary). Now, what would go wrong here? Yes: if there are two employees with the same name, what is going to happen is that this one is okay, but this one will not map uniquely to one tuple here; there are going to be two rows for two different employees with the same name but different street, city, salary.
And if we join it back, we are not going to know which was which. Take the example here: supposing I started with this, where these two people had the same name. The name Kim is common to about half the people in Korea, apparently, so there is a high probability of a name repeating. If you decompose it and then join it back: here are two tuples; this first tuple and this first tuple actually were matching, and the second and the second were matching. If we join this back, we are going to get four tuples. So, when we decomposed the relation and then joined it back, have we lost tuples? No, we got the original tuples back, but we also got extra tuples. So, what have we lost? We have lost information, not tuples: the information that this address corresponded to this Kim and that address corresponded to the second Kim. So, this decomposition is a lossy-join decomposition. Now, you might start with a schema where by mistake you have already decomposed it into a lossy-join decomposition. How will you recognize this? If you started with a schema which is already decomposed, and you now use functional dependencies to check if there is any redundancy, you may not find any redundancy. If you look at these two schemas, already decomposed, is there any redundancy here? There isn't: here the functional dependency says ID determines name; here there is no functional dependency, assuming two people may share the same address, which can happen if they are married, for example. There is no redundancy here. So, if you look at these two relations, everything seems fine. But there is a problem: if you look at the information of the enterprise you want to represent, you will realize that you are not able to represent certain information. So, that is a separate issue which you also need to be concerned about in design.
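The lossy join above is easy to demonstrate concretely. Below is a small Python sketch of the decomposition of employee(ID, name, street, city, salary) on the non-key attribute name; the two Kims and their addresses are made-up values for illustration.

```python
# Concrete demonstration of the lossy join: two employees share the name
# "Kim", so joining the projections back on `name` produces spurious tuples.

employee = [
    (1, "Kim", "Main St", "Perryridge", 50000),
    (2, "Kim", "North St", "Hampton", 60000),
]

# project onto the two decomposed schemas (sets drop duplicate rows)
employee1 = {(eid, name) for (eid, name, _s, _c, _sal) in employee}
employee2 = {(name, street, city, sal) for (_e, name, street, city, sal) in employee}

# natural join on the common attribute `name`
joined = {
    (eid, name, street, city, sal)
    for (eid, n1) in employee1
    for (name, street, city, sal) in employee2
    if n1 == name
}

print(len(employee), len(joined))  # prints: 2 4 -- extra tuples, lost information
```

The original two tuples are still present in the result, which is why we say we have lost information rather than tuples: we can no longer tell which address belonged to which Kim.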
Now, here is an example of a lossless-join decomposition. This is just an instance of these two relations, where we decompose into AB and BC, and what do we see here: we project and join back, and we get the same thing back. Now, what property, or rather what kind of constraint using functional dependencies, would guarantee that if you decompose like this and then natural-join it back, you will get the original relation? The common attributes, yes. So, R1 intersection R2 — if you mean the intersection of the attributes, yes, not of the relations — the intersection of the attributes is, in this case, B. That should functionally determine either the remaining attributes of this relation or of that one. Now, why is that a sufficient condition? Again, this is an example which satisfies the functional dependency, but in reality we have to insist that the functional dependency holds on the relation; that is, any legal instance of the relation must satisfy the functional dependency. We cannot just look at this example and say the relation satisfies the functional dependency; all we can say is that this instance satisfies it. So, what is the functional dependency? In this case, we could insist that B functionally determines C. What does that mean? If there are two tuples which are the same on B, they must be the same on C also: here, if there are two tuples which have the same B value, their C values must be the same. Or, alternatively, we could insist that if two B values are the same, then their A values must also be the same. So, that is a sufficient condition; it is not a necessary condition, and there are other conditions which also guarantee losslessness. (Student: What is the exact difference between decomposing a relation and fragmenting a relation? Because in distributed databases there is a fragmentation concept.) Right.
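The sufficient condition just stated — the common attributes must functionally determine all attributes of one of the two relations — can be sketched as a small Python check. The closure() helper here is the standard attribute-closure loop, which we will meet properly later; the FD representation as (lhs, rhs) pairs of frozensets is my own choice for illustration.

```python
# Sketch: a decomposition of R into R1, R2 is guaranteed lossless-join if
# (R1 ∩ R2) functionally determines R1 or R2 under the given FDs.

def closure(attrs, fds):
    # attribute closure: repeatedly fire every FD whose lhs is covered
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def lossless(r1, r2, fds):
    common = r1 & r2
    c = closure(common, fds)
    return r1 <= c or r2 <= c

# R(A, B, C) decomposed into R1(A, B) and R2(B, C), with B -> C holding on R
fds = [(frozenset("B"), frozenset("C"))]
print(lossless(set("AB"), set("BC"), fds))  # True: B+ covers all of R2
```

With no functional dependency on B at all, the same call returns False, matching the lossy employee example.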
So, in distributed databases there are two notions of fragmentation: one is vertical and the other is horizontal. Horizontal fragmentation is just keeping some tuples here and some tuples there; that does not have any impact on the schema itself, because you can always union the tuples and get the same thing back. Vertical fragmentation is identical to decomposition. So, in a distributed database, when you do vertical fragmentation, we will insist that there is a key; essentially the same lossless-join decomposition condition will be used there. In fact, we generally use a stronger condition there: that there is a value which is a key for both of the sides. For lossless join we only need to insist that it is a key for one of the sides. I am kind of jumping ahead — I have slides on all that a bit further on — but in the distributed database case normally you would have something which is a key on both sides. Yeah. Any questions? So, we also have this notion of normal forms, of which the first normal form does not depend on functional dependencies, but the later ones are defined in terms of functional dependencies, and some of the other ones are defined in terms of other kinds of dependencies, like multivalued and other dependencies. So, what is first normal form? This is one of the basic things; again, most of you are familiar with this. We insist that every domain is atomic; that is, in any relation, in any attribute, whatever value is there should not be decomposable into parts. So, what does that mean? You cannot have a set. The history here comes from before the relational model: in the earlier network and hierarchical models, it was common to have a set of values.
So, you could have a record where one of the attributes was a set, and the relational model basically recognized that this is a bad idea, because it complicates the whole model for joins and everything else, and it also encourages designs with redundancy; therefore the relational model insisted that any value in there should not be a set. It should be atomic, but atomic actually means more: it is not just that the value is not a set. Even composite attributes are, strictly speaking, banned. However, we know that they are not really a problem: a composite attribute is just a name for a collection of attributes. But at least in the basic relational model that is also banned, and we already saw how to convert a composite attribute to a flattened form. We also saw how to convert a multivalued attribute to a relational schema, so we know it can be done. So, a relation is in first normal form if the domains of all its attributes are atomic. Now, atomicity is not just that the value is not a set; it is also a property of how the elements in the domain are used. For example, at IIT Bombay in earlier years, we had a roll number which encoded the year the person joined, the branch, and then some other things, and finally a sequence number at the end. And people actually interpreted these digits. They would say the fourth and fifth digit, or the third and fourth digit, is a code which says which branch the person is in. And not only would humans look at the roll number and say, oh, this person is in computer science or this person is in EE; we even had programs, in our application programs, which would look at those two digits and say this person is CS or EE. Now, what is happening here? You are actually breaking up what appears to be a string into pieces; you are interpreting it. So, this is also an example of something which is not in first normal form, because you are breaking it up. Now, what is wrong with doing this?
Well, there are many things, but there were several important practical problems with this. One of them was that you are breaking up a string into pieces, and those pieces are supposed to be a foreign key; you cannot enforce that in the relational model. That is one problem. The second problem is: what if a student changes branch, which can happen? Now the roll number has to change, because those two digits are interpreted as the branch. If you do that, a change in one relation, which recorded which branch the student was in, required a change in the roll number, which caused a cascading change in every place where the roll number was used. So, that was a mess. So, internally the system was actually designed to store some other student ID, which was unique and would not change; everywhere in registration and other information the student ID was used, not the roll number, which again complicated the design and queries and so on. But the worst problem was that roll numbers would get updated periodically. Inside the database you could deal with it. Outside the database, when I have the marks for my students and in the middle of the semester a student's roll number changes, the list I get at the end of the semester for putting in grades does not match the list I had at the beginning of the semester. And I have to guess that this student is really now this roll number. How do I do that? I do it on the name. What if two students with the same name both had a roll number change? We are in trouble. So, the bottom line is: if you start interpreting parts of an attribute, there are a number of problems with it, so you really should not do that. Humans may do it, but in the database you should not do it. (Comment: I normally have a very tough time explaining this atomicity; one could argue that composite attributes are not atomic, and sometimes some people say...) You are right.
If you treat a set of values as a whole, never break it up into pieces, never look inside it in the database, and it is only interpreted externally, it may be okay. There are also other reasons, such as efficiency of storage, why people sometimes use set-valued attributes. There are issues. If you just want to represent data correctly, you should not worry about efficiency. But when you build an application, you obviously want it to run reasonably fast and not take up too much space. So, it would be nice if you could have a conceptual model and then a storage model which differs, maybe significantly, from the conceptual model. That would be nice. So, for example, let's say I want to build an index on the words in a document. Now, the database may support the index natively, but supposing it does not. Then one way to do it is to create a relation with a record ID, a primary key, and a keyword. So, I have a relation with record ID and keyword. If record ID 1 had keywords A and B, I will have tuples (1, A) and (1, B). If I allowed a set of keywords, I could have written it as (1, {A, B}). Now, if you look at the first representation, there are two tuples; each tuple has some space overhead, and also the record ID is represented twice, whereas it is represented once in the set version. The set version is not allowed in first normal form; in first normal form you have to use the first one. As long as the relations are small, we do not care. So, why do we bother about normal forms? There are two reasons. One is redundancy and update issues and so on, and the other is efficiency. There are trade-offs between these, and in general you would like to not worry about efficiency, at least at the beginning. At the end, when you realize there is an efficiency problem, give a bit of slack and move away from conceptual cleanness to allow a slightly more efficient representation. So, for example, in this situation, we have a database with lots of documents.
And the keyword list for a document can become very, very large. We also need the opposite direction: given a keyword, I want to know which documents contain it — 1, 3, 9, 11 and so on. So, this relation here actually includes information which is equivalent to either of these; both of these would be derivable from it. Conceptually, the relational model says store this, and then if I look up all the tuples for a keyword, I will get all the record IDs; if I look up a record ID, I will get all the keywords. Conceptually, that is fine. But practically, there are two issues. If the relation is stored in record-ID order and I want to find all the records with a particular keyword, they are scattered all over the relation; conversely, if I want to find all the keywords for a record, they may be clustered. But if I flip it and cluster the relation on keyword, then the other lookup becomes inefficient. So, for efficiency's sake, I may actually keep this and this; depending on what I want to do, I may keep both, although this is the conceptual model. So, again, queries may have to be written taking into account the fact that this is the actual schema. But if you had this conceptual schema, and the database would automatically maintain these two and provide them as views, then I could query those views but store the normalized form as the underlying representation. (Student: But this could be viewed as akin to maintaining indexes, or tweaking the physical level?) Exactly, or tweaking the implementation level. That would be ideal; that is the correct way of doing it. In fact, there have been a lot of proposals which say that you can have a conceptual schema, but the physical database may store some other schema — it does not even store the conceptual schema; for efficiency it stores some other schema. But you can write your queries and updates based on the conceptual schema, and the database deals with all of this.
So, again, no database really supports that fully, but some aspects of it are supported. Okay. I think we have digressed a bit; let's come back to make sure we finish on time. Okay. So, most of this chapter is going to focus on decomposition based on functional dependencies, and a little bit on multivalued dependencies. So, let us look a little more formally at what a functional dependency is. Let us look at an example, then I will go back. A functional dependency has a left-hand side and a right-hand side, both of which are subsets of the attributes of the relation schema; you are probably familiar with this. And we say that a functional dependency alpha → beta holds on a schema if, in any legal relation on that schema, whenever two tuples of that relation agree on the attributes alpha — they have the same values on alpha — then they must also agree on the attributes beta. That is, for any pair of tuples t1, t2 in that relation, t1[alpha] = t2[alpha] implies t1[beta] = t2[beta]. This is the standard definition. As an example, if you have this instance, with this as A and this as B: A functionally determines B does not hold, because there are two tuples here with the same value of A and different values of B. B determines A holds on this instance; that does not mean it will hold on all instances of R, unless we say that it is a constraint on R. So, we use functional dependencies in one of two ways. We either say: here is an instance, let us see what holds on it — but that is not particularly useful. What is more useful is a constraint on the relation, which says that in any legal state of this relation, this dependency must hold. That is what we are going to be using. So, now, we have seen this notion of super keys on a schema. If R is the set of attributes in the schema, and K is any subset of it, then we say that K is a super key if K functionally determines R. Again, you are probably familiar with these; for those of you who are not, we are going over this. And that is a super key.
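The definition just stated translates directly into a check on an instance. Here is a minimal Python sketch: alpha → beta holds on an instance if no two tuples agree on alpha but disagree on beta. Tuples are represented as dicts, and the instance mirrors the A/B example above with made-up values.

```python
# Direct translation of the FD definition: for every pair of tuples,
# t1[alpha] == t2[alpha] must imply t1[beta] == t2[beta].

def satisfies(instance, alpha, beta):
    seen = {}
    for t in instance:
        key = tuple(t[a] for a in alpha)
        val = tuple(t[b] for b in beta)
        if key in seen and seen[key] != val:
            return False  # two tuples agree on alpha but differ on beta
        seen[key] = val
    return True

r = [{"A": 1, "B": 4}, {"A": 1, "B": 5}, {"A": 3, "B": 7}]
print(satisfies(r, ["A"], ["B"]))  # False: A=1 maps to both B=4 and B=5
print(satisfies(r, ["B"], ["A"]))  # True on this instance
```

Note this only tests one instance; as the lecture stresses, an FD holding on a schema means it holds on every legal instance, which no instance-level check can establish.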
What is a candidate key? It is a super key which is minimal, so you cannot throw anything out. That is, K is a candidate key if and only if K functionally determines R — that is uniqueness — and for no proper subset alpha of K does alpha functionally determine R. Do we actually have to check every subset of K? To check if it is a candidate key, do we have to look at every subset of K? We don't; we can just look at subsets which are one attribute less. So, if K has five attributes, we can look at all four-attribute subsets, removing one attribute at a time, and check if the remaining four attributes are a super key. If any one of these four-attribute subsets is a super key, then we can be sure that K is not a candidate key. If none of the four-attribute subsets is a super key, then we can say that K is a candidate key. Also note that functional dependencies generalize the notion of super keys. If you only have the idea of super keys, which is the key declaration in SQL, then consider the functional dependency on this combined instructor-department relation, which says that department name functionally determines building and budget. We cannot express that using only super keys. But we can express it by saying department name determines building, and department name determines budget. We could also say that ID determines name, salary and department name, and that ID also determines building and budget. All of these hold. But you would not expect that department name functionally determines salary, because there may be two people in the same department with different salaries. So, given a set of functional dependencies, we can check if a particular relation instance is legal; that is one way, and we say that the instance satisfies F. Or we can specify constraints on the set of all legal relations, where we say that the set F of functional dependencies holds on the schema R. That is the notation we use.
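The "remove one attribute at a time" test for candidate keys can be sketched as follows, using attribute closure (covered in detail shortly) to test superkey-ness. The schema and FD are the instructor example; the function names are my own.

```python
# K is a candidate key iff K is a superkey and no subset of K with exactly
# one attribute removed is still a superkey.

def closure(attrs, fds):
    # standard attribute-closure loop over (lhs, rhs) frozenset pairs
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_candidate_key(k, r, fds):
    if not r <= closure(k, fds):
        return False  # not even a superkey
    # it suffices to check subsets that drop exactly one attribute
    return all(not r <= closure(k - {a}, fds) for a in k)

# instructor(ID, name, salary, dept_name) with ID -> name, salary, dept_name
r = {"ID", "name", "salary", "dept_name"}
fds = [(frozenset({"ID"}), frozenset({"name", "salary", "dept_name"}))]
print(is_candidate_key({"ID"}, r, fds))          # True
print(is_candidate_key({"ID", "name"}, r, fds))  # False: superkey, not minimal
```

The one-attribute-smaller shortcut works because superkey-ness is monotone: if some smaller subset of K were a superkey, every superset of it within K would be one too.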
Again, by chance an instance may satisfy a functional dependency, but we will not say that it holds on the schema. Some more notation: there are what are called trivial functional dependencies, which hold on any schema. Now, what are these? They are basically those where the right-hand side is a subset of the left-hand side. Here are two examples: (ID, name) functionally determines ID, obviously; or name functionally determines name, obviously. These we do not really care about; they are not very useful. Now, let us see what we can do with functional dependencies. Given a pair of functional dependencies A → B and B → C, we can say: given an A, B is uniquely determined; given a B, C is uniquely determined; therefore, given an A, C must be uniquely determined. That is a transitive combination of the two (it is not itself the closure), and so we can determine that A functionally determines C. Now, we could perhaps make more inferences, and there are several ways in which we can make inferences using functional dependencies. We could just try everything out and attempt to prove whether it holds or not from first principles. What is the first principle? The basic definition: if the left-hand sides are equal, the right-hand sides are equal. Just using that definition, we can infer a number of things. But what we would like is a somewhat more systematic way of making these inferences, rather than just using the original definition. So, we are going to see how to do that more systematically. Conceptually, the set of all functional dependencies logically implied by a given set of functional dependencies — F here denotes a set of functional dependencies — is the closure of F. In this case, the closure of {A → B, B → C} is what? Well, ignoring the trivial ones, the closure contains A → B, B → C and A → C.
We cannot infer much else in this case. Well, that is not strictly speaking true: we can infer a number of trivial dependencies from this, and we can also infer a number of things which are not quite trivial. For example, if A determines B, we can also infer that CA determines CB. Is CA → CB trivial? It is not: the right side is not a subset of the left. What makes us believe that if A functionally determines B, then CA functionally determines CB? We know that if two tuples are equal on CA, the C's are obviously equal, and the B's are equal by the definition, because we are saying that A functionally determines B. So, CA determines CB. We can infer these kinds of things also, which are not trivial, though they are easily inferred from the existing ones. So, there are a number of things one can infer from a given set of functional dependencies, and the entire set is the closure. The closure can be very big: if you include the trivial ones, we are pretty much guaranteed it is exponential in size; even if you exclude the trivial ones, it can still be exponential. Exponential in what? In the number of attributes. F+ denotes the closure of F. Generally, we do not really want to compute the closure, as far as possible, at least manually; you can have a program do it, that is fine. You may say that even for a program exponential is bad, but it is kind of okay, because relation schemas tend not to be very big. How big are they? Five, ten attributes; 2 to the power 10 is not such a big deal, it is just a thousand. On the other hand, if you had a schema with 1,000 attributes, you would have a problem, but I have never seen such a big schema, except in situations where you have what are called flexible schemas, where you can introduce attributes as you please. There are some systems which allow flexible schemas.
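Since F+ itself is exponential, the practical tool — previewed here, since attribute closure is covered in detail later in the lecture — is the attribute closure: alpha → beta is in F+ exactly when beta is contained in alpha+. A minimal sketch:

```python
# Attribute closure: repeatedly add the right side of every FD whose left
# side is already covered, until nothing changes.

def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

fds = [(frozenset("A"), frozenset("B")), (frozenset("B"), frozenset("C"))]
print(closure({"A"}, fds))  # {'A', 'B', 'C'} -- so A -> C is implied
```

This sidesteps enumerating F+: to test one candidate dependency we do linear-ish work per pass over F, instead of generating an exponential set.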
But there, you are not going to worry about normalization, so that is not an issue. We will see in more detail how to do inferences using functional dependencies. But before we do that: how are we going to use this to define normal forms? Again, this is something many of you would know already. We say that a schema R is in BCNF, or Boyce-Codd normal form, with respect to a set F of functional dependencies if, for every functional dependency of the form alpha → beta in F+ — in the closure, not just in the original set — where of course both alpha and beta must be subsets of R, at least one of the following holds: either it is trivial (alpha is a superset of, or equal to, beta), or alpha is a super key for R. So, coming back to our earlier instructor-department example: is it in BCNF? It is not, because department name functionally determines building and budget, but is department name a super key for this relation? It is not; therefore, this schema is not in BCNF. So, BCNF prevents a certain kind of redundancy. Given such a schema, what do we do? How do we fix the schema? The standard way is decomposing, and the way we do it is: if you have a non-trivial dependency alpha → beta which shows that the schema violates BCNF, you break the schema up into two schemas. One is alpha ∪ beta; the other is R − (beta − alpha). So, we are removing beta from R, but we have to make sure we do not remove the attributes in alpha from R; they have to be common to both. In case alpha and beta overlap, which can happen, we remove the alpha attributes from beta and then remove the remaining ones from R. That is how we decompose. So, in our example, alpha is department name, beta is building and budget, and therefore we split the schema into (dept_name, building, budget) and (ID, name, salary, dept_name), which is the original instructor schema we started off with.
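The BCNF test and the decomposition step alpha ∪ beta, R − (beta − alpha) can be sketched directly. One caveat: the definition quantifies over F+, whereas the sketch below only scans the given FDs and uses attribute closure to test whether the left side is a superkey; that shortcut suffices for this example but is an assumption, not the full test.

```python
# BCNF violation check and one decomposition step, on the combined
# instructor-department relation from the lecture.

def closure(attrs, fds):
    # standard attribute-closure loop
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def bcnf_violation(r, fds):
    # return some non-trivial FD whose left side is not a superkey, if any
    for lhs, rhs in fds:
        if not rhs <= lhs and not r <= closure(lhs, fds):
            return lhs, rhs
    return None

def decompose(r, lhs, rhs):
    # split R into (alpha U beta) and (R - (beta - alpha))
    return lhs | rhs, r - (rhs - lhs)

r = {"ID", "name", "salary", "dept_name", "building", "budget"}
fds = [
    (frozenset({"ID"}), frozenset({"name", "salary", "dept_name"})),
    (frozenset({"dept_name"}), frozenset({"building", "budget"})),
]
v = bcnf_violation(r, fds)   # dept_name -> building, budget violates BCNF
print(decompose(r, *v))      # the department and instructor schemas
```

Note that ID → name, salary, dept_name is not a violation: the closure of ID reaches building and budget through dept_name, so ID is a superkey of the combined schema.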
Now, what if we use the functional dependency: department name functionally determines building, budget, department name? That is also a valid dependency, and it is not trivial. If we decompose using that, the step beta − alpha would remove department name from beta, and we would get the same result. So, that takes care of that. Now, there is another notion called dependency preservation, which we will come back to at the end, but basically the idea is this: if you have functional dependencies on one relation, they can be checked efficiently. If you have functional dependencies which span two relations in the decomposition, the only way to check them is to join the relations back. On the other hand, if the functional dependencies which hold on individual relations imply these — if you can logically infer the cross-relation ones from the ones which hold on single relations — then you do not have to check the ones which cross the relations in the decomposition. That is the notion of dependency preservation, which we will come back to later. But it motivates third normal form, which says: if I want to make sure the functional dependencies hold, then I am going to decompose in such a way that checking only the functional dependencies which hold on the individual relations in the decomposition will imply that all the original functional dependencies also hold. So, I do not have to check for those individually, because checking them may be expensive. That is the motivation for third normal form. And the definition of third normal form is very much like BCNF, except for one extra option. The first part is the same: for every dependency alpha → beta in F+, either it is trivial, or alpha is a super key, or — and this is the last one — every attribute A in beta − alpha is contained in a candidate key for R. We will see later why this is useful.
But without proof, I'll just state that the third condition is a minimal relaxation of BCNF to ensure dependency preservation. I'm saying here we will see why later; it's not in the abbreviated version, but it's there in the book. So, what are the goals? We want to decompose such that each schema is in a good form. Well, we want to check if it is in a good form, and if not, decompose. The decomposition has to be lossless join, and preferably the decomposition should be dependency preserving, but there is a trade-off between that and redundancy. We'll see that. Okay, so let's focus now on functional dependency theory. That was a quick introduction to what we want to do with functional dependencies. So, how do we compute the closure of a given set of functional dependencies? That's the first question. We've already seen what the closure is. So, there is a set of three rules called Armstrong's axioms which are actually sufficient: we just apply them and we will get the closure. Now, what are the three rules? The first rule is reflexivity: if beta is a subset of alpha, then alpha functionally determines beta. What does this rule do? It basically generates every trivial dependency, because beta subset of alpha is exactly what trivial means. So, this one rule generates all the trivial ones. The second rule is: if alpha functionally determines beta, then gamma alpha functionally determines gamma beta. This rule is called augmentation. Now, again, we can use the reflexivity rule to show just one trivial dependency, or we could use it to generate all trivial dependencies. Similarly, with augmentation, we could use the rule to augment one given dependency with one particular gamma, or we can generate all possible augmentations of alpha goes to beta. That depends on what we want to do with these rules. 
And finally, the transitivity rule, which we saw already: if alpha goes to beta and beta goes to gamma, then alpha goes to gamma. Now, it's easy to show that these rules are sound, that whatever inferences you make here are correct. How do we show that? We can show it from first principles: if the definition of functional dependency holds for the premises, then it will also hold for the conclusion. So, it's easy to show soundness. What is not easy is to show that these are complete. How do you know that these three rules will generate all possible functional dependencies in F plus? That's not trivial to prove. So, we won't try to prove it, but it is true. So, here's an example with these functional dependencies on this schema, and let's look at some inferences we can make. We have inferred A goes to H. How do we infer that? Yes, that was easy: A determines B and B determines H, so by transitivity we can get that. Now, here's the next one: AG determines I. How do we get this? From A determines C, we augment with G to get AG determines CG, and then we can use transitivity, from AG goes to CG plus CG goes to I, to derive AG goes to I. Any questions? Okay. There are many more inferences one can make; we will look at more of these in the tutorial session. But, if you are using Armstrong's axioms, those are just three axioms. They are enough, but it's often useful to use these extra rules, which can themselves be proved using Armstrong's axioms. And what are these? The union rule says: if alpha functionally determines beta and alpha functionally determines gamma, then alpha functionally determines beta gamma. So, we can union the right hand sides. And conversely, the decomposition rule: if alpha functionally determines beta gamma, then alpha must functionally determine beta and alpha must functionally determine gamma. How do we prove these? 
Again, we can prove them either from first principles using the definition of functional dependency, or we can use Armstrong's axioms to prove them. We'll again do some of this later. And finally, there is pseudo-transitivity. Well, not finally; there are many more such rules which we could create, but these are in practice the most useful ones. So, the last one is: if alpha functionally determines beta and gamma beta functionally determines delta, then alpha gamma functionally determines delta. Does this look familiar? We just saw an example of it: AG determines I. We could have directly used the pseudo-transitivity rule to derive AG goes to I, but we equivalently showed it in two steps. Then there's another notion. What we have seen so far is the closure of a set of functional dependencies. There's a different notion of closure, which is the closure of an attribute set. What is the closure of an attribute set? Given a set of attributes alpha, the closure of alpha under F, denoted by alpha plus, is the set of attributes that are functionally determined by alpha under the constraints F. So, here's an algorithm to compute it, and then we'll see an example also. How would we compute this? We initially set the result to alpha. Now, we are going to look at every functional dependency beta goes to gamma. If beta is a subset of the result, what can we say? What we have started off with is alpha as the initial set; it functionally determines alpha. Now, if we have a functional dependency whose left hand side is a subset of alpha, then we can clearly say that the right hand side attributes are also functionally determined. Do I have an example here? Let me do it by example and then come back to the algorithm. So, this is the same set of dependencies. I want to compute AG plus. Given AG as the initial set, I know that because A functionally determines B, I can infer that AG functionally determines B also. 
So, what is the first thing? Let me use the board. I will start with AG. Now, using A goes to B, I am going to add B to that set. Using A goes to C, I can add C to that set. Using CG goes to H, I can add H to that set. Again, using CG goes to I, I add I to that set. Using B goes to H, I would add H to that set, but it is already there. That is it; there is nothing more which can be added. So, AG plus is equal to ABCGHI. So, what can we say now about AG? It is a super key for this schema. Is it a candidate key? We have to check A plus, and if required, G plus also. What is A plus? It will be ABCH, and that is it; we cannot progress anymore. Where did you get I? No, I needs CG, and we do not have G here, so we cannot include I. So, we can be sure that A plus does not include everything, and we have to check every subset. What about G plus? Nothing; we cannot do anything, so G plus is just G. So, neither subset of AG is a super key; therefore, AG is a candidate key. So, attribute closure is actually a lot easier to deal with than using Armstrong's axioms. You can kind of do it blindly. So, it turns out that for many things, if you have to check something manually, attribute closure is an easier way to do it than using Armstrong's axioms. We will see that in a moment. So, this is the algorithm: result is alpha, and we keep doing this repeatedly: whenever you have a dependency whose left hand side is a subset of the result, add the right hand side to the result, till nothing more can be added. So, you have seen this already. This was the candidate key part which we already did on the board, and this was AG plus, which we have seen on the board. So, what do we do with attribute closure? There are many things we could do with it. One is what we just did: if you want to check if something is a super key, we take the closure. Similarly, we can check if it is a candidate key. 
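The board computation above can be done just as blindly by a program. Here is a sketch of the attribute-closure algorithm (my own encoding: attribute sets as frozensets, FDs as pairs of frozensets), run on the same example R(A, B, C, G, H, I) with F = {A→B, A→C, CG→H, CG→I, B→H}:

```python
def closure(alpha, fds):
    """Compute alpha+ under fds: repeatedly add the right hand side of any
    dependency whose left hand side is already contained in the result."""
    result = set(alpha)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

F = [(frozenset("A"), frozenset("B")),
     (frozenset("A"), frozenset("C")),
     (frozenset("CG"), frozenset("H")),
     (frozenset("CG"), frozenset("I")),
     (frozenset("B"), frozenset("H"))]

print(closure(frozenset("AG"), F))  # AG+ = {A,B,C,G,H,I}: AG is a super key
print(closure(frozenset("A"), F))   # A+  = {A,B,C,H}: not a super key
print(closure(frozenset("G"), F))   # G+  = {G}: so AG is a candidate key
```

Since neither A plus nor G plus covers all attributes, the two closure calls at the bottom confirm the candidate-key argument made on the board.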
We can also use it to check if a particular functional dependency holds. How do we do that? If you want to check if alpha functionally determines beta, it is actually very simple: we just take alpha plus and check if beta is contained in that. This is a very simple test for a functional dependency without actually computing the entire closure of F. If you want to compute F plus, there is a lot of stuff in F plus, so we really do not want to compute it; it is too big. Whereas this is a much cheaper way of checking if a particular functional dependency holds. Now, if you want to actually compute the closure of F, there are two ways of doing it. One is to use Armstrong's axioms and keep making inferences. The problem with that, if you are doing it manually, is that it is very confusing to know when there is nothing more you can do. You have to take multiple dependencies at a time, and you have to know which ones you have already considered. So, you can do it, but doing it manually is a little difficult. The simpler way of doing it manually is to take every subset gamma of R and find the closure of that subset. Then, for every subset S of that closure, we add a dependency gamma goes to S. Although this last step, adding a dependency for every subset of the closure, is often not very useful. So, if we are dealing with the closure, instead of computing the exact closure of F, sometimes it is enough if for every subset we compute its attribute closure and keep that; we can use that to check certain things. But if you want the actual closure of F, then for every S subset of gamma plus, we have to add the dependency gamma goes to S. That is the algorithm. Excuse me. Sir, in your book there is a concept like canonical cover. Yes, we are going to cover that. Here, we are using this to test for super keys; can we use the canonical cover concept for testing candidate keys? No, you cannot directly use the canonical cover for candidate keys. 
You can only use it to find a minimal set of functional dependencies, but that does not help you check if something is a candidate key. A candidate key test is actually very cheap: if you want to check one particular thing to see if it is a candidate key, it is cheap. If you want to enumerate all the candidate keys, then that is more expensive. Exactly, yes. Is there some particular reason it was dropped? No particular reason; we could have defined it formally, but we have used the concept. Right. Now, coming back to lossless join decomposition. Suppose you have a schema R which is R1 union R2; it should be union, because R1 and R2 may overlap, and normally they should overlap for a lossless join decomposition. So, if you decompose R into the projection on R1 and the projection on R2, that is a decomposition, and now if you take the decomposition and join it back, a lossless join decomposition requires that this join must be equal to the original R. Now, a sufficient condition for this is the following, which we briefly discussed at the beginning: R1 intersection R2, that is, the set of attributes common to the two, must either functionally determine R1, that is, be a super key for R1, or functionally determine R2. Either one is enough. Now, how do you show that this ensures a lossless join? That is actually fairly straightforward. You can show that you cannot get extra tuples: when you do the join, because this dependency holds on one side, any tuple on the other side will combine with only one tuple on this side. So, you will not get any extra tuples when you do the join, and therefore you can show that the joined relation will be exactly equal to the original relation. So, it is a lossless join. Now, is that a necessary condition? Are there situations where that functional dependency does not hold, but the decomposition is still lossless join? 
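Before that question is answered: the sufficient condition just stated is easy to check mechanically with attribute closure. Here is a sketch in Python (attribute sets as frozensets and FDs as pairs of frozensets are my own encoding), using the A→B, B→C example that comes up next:

```python
def closure(alpha, fds):
    """Attribute closure alpha+ under the given FDs."""
    result = set(alpha)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

def lossless_join(R1, R2, fds):
    """Sufficient condition: the attributes common to R1 and R2 must
    functionally determine all of R1 or all of R2."""
    common = R1 & R2
    cl = closure(common, fds)
    return R1 <= cl or R2 <= cl

# R(A, B, C) with F = {A -> B, B -> C}
F = [(frozenset("A"), frozenset("B")), (frozenset("B"), frozenset("C"))]
print(lossless_join(frozenset("AB"), frozenset("BC"), F))  # True: B -> BC
print(lossless_join(frozenset("AB"), frozenset("AC"), F))  # True: A -> AB
```

Both decompositions of ABC discussed next pass this test; as we will see, they differ on dependency preservation instead.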
Yes, there are such situations; it can happen with multivalued dependencies, for example, or even join dependencies, which we are not going to cover. Now, given this particular relation with this pair of functional dependencies, A goes to B and B goes to C, we can decompose it into AB and BC. Is this lossless join? Yes, because B is common and B determines BC. Now, there is this other notion of dependency preservation. So, what are the dependencies which hold individually? On R1, you have A goes to B. And I am not just looking at which of the given dependencies are on R1; by the dependencies which hold on R1, I mean everything in F plus that uses only attributes in R1. In this case, A goes to B is really the only one on R1. Similarly, on R2, B goes to C is the only one. Now, is this decomposition dependency preserving? Yes; in fact, A goes to B is preserved here and B goes to C is preserved there, so very trivially it preserves all the dependencies. However, there is another decomposition of ABC, which is AB and AC. Is this lossless join? Well, the common attribute is A and it functionally determines B; therefore, it is lossless join, because A determines AB. Is it dependency preserving? It is not. So, what are the dependencies here? A goes to B is preserved. Now, B goes to C is not present here, but from A goes to B and B goes to C, you can also infer A goes to C, so you can actually check A goes to C on this one. So, we can check if A goes to B holds and if A goes to C holds; those are the two we can check. From those, can we infer B goes to C? Let me write it here: we can check A functionally determines B and A functionally determines C. From these two, unfortunately, we cannot infer that B functionally determines C. So, just by checking these two, we cannot ensure that one. If they implied it, then it would be enough to check these two, and there would be no need to check that one explicitly; it would definitely be true. But this implication does not hold. 
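This preservation reasoning can also be tested mechanically without computing F plus. The lecture skips the efficient dependency-preservation test (it is in the book), so the following is a sketch of what I understand the standard textbook algorithm to be, not something from the slides: grow a result set from the left hand side by repeatedly closing it within each relation of the decomposition.

```python
def closure(alpha, fds):
    """Attribute closure alpha+ under the given FDs."""
    result = set(alpha)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

def preserved(alpha, beta, decomposition, fds):
    """Is alpha -> beta implied by the projections of F onto the
    decomposed relations?  (Textbook test; avoids computing F+.)"""
    result = set(alpha)
    while True:
        before = set(result)
        for Ri in decomposition:
            # close within Ri only: what Ri's local FDs can add
            result |= closure(frozenset(result & Ri), fds) & Ri
        if result == before:
            return beta <= result

F = [(frozenset("A"), frozenset("B")), (frozenset("B"), frozenset("C"))]
# (A,B), (B,C) preserves B -> C; (A,B), (A,C) does not
print(preserved(frozenset("B"), frozenset("C"),
                [frozenset("AB"), frozenset("BC")], F))  # True
print(preserved(frozenset("B"), frozenset("C"),
                [frozenset("AB"), frozenset("AC")], F))  # False
```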
Therefore, to see if the decomposition satisfies B functionally determines C, what do we have to do? The only way is to join it back and then, on the join result, check if B functionally determines C, which of course is expensive. So, more formally, we say that a decomposition is dependency preserving if the following holds, where F i is the set of dependencies in F plus that include only attributes in R i. So, given a decomposition into R1, R2, up to Rn, we define F1, F2, up to Fn like this; these are the ones which can be checked independently on R1, R2, up to Rn. If you take the closure of their union and it is equal to F plus, then we can be sure that the decomposition is dependency preserving. If it is not, it is not dependency preserving. Now, how do you check efficiently whether a particular decomposition is dependency preserving? Well, there are details in the book; we are going to skip it here. Now, let us come back to BCNF. What have we done so far? We have seen a few things we can do with functional dependencies: how to make inferences, Armstrong's axioms, attribute closure. Using these, we can now get back to BCNF. After BCNF, we will come back to the canonical cover and see how to use that to get a 3NF decomposition. So, we have already defined what BCNF is. How do you check if a particular alpha goes to beta causes a violation of BCNF? That is actually very easy: we compute alpha plus and see if it includes all the attributes of R, that is, alpha is a super key, or beta is a subset of alpha, that is, the dependency is trivial. One of the two should hold. Here we have said it is a non-trivial dependency, so for non-trivial dependencies, alpha plus must include everything in R. This is for a single dependency. Now, here is a simplified test for when you are given a schema R with a given set of functional dependencies on R. 
You are given that. How do you check if it satisfies BCNF? Now, if you look at the definition of BCNF, it says you have to check every dependency in F plus. However, I am going to claim that if we are given a set F of dependencies to start with, then it is enough to check every one of the dependencies in F without looking at F plus. If none of the dependencies in F causes a violation of BCNF, then the claim is that none of the dependencies in F plus will cause a violation of BCNF either. Again, I won't prove this formally. No, the catch is just coming up in the slide. So, if you are given a single schema, just one relation R with a set of functional dependencies, and all you have to do is check if it is in BCNF, then we don't have to compute F plus; we just have to check the given dependencies. But there is a catch: this simplified test cannot be used once you decompose R. What do we mean? Let's do it by example. Here is a relation R with attributes A, B, C, D, E and two functional dependencies: A goes to B and BC goes to D. This is what you are given. Now, if I want to check if R satisfies BCNF, what do I do? I take A plus. What is A plus? It is only AB. So, clearly this does not satisfy BCNF. Now, I decompose R using A goes to B into AB and the remaining ACDE. Is AB in BCNF? It is; in fact, it is kind of trivial: any binary relation will be in BCNF, because if there is a non-trivial functional dependency, its left hand side also becomes a super key. No, this one has two attributes, A and B. Is the left hand side a single attribute? Yes, I am just saying the same thing in a different way: if the relation is binary, then the left hand side will trivially be a super key. And the other one is ACDE. Is this in BCNF? Now, if you just look at these two functional dependencies, it may appear to be: A goes to B does not even apply here. 
BC goes to D does not even apply here either. So, you might think that this is in BCNF, if you apply the simplified test using only the given dependencies in F. But that is not the definition of BCNF; the definition says you must use F plus. Now, what is there in F plus? We have A goes to B and BC goes to D. If you use augmentation on A goes to B, you get AC goes to BC, and with BC goes to D, transitivity gives AC goes to D. And if you have AC goes to D, that clearly shows that ACDE is not in BCNF, because AC plus does not include E, so AC is not a super key. Therefore, this relation is not really in BCNF. So, although it looks tempting to use only the original dependencies, it is not correct; you should really use F plus. How do you check if a decomposition is in BCNF? We can of course do the following: check each R i for BCNF with respect to the restriction of F to R i, that is, all FDs in F plus that contain only attributes in R i. This is the definition: these are all the functional dependencies which hold on R i, and these are the ones which we will use to check for BCNF. Or there is a variant test, which is equivalent but easier to do manually. The check is the following: use the original set of dependencies F, but the test is different. What we do is, for every set of attributes alpha subset of R i, every subset, we check whether alpha plus, the closure of alpha with respect to what? With respect to F. Check whether the closure of alpha with respect to F either includes no attribute of R i minus alpha, in other words, with respect to R i the dependency would be trivial, or it includes all attributes of R i, that is, alpha is a super key. Now, this test is easy to do manually: you take a relation in the decomposition, take every subset of its attributes, compute its closure with respect to the original F, and check if this condition is true. So, this is very easy to do manually. 
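This variant subset-based test, and the decomposition loop the lecture turns to next, both fit in a short sketch (illustrative Python; relations as frozensets and FDs as pairs of frozensets are my own encoding). It uses the R(A, B, C, D, E) example with A→B and BC→D:

```python
from itertools import combinations

def closure(alpha, fds):
    """Attribute closure alpha+ under the given FDs."""
    result = set(alpha)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

def find_violation(Ri, fds):
    """Variant test: for every proper subset alpha of Ri, alpha+ (w.r.t.
    the ORIGINAL F) must either add no attribute of Ri, or cover all of
    Ri.  Returns a violating (alpha, beta) with alpha and beta disjoint,
    or None if Ri is in BCNF."""
    attrs = sorted(Ri)
    for r in range(1, len(attrs)):
        for combo in combinations(attrs, r):
            alpha = frozenset(combo)
            cl = closure(alpha, fds)
            beta = (cl - alpha) & Ri
            if beta and not Ri <= cl:
                return alpha, beta
    return None

def bcnf_decompose(R, fds):
    """Repeatedly split any relation that has a BCNF violation into
    (alpha union beta) and (Ri minus beta)."""
    result = [frozenset(R)]
    while True:
        for i, Ri in enumerate(result):
            v = find_violation(Ri, fds)
            if v:
                alpha, beta = v
                result[i:i + 1] = [alpha | beta, Ri - beta]
                break
        else:
            return result  # every relation is in BCNF

F = [(frozenset("A"), frozenset("B")), (frozenset("BC"), frozenset("D"))]
for s in bcnf_decompose("ABCDE", F):
    print("".join(sorted(s)))  # AB, ACD, ACE
```

On this example the test correctly flags ACDE (via AC goes to D, found from the closure of the subset AC) even though neither given dependency applies to it directly, which is exactly the point made above.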
Of course, it is going to take some time, but conceptually it is very easy. So, what we are doing is checking this for every subset alpha. If the condition is violated by some alpha, then we can show that alpha goes to (alpha plus minus alpha) intersect R i exists in F plus and holds on R i, and then R i violates BCNF. So, if the condition is violated, this particular functional dependency in fact shows the violation of BCNF, and we will use this dependency to decompose R i. Conversely, if no subset violates the condition, we can show that this particular relation is in BCNF. So, the BCNF decomposition algorithm is again fairly straightforward. We start with the single given relation, which we want to check for BCNF and decompose. This version computes F plus, but like I said, we can skip computing F plus by changing the test here. And now we keep doing the following. Result is initially a set of sets containing only one set, which is R. If there is any schema R i in result which is not in BCNF, and how do you check that? We can use the test we just saw. Then pick a non-trivial functional dependency alpha goes to beta which violates BCNF, that is, alpha goes to R i is not in F plus, so alpha is not a super key. For simplicity, we are going to only take cases where alpha intersection beta is the empty set; if it is not empty, we could remove the alpha attributes from the right hand side and use that dependency instead. Then pull out R i and insert in its place R i minus beta and alpha union beta, and keep doing this until nothing more needs to be decomposed. So, that is the basic BCNF decomposition algorithm. Excuse me, I have a question to ask you. If you look at normalization, there are various ways to describe these same things, right? Yes. So, you are taking the algorithmic route to describe all these things. 
So, as computer science people, we usually look at the algorithms. If you have any intention of implementing them, is there any work done in automating these processes? If you look at the algorithms we are using based on attribute closure, they are actually fairly easy to implement. In contrast, the algorithms using Armstrong's axioms are harder; they can be implemented, but it is more complex to implement them efficiently. Of course, all these are going to take exponential time, that is a given, but ignoring that aspect, implementing any of these algorithms based on attribute closure is actually quite easy. You can give it as a course project even. So, let us finish up my part of this session with this example: R(A, B, C) with A goes to B and B goes to C. Now, we know that this is not in BCNF, because B goes to C holds but B is not a super key, so we decompose into AB and BC. Now, are these two in BCNF? We can say so kind of trivially; like I said, any binary relation will be in BCNF, so we do not even have to look any further. Here is another example, which is a class relation with course ID, title, department name and everything thrown in. So, it is basically course joined with section. Now, if you look at this, course ID functionally determines title, department name and credits. This is also joined with the classroom relation: the section had a building and room number, which means we have joined it with the classroom relation, which gives us capacity. So, we know that building, room number determines capacity; that is clear. And finally, this is from section: course ID, section ID, semester, year uniquely determine where the section meets, that is, the building and room number, and also the time when it meets, the time slot ID. So, these are the dependencies which we get from the conceptual model: modeling a university, we come up with these functional dependencies. 
They do not come out of thin air; we know that in the real world these are the constraints which should hold, and from that we create this set of functional dependencies. Now, using these we are going to do the BCNF decomposition. You can also get a candidate key, but let us ignore that for the moment; we are focusing on decomposition. We can use course ID goes to title, department name, credits to show that this is not in BCNF, because course ID is not a super key. So, we decompose it into course and class 1, which has the remaining attributes plus course ID; we just removed title, department name and credits from there. Is this in BCNF? Again, we know it is not, because of these two dependencies, and we could use either. Also, how do we know that course is in BCNF? That is another issue: do we need to decompose course further? No, the only functional dependency that holds there is course ID determines the remaining attributes; there is no other functional dependency on it, and we can show that. So, course is in BCNF, but class 1 has to be broken up because of this. So, we get classroom, with building, room number, capacity, and then the other one is section. Now, is classroom in BCNF? Again, the only dependency here is building, room number determines capacity, no other dependency, so it is in BCNF. How about this one here? The only non-trivial dependency is that the combination of the four attributes course ID, section ID, semester, year determines building, room number and time slot ID. So, again, this is in BCNF, and we can stop. How do you find the functional dependencies for a relation? What if some are missing in, say, the business model, where you have missed defining some of these? That is a good question: how do you come up with these in the first place? If you are actually building a schema for an enterprise, the only way is to find out from people what the constraints are. 
Some are obvious: if I have a classroom and I want to model its capacity, I know that the same classroom cannot have two different capacities. So, some of these are kind of obvious; others may not be so obvious, but they are part of the organizational constraints. So, for example, should a section have only one teacher, or should it allow more than one teacher? Maybe if you ask somebody in the organization, they will say: all our courses so far have had only one teacher, so I think that constraint should hold. But then you should also be careful: so far all courses have had only one teacher, maybe, but do you expect that to always hold, or do you think the organization at some point may allow two people to teach a course? So, this is a call which you have to make: does this functional dependency hold or not, and who makes this call? Again, you will have to check with everybody; before accepting a functional dependency, you have to ask not only whether it holds today but whether you expect it to hold in the future also, or whether it may get changed in the future. If it may change in the future, do not use it. If you have to impose it, do it at the application level, not at the database level, because if you do it at the database level, that is, if you use it to decompose your schema, you are in trouble: later on you cannot go back, you cannot undo it. Actually, my comment is that if you do not have a robust SRS or business model in the beginning, everything will fail. Ideally you should have a proper SRS, but again, in practice what happens is that people do not put all the constraints into the SRS; people are lazy. So, in the end the designer kind of has to use the SRS, but also go back to the domain experts, and you also have to update the SRS. Otherwise the SRS says one thing and you did something else, and that is a problem. So, you have to go back and update both the SRS and your schema based on this kind of interaction with domain experts. Thank you.