So, I want to spend a few minutes on testing for 3NF. I am going to go a bit fast over this portion given we have only half an hour. The first part is that, given a particular schema, checking whether it is in 3NF is actually quite hard; it has been shown to be NP-hard. For that matter, checking whether a decomposition is in BCNF is also expensive. The initial algorithms for 3NF would do repeated decomposition: you find a functional dependency that shows a violation of 3NF and then decompose. Even if we forget about the NP-hardness and come up with some schema that way, the big problem is that the resultant schema could be this one, (A, B) and (A, C). What was the original set of dependencies? We had A determines B and B determines C; if you decompose (A, B, C) using this approach you can get (A, B) and (A, C), which is in 3NF but is not dependency preserving, so it is not of much use. This whole sub-area was revolutionized by a very nice algorithm from a researcher called Phil Bernstein, who is now at Microsoft Research. He came up with a really nice algorithm called the 3NF synthesis algorithm. He said: look, all these people are going the other way, taking a relation and decomposing it repeatedly. Our goal is dependency preservation, so let me take the given set of functional dependencies and generate a schema out of those. Supposing I create one relation for every functional dependency that I am given: is that a good set of relations? Is it in 3NF? Is it a lossless join? Those are the questions, and the answers are not obvious. What Bernstein's algorithm did is take the given set of functional dependencies and do a little bit of cleaning up on it.
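The failure of dependency preservation in the (A, B), (A, C) decomposition can be checked mechanically. Here is a minimal Python sketch of the textbook test (the function names and the set-based FD encoding are my own, not from the lecture): it grows the closure of the FD's left side one decomposed schema at a time, and the FD is preserved only if its right side ends up covered.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds; each FD is a (lhs, rhs) pair of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def preserved(fd, fds, schemas):
    """Textbook dependency-preservation test for fd = (alpha, beta)."""
    alpha, beta = fd
    result = set(alpha)
    while True:
        before = set(result)
        for R in schemas:
            # project what we know onto R, close it, keep the part inside R
            result |= closure(result & R, fds) & R
        if result == before:
            return beta <= result

fds = [({'A'}, {'B'}), ({'B'}, {'C'})]
schemas = [{'A', 'B'}, {'A', 'C'}]
print(preserved(({'B'}, {'C'}), fds, schemas))  # False: B -> C is lost
print(preserved(({'A'}, {'B'}), fds, schemas))  # True: A -> B survives
```

The check shows exactly what the lecture claims: B determines C cannot be tested inside either decomposed schema, so the decomposition is not dependency preserving.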
It removed unnecessary parts; it minimized the set in some sense, and from this minimal set it creates a set of relations. The minimal set is equivalent to the original set, the closure is the same, therefore it is very easy to see that the result is dependency preserving. It has also been shown to be in 3NF; showing that is harder and I will not even try. The nice part of the whole thing is that this algorithm runs in polynomial time, unlike the other algorithms, which are potentially exponential. In some sense it does not matter: relational schemas have what, 10, 15, 20 attributes, and exponential time on 20 attributes, 2 to the power 20, is about a million, which we can live with; but still, it is nice to have a polynomial algorithm. So, how does this work? The first step is to take the given set of functional dependencies and simplify it, removing unnecessary stuff. This is called the canonical cover. Earlier we did the closure, which infers everything possible from a given set; now the idea is the other way, we want to make the set smaller and smaller without losing any information. So if you take this set of three dependencies, A determines B, B determines C, A determines C, it is very clear we can delete the last one: if we delete it, we can still infer it from the other two, since transitivity clearly gives it to us. So let us delete it; we have made the set simpler. One potential step, then, is completely deleting functional dependencies that are clearly unnecessary because they can be inferred. However, there are other, partial redundancy problems. In the next example we have A determines B, B determines C, A determines CD. Now, A determines CD has as part of it A determines C, but we already saw that can be inferred from the first two. So we can actually delete C from the right hand side here and simplify it to A determines D, because I can always infer A determines CD back from the remaining three.
We can take A determines CD and split it in two using the decomposition rule, inferring A determines C and A determines D; then we delete A determines C because, as we saw, it is redundant, and we are left with A determines D. That is removing an attribute from the right hand side, and we have to ensure that we do not lose any dependencies; in this case it is clear. There is another interesting case: take this very similar set, except that the right-most dependency is AC determines D. It turns out we can simplify it to A determines D by deleting C from the left hand side, but this case is tricky. Earlier, if I deleted something from the right hand side, I was potentially weakening the dependency. Here, deleting something from the left hand side is making the claim that a stronger dependency holds, and therefore we can drop the weaker one. If I had both of these, I could say AC determines D is redundant because the stronger A determines D holds: from the stronger one I can use augmentation to add C on the left hand side (and on the right hand side) and then use decomposition to remove it from the right, so I can easily infer AC determines D. The question is, to go from here to here, can I infer A determines D from the given set? In fact I can. Why? From A determines B and B determines C, I get A determines C. So if I take A-plus here, it is actually A, B, C, D, which includes D, and from this I conclude that A functionally determines D. So I can add that to the set; I get four dependencies, these three plus A determines D, and then, because AC determines D is weaker, I can drop it. That is what I end up with. Conceptually, the idea is that I derive a stronger dependency and then drop one of the existing ones; but first I have to make sure I can derive the stronger one. Those are the steps in computing the canonical cover. So, there is a bunch of material here about extraneous attributes and how to determine whether an attribute is extraneous; the idea is that I will drop extraneous attributes.
So, the test is based on attribute closure. How do I do that? Take a dependency alpha determines beta and take an attribute a in alpha; I want to see if it is extraneous. It is extraneous if the stronger dependency obtained by dropping it already follows, that is, if F logically implies (alpha minus a) determines beta. In our earlier example, the attribute a corresponds to C: I want to see, if I delete C, whether what is left, A determines D, can be inferred; that is what I am checking. Extraneousness on the right hand side is checked differently: after removing the attribute and adding back the weakened dependency, we check whether the new set still logically implies the original one. Those are the two cases. In either case we have to ensure equivalence: the new set of functional dependencies has to be equivalent in the sense that the closure is the same in both cases. The opposite direction is easy, so we only have to check one direction. Now, at this point I am sure many people are totally lost; if you are, it is OK, this stuff is a little non-trivial. It is OK if you do not understand it fully from here on, but please try to understand whatever parts you can, then go back and read it and get a better understanding of the overall picture. So, let us do this by example. We have A determines C and AB determines C. Now, B is extraneous in the left hand side of AB determines C, because I can infer A determines C; in fact this is immediate, because A determines C is already there in the set. Therefore I can delete B, and I get a set which has two copies of A determines C; I drop the second copy and keep just one. So I minimize this set by keeping only one dependency.
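The two closure-based extraneousness tests just described can be sketched in a few lines of Python (the helper and function names are my own; FDs are encoded as (lhs, rhs) pairs of sets):

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds; each FD is a (lhs, rhs) pair of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def extraneous_in_lhs(a, alpha, beta, fds):
    """Is attribute a extraneous in the LHS of alpha -> beta?
    Test: does F imply the stronger (alpha - {a}) -> beta?"""
    return beta <= closure(alpha - {a}, fds)

def extraneous_in_rhs(a, alpha, beta, fds):
    """Is a extraneous in the RHS? Weaken the FD to alpha -> (beta - {a})
    and check whether a is still inferable from alpha."""
    weakened = [fd for fd in fds if fd != (alpha, beta)] + [(alpha, beta - {a})]
    return a in closure(alpha, weakened)

# The lecture's example: A -> B, B -> C, AC -> D; is C extraneous in AC -> D?
fds = [({'A'}, {'B'}), ({'B'}, {'C'}), ({'A', 'C'}, {'D'})]
print(extraneous_in_lhs('C', {'A', 'C'}, {'D'}, fds))  # True: A+ already contains D
```

Computing the closure of A alone gives A, B, C, D, so C really is extraneous on the left hand side, exactly as argued above.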
Now, here is another case: say we have A determines C and AB determines CD. My claim is that C is extraneous on the right hand side of AB determines CD. Why? Because if I drop it, what I am left with is AB determines D, and from A determines C together with AB determines D I can still infer it: if I compute AB-plus, it includes C and D, where C comes from A determines C and D comes from the dependency I have retained. So I can easily see that C is extraneous and I can drop it, and what I get is A determines C and AB determines D. The point of all this is to clean up the set of functional dependencies and remove unnecessary stuff. Now, this slide I am not going to cover in detail, but what it gives you is a very simple and efficient, polynomial-time algorithmic way to check for extraneousness. The intuition I gave you earlier is formalized here using attribute closure: we take each dependency in F, take an attribute in that dependency, and check whether it is extraneous using the two simple closure-based checks shown here. I will not go into all the details, but please read it afterwards; the checks tell us whether an attribute is extraneous, and if so we delete it. So, finally, a canonical cover F_c for a given set of dependencies F is a set which is logically equivalent to F: F implies everything in F_c and F_c implies everything in F; it has to be equivalent, that is our goal. Furthermore, no functional dependency in it contains any extraneous attribute; that is important because an extraneous attribute can cause redundancy in a schema. And finally, the left hand side of each functional dependency is unique; in other words, if I have two dependencies with the same LHS, I should merge them.
So, to compute this canonical cover it is actually fairly easy; we just do the following repeatedly. Any time I have two functional dependencies whose left hand sides are the same, I use the union rule to merge them: if I have alpha1 determines beta1 and alpha1 determines beta2, I merge them to get alpha1 determines beta1 beta2. The next step is, if I find any extraneous attribute in any functional dependency in the current set, I delete that extraneous attribute; and I keep repeating this until F does not change any more. That is it, a very simple procedure for computing the canonical cover. In the interest of time I am probably not going to work through it, since I want to have time for some questions, but there is an example here of computing the canonical cover by checking for extraneousness. And now the final 3NF decomposition; we call it a decomposition, but it is actually also called the 3NF synthesis algorithm, because it synthesizes a set of relations without going through a decomposition process, as you will see. The first step is to compute the canonical cover, as we just saw. Then we step through each functional dependency in the canonical cover. Initially there is no R_i, so for the very first dependency the containment check cannot fire, and we add it as a schema: if you have a functional dependency alpha determines beta in F_c, you add the schema (alpha, beta). In fact, another way to think about it is that for every functional dependency in F_c, which must be of the form something determines something, we create the relation (alpha, beta). Now, there is a step for removing redundant schemas; the idea is that some schemas may be contained in others. Coming back to our example, let me come to the whiteboard: suppose I have the schema (J, K, L) and two dependencies, JK determines L and L determines K.
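The canonical-cover loop just described, the union rule plus repeated deletion of extraneous attributes, can be sketched in Python. This is a simplified illustration under my own encoding (FDs as (lhs, rhs) pairs of sets), not production code:

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def canonical_cover(fds):
    """Sketch of the canonical-cover algorithm from the lecture."""
    fds = [(frozenset(l), set(r)) for l, r in fds]
    changed = True
    while changed:
        changed = False
        # Union rule: merge FDs that share a left-hand side.
        merged = {}
        for lhs, rhs in fds:
            if lhs in merged:
                changed = True
            merged.setdefault(lhs, set()).update(rhs)
        fds = list(merged.items())
        for i in range(len(fds)):
            lhs, rhs = fds[i]
            # LHS attribute a is extraneous if F implies (lhs - a) -> rhs.
            for a in list(lhs):
                if len(lhs) > 1 and rhs <= closure(lhs - {a}, fds):
                    fds[i] = (lhs - {a}, rhs)
                    lhs = lhs - {a}
                    changed = True
                    break
            # RHS attribute a is extraneous if it is still inferable
            # after weakening this FD to lhs -> (rhs - a).
            for a in list(rhs):
                weakened = fds[:i] + [(lhs, rhs - {a})] + fds[i + 1:]
                if a in closure(lhs, weakened):
                    rhs.discard(a)
                    changed = True
        fds = [(l, r) for (l, r) in fds if r]  # drop emptied FDs
    return fds

# The earlier example: A -> C, AB -> CD; C is extraneous on the RHS of AB -> CD.
print(canonical_cover([({'A'}, {'C'}), ({'A', 'B'}, {'C', 'D'})]))
```

On the example from a moment ago it returns A determines C and AB determines D, matching the hand derivation.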
In this case, what is the canonical cover? It turns out there is nothing extraneous here; none of the attributes is extraneous, and the canonical cover is the set itself. So the relations I would create from it are (J, K, L) and (L, K). Now, the redundancy removal step says: but (L, K) is already contained in (J, K, L), so I will not store it unnecessarily; I drop (L, K) and keep (J, K, L). Coming back to the slide, that is what these two steps do: they delete schemas which are contained in others. This one says that if none of the existing schemas contains the new schema, add it; in other words, if it is contained, do not add it. And this one says that if a now-redundant schema was added earlier, delete it. So we remove redundant schemas. There is one extra step in 3NF synthesis: if none of the schemas R_j contains a candidate key of R, create one extra relation which contains exactly one of the candidate keys for R. It can be any of the candidate keys, it does not matter which, but this step is important: if you do not do it, the join may not even give you back the original relation; it may miss some tuples or produce spurious ones, so it may be lossy in general. This last step of introducing a candidate key completes the algorithm. The intuition is very simple: I start with the given dependencies, clean them up to remove unnecessary stuff, create a schema from each, remove redundant relations, and add a candidate key back if none of the relations so far contains one.
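The synthesis steps, one schema per FD in the canonical cover, dropping contained schemas, and adding a key schema if needed, can be sketched as follows. To keep the sketch short I pass the candidate key in rather than computing it, which is an assumption of my own:

```python
def synthesize_3nf(canonical_fds, candidate_key):
    """Bernstein-style 3NF synthesis sketch: one schema per FD in the
    canonical cover, drop contained schemas, add a key schema if needed."""
    schemas = []
    for lhs, rhs in canonical_fds:
        R = set(lhs) | set(rhs)
        # Skip a schema that is contained in an existing one...
        if not any(R <= S for S in schemas):
            # ...and drop earlier schemas that this one contains.
            schemas = [S for S in schemas if not S <= R]
            schemas.append(R)
    # Lossless-join guarantee: ensure some schema contains a candidate key.
    if not any(set(candidate_key) <= S for S in schemas):
        schemas.append(set(candidate_key))
    return schemas

# Whiteboard example: canonical cover JK -> L, L -> K; JK is a candidate key.
print(synthesize_3nf([({'J', 'K'}, {'L'}), ({'L'}, {'K'})], {'J', 'K'}))
```

On the whiteboard example this produces just the single schema (J, K, L): the (L, K) schema is dropped because it is contained in (J, K, L), and no extra key relation is needed because (J, K, L) already contains the candidate key JK.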
If one of them already contains a candidate key, this step is skipped. So that is it; the basic idea is very simple, the details are a little confusing, but what do we get out of all this? Every schema we generate this way is in 3NF, the decomposition is dependency preserving, and it is lossless. All of this can be shown, and it is not too hard; I think the proofs are either in the book or in the supplementary material on the web. There is another example here, with customer ID, branch name, employee ID, type and so on; for lack of time I will not go into it. So, what we have done so far is BCNF and 3NF; how do they compare? It is always possible to decompose a relation into 3NF in a way that is both lossless and dependency preserving. It is always possible to decompose into BCNF in a lossless-join way, but not necessarily a dependency-preserving one. So what should we do? Given what SQL supports, maybe dependency preservation is not such a compelling goal, and BCNF may be fine: there are functional dependencies which you cannot enforce using SQL anyway, because SQL does not support declaring any functional dependency other than a super-key constraint. The dependencies that 3NF keeps and BCNF may lose are exactly the non-super-key ones, and those cannot be checked by SQL in any case. So why bother, just go with BCNF; that could be one option, but it is ultimately up to the designer. I want to take a short break here and take some questions, then I will wrap up with multivalued dependencies and a few other topics. Ramakrishna and Shute Coimbatore, do you have a question? How to find the candidate key for a very large database, is there any algorithm, sir?
Again, by "very large database", if you mean it has a lot of data, that is irrelevant: the candidate key definition is based on constraints, in particular functional dependencies, which you need to figure out from the application, not from the actual data. A functional dependency is something which should be true in the real world. If, in the real world, you want a person to belong to only one department, that is a functional dependency. You cannot infer this from the data per se; there are data mining algorithms which, by looking at real data, can suggest dependencies, but that can only be a first step, after which we have to make sure that the real world will actually respect them. So this is not based on the actual data, it is based on the rules of the world. If your college has a rule that a person can be in only one department, there is a functional dependency. If the rule changes there is a problem, and if the rule changes and your whole database schema design has to change because of it, we have a deep problem. So sometimes we may even let some things pass if we are not sure they will always be true. For this particular case, instructor and department, maybe we are not sure that an instructor has a single department; tomorrow that rule may change, in which case we will say that the department is determined by the instructor but not uniquely determined, and this leads to the notion of multivalued dependencies, which is the next slide, which I will come to. Mufah, Khamja college, Hyderabad, if you have a question please go ahead. Can we explain to the students normalization without explaining the concept of functional dependencies? The question is, can we explain the concepts of normalization without explaining functional dependencies? No. Normalization is based, at its heart, on functional dependencies; there are extensions, but without functional dependencies you absolutely cannot explain normalization, it is not possible.
Now, the thing is, if you look at functional dependencies, there are some complex pieces, like the canonical cover, which is a little complex and confusing. We did not need that in order to explain BCNF and 3NF; it came later. Similarly, algorithms for attribute closure and so on can come later. So the way we have organized it in the book is: start with some very basic concepts of functional dependencies, use them to explain the normal forms, and then go into detail on how to compute closures and test normal forms efficiently. Depending on your syllabus, you could stop at the first part, to give people an idea of what is there, or go on to the second part, which will enable them to actually use these ideas in the real world. Does that answer your question, or do you have a follow-up? This question is related to the trivial dependencies in the initial slides; you gave some example with a simple table, 1, 4, 7, 1, 4, 1, 3, 7, some small table, and explained what trivial means. Can you re-explain that concept? It is actually a very simple concept: a trivial functional dependency is one whose right hand side is a subset of the left hand side. Let me use the whiteboard and explain. Functional dependencies of the form, let us say, AB determines A, or for that matter A determines A, or any other variant, AB determines B, ABC determines BC, all of these are said to be trivial because they are going to hold regardless. What is the goal of a functional dependency? It should put a constraint on the real world: when I say that department functionally determines budget, I am saying a department cannot have two different budgets. That is a constraint on the real world. But dependencies such as these are the trivial ones; trivial means simply, obviously true. And why are they obvious?
Because if two tuples have the same value for A and B, of course they will have the same value for A: I have already said they agree on A, and the right hand side merely says they must then agree on A, which of course they do. There is no way to violate these functional dependencies; it is impossible to create a relation which violates any of them. They will always hold on any relation which has the appropriate attributes, and therefore they are called trivial dependencies: they do not really tell you anything new. If you have a follow-up question, go ahead. The question is: the logical schema that we obtain from an ER diagram is in which normal form? That is a good question. First of all, as long as you did the ER modeling properly, in the sense that you did not take a set of values and call it a single-valued attribute but correctly called it a multivalued attribute and so forth, the reduction to relational tables will respect first normal form. Now, beyond that, does it respect 3NF, BCNF and so forth? It is hard to say. What typically happens is that with binary relationships you will not get into trouble, but occasionally you can have a ternary relationship which causes trouble, or a bad design where an entity set has a number of attributes with functional dependencies amongst them, and that can lead to schemas which violate 3NF or BCNF. Otherwise, by and large, I would say BCNF would be satisfied, but it is not guaranteed. We saw an example with the department-advisor ternary relationship which violated BCNF. Did it preserve 3NF? I think in this case 3NF was preserved, but you would have redundant tables, in the sense that the same information is present in two places; there is redundancy across tables.
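The triviality test described above is literally a single subset comparison; a minimal sketch:

```python
def is_trivial(lhs, rhs):
    """A FD alpha -> beta is trivial iff beta is a subset of alpha."""
    return set(rhs) <= set(lhs)

print(is_trivial({'A', 'B'}, {'A'}))            # True: AB -> A always holds
print(is_trivial({'A', 'B', 'C'}, {'B', 'C'}))  # True: ABC -> BC always holds
print(is_trivial({'A'}, {'B'}))                 # False: this one constrains the data
```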
So, this is the example: department-advisor, and then instructor-department, where S is the student, I the instructor, and D the department. If you look at the bottom left, instructor-department was turned into one table and department-advisor was turned into another, and that table violated BCNF; and if you take the pair of tables, there is redundancy between them. So you might say that in this case the two tables satisfy 3NF, but there is redundancy across tables. This is actually something which is not obvious: if you start with the ER diagram and derive tables, you can still end up with redundancy. If you remember, I briefly mentioned a concept called aggregation in the ER modeling domain; that was partly meant to remove this issue. It does not do so fully, but instead of the ternary relationship, what we could have done, and this is actually not very meaningful for the current example, but at least in principle, is turn it into an aggregation: the student is related to a particular instructor-department pair. That was supposed to model this situation. But even this does not really help very much; the end result is that the same set of tables comes up even if you avoid the ternary relationship by using aggregation and binary relationships, and the redundancy could still occur. So we cannot guarantee anything more, which is why we say that even after doing ER modeling, you should take the resulting set of tables, look for the functional dependencies that hold, and check whether it is in BCNF. In this case you would realize that it violates BCNF and maybe decompose it further, or you might say it satisfies 3NF and keep it. Any follow-up questions? Those are good questions.
Yes sir, one of the uses of ER diagrams is to model the logical schema; does it suffice to go for UML class diagrams, or is it necessary to go for ER diagrams as well? That is also a good question: should we go with UML class diagrams and forget about ER, or go with ER? For modeling data, ER is certainly richer than UML class diagrams. UML class diagrams were really meant for modeling object-oriented programs: you model classes and so on. The ER model is richer; it has features those diagrams do not have. UML as a whole has many other things, but if your goal is database design, you should probably go with ER, not directly with UML class diagrams. The next topic is multivalued dependencies, illustrated by these two relations here. The first records the children of an instructor: we have ID, which is the instructor ID, and child name. The second records the phone numbers of an instructor; in case an instructor has more than one phone, you would land up with a relation like this. Both of these could have been generated from an ER diagram which had a multivalued attribute. So this is fine; we have got these. Now suppose, for some reason, we merge the two schemas into one: (ID, child_name, phone_number), and let us apply our functional dependency theory to it. Does ID functionally determine child name? It does not; an instructor can have two children. Does ID determine phone number? No; an instructor can have two phones. Does child name determine ID? No; two instructors can have children with the same name. Does phone number determine instructor? Not necessarily; they may share a phone. So essentially there are no meaningful functional dependencies on this schema, and with no functional dependencies, 3NF and BCNF are trivially satisfied.
There is no functional dependency other than trivial ones, and if you look at the definitions, they are then obviously satisfied. So it looks great: it is in BCNF, so it is a good schema. But is it? It is not. You see, there is repetition. You can have an instance like this where the phone number 1234 is repeated, this other phone number is repeated, this child's name is repeated and so forth. So you might say, why store things multiple times? Let us keep David with this phone number and William with the other phone number, and delete the two middle tuples. Then David occurs once, William occurs once, 1234 occurs once, 4321 occurs once. But this is adding spurious information. What spurious information? It seems to say that David is associated with the phone number 1234 while William is associated with 4321. That is wrong: the child name has no connection with the phone number. If you want to show that there is no such connection, you actually need those two tuples as well. So if you use this merged schema, it is a bad schema, and ordinary functional dependencies do not show that it is bad; we need something more. This is modeled by what is called a multivalued dependency. We would say that the instructor ID multivalued-determines phone number, and it also multivalued-determines child name. The intuition behind a multivalued dependency is that the connection between ID and phone number has nothing to do with the connection between ID and the rest of the attributes. The ID has a set of phone numbers, and there is no further connection between a phone number and, in this case, the child name. The phone number is not uniquely determined, but its connection to ID is independent of everything else. So we say that ID multivalued-determines phone number.
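The "missing tuples" condition just described is exactly the formal definition of the MVD: whenever two tuples agree on ID, swapping their child names (equivalently, their phone numbers) must yield tuples that are also in the relation. A small Python check, using my own encoding of the table as a set of (id, child, phone) triples:

```python
from itertools import product

def satisfies_mvd(rows):
    """Does ID ->> child_name hold on rows of (id, child, phone)?
    Whenever (id, c1, p1) and (id, c2, p2) are present,
    (id, c1, p2) must be present too (and symmetrically (id, c2, p1))."""
    rows = set(rows)
    for (i1, c1, p1), (i2, c2, p2) in product(rows, repeat=2):
        if i1 == i2 and (i1, c1, p2) not in rows:
            return False
    return True

full = {(99, 'David', '1234'), (99, 'David', '4321'),
        (99, 'William', '1234'), (99, 'William', '4321')}
pruned = {(99, 'David', '1234'), (99, 'William', '4321')}
print(satisfies_mvd(full))    # True: children and phones are independent
print(satisfies_mvd(pruned))  # False: deleting the "duplicate" tuples broke the MVD
```

The pruned instance is exactly the tempting but wrong "space-saving" version from the slide: it fails the check because it spuriously pairs each child with one phone.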
Similarly, we say ID multivalued-determines child name. Using this new notion, which is connected to functional dependencies but slightly different, and taking into account both functional and multivalued dependencies, we can define a new normal form called 4NF, which is stronger than BCNF: the BCNF conditions must be satisfied and, because of the multivalued dependencies, some extra conditions must be satisfied as well, and that gives us fourth normal form. In this particular case, 4NF would force us to decompose this schema back into the two we started with. We artificially combined them; 4NF says no, this is bad, these multivalued dependencies exist, and therefore the schema should be decomposed back into those schemas. I do not have time to go into the details; they are in the book. Now, a couple of other things: the overall database design process. I have already told you about universal relations, ER modeling and normalization, and I am going to skip those slides because I discussed them in detail. There are a couple of other design issues: there are many ways of coming up with bad schemas, and functional dependencies cover merely one of them. Here is something which I have seen people do more than once. Suppose I want to track the earnings of a company in each of many years; if I want to decide whether to purchase stock of a company, I need this kind of information. Now, here is a bad database designer who creates a number of relations: earnings_2004 with schema (company_id, earnings), earnings_2005 with (company_id, earnings), earnings_2006 and so on, one relation per year. Is this in BCNF? Yes: each is a binary relation, so it is trivially in BCNF. But is it a good schema? No, it is terrible: to add new information you have to add a new relation, not just extra tuples.
You have to create a new relation; each year you have to keep adding one. That is crazy, so that is bad. Another kind of crazy thing I have seen people do is a schema company_year with company ID, where the first column is earnings in 2004, the second column earnings in 2005, the third earnings in 2006 and so forth. Now each year you do not add a table, but you have to keep adding a new column, which is also crazy. So both of these are bad ideas for storing data. But it turns out this second layout, while a bad way to store data, is actually a good way to display it, and that is routinely done. If you look at the advertisements companies put out with a summary of their annual report, they have exactly this: earnings in this financial year, last year, the year before last, and so on. This is an example of what is called a cross-tab, where the values for one of the attributes, in this case year, become column names: the column names here are 2004, 2005 and so forth. This is nice for analyzing data, not good for storage, and many spreadsheets let you create a view which looks like this. In fact, SQL Server has an operation called pivot, which lets you create such a table. It is good for viewing, not for storing. The next slide talks about redundancy across relations, which I have already explained in detail on the whiteboard. The slide after that talks about denormalization for performance, which many people have been asking about; in the interest of time, I will skip that slide. And the last topic which I want to cover is modeling temporal data; I have been hinting at this and I just want to say a little bit about it, with just two slides. Temporal data is basically data which is valid at a point in time or for some interval of time.
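To illustrate the point that the cross-tab belongs in the view layer, here is a plain-Python sketch (column and company names are invented) that pivots a properly normalized (company_id, year, earnings) table into a per-year column layout on the fly, so the wide layout never needs to be stored:

```python
def crosstab(rows):
    """Pivot (company_id, year, earnings) rows into one row per company
    with one column per year -- good for display, not for storage."""
    years = sorted({year for _, year, _ in rows})
    table = {}
    for company, year, earnings in rows:
        table.setdefault(company, {})[year] = earnings
    header = ['company'] + [str(y) for y in years]
    # Missing years become None rather than forcing a schema change.
    body = [[c] + [table[c].get(y) for y in years] for c in sorted(table)]
    return header, body

rows = [('acme', 2004, 100), ('acme', 2005, 120), ('beta', 2004, 80)]
print(crosstab(rows))
```

A new year simply appears as a new column in the computed view, while the stored relation only gains tuples, which is exactly the separation the lecture recommends.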
So, let us say I tell you I am a professor at IIT. If you take the designation: when I joined IIT, I was not a professor; I got promoted after some time. So if you have an employee record with a designation, the designation "professor" was valid from, say, 2003; before that it was not valid, and when I retire it will again no longer be valid. This kind of record could have a valid time associated with it. Now, a snapshot is the value of the data at a particular point in time. I may keep track of the fact that I was an assistant professor in these years, an associate professor in these years, and a professor in these years; but if you take a particular point in time, normally only one of these will be valid. I would hold only one of the three designations, not two of them simultaneously. So how do you deal with all of this in the database design process? One way is to start with the ER model and add time somehow; there have been proposals, but no accepted standard. In the context of functional dependencies there has been a little more standardized work. Take this very simple thing: for an instructor, there is an ID and an address with two parts, a street and a city. Now, at any one time the institute may say: you can only give me one address; you may maintain two homes, but tell me the one place you want used as your address. Over time, though, this can change. So what we will say is that "ID functionally determines street, city" does not hold across time, but it does hold at a point in time, and we denote that using this symbol here: a Greek letter tau, for time, on top of the arrow, saying that this functional dependency holds at a point in time. That is, in any snapshot the functional dependency will hold, but across time it may not: the same X may be associated with different Y values over time.
So, there is some design theory based on that; I am not getting into the details, but let us just look at the practical aspects of how to deal with it. One thing many designers do is ignore time, do the whole design with ER modeling and so on, and come up with tables. Coming back here, let us take the example of course; it is a little easier to understand. A course has an ID and a title, and we say that course ID is a key: a course can only have one title. But the fact of life is that over time the same course ID gets used for different courses with different names; sometimes the name has just evolved, sometimes it is a completely different course. So I am going to replace this relation by (course_id, title, start, end), where start is the time from which the fact holds and end is the time at which it stops holding. Now I can say that course_id is a temporal primary key if at no point in time are there two tuples with the same course ID; what that means is that the start-end intervals for a given course ID do not overlap, so at any point in time there is only one tuple with a particular course ID. Now, how do you enforce such a constraint? There were proposals long ago; 20 years back, people proposed extending SQL with temporal constructs, and some standards were proposed, but nobody implemented them back then. Somehow people ignored it, even though it is a real practical problem. If you take this particular example, we had a problem at IIT: we kept the student grades with course IDs, and whenever a student wanted a transcript, we would join the takes relation with the course relation and generate a transcript with course titles. But there was a problem: after some time, the course with a given ID gets renamed.
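The temporal-primary-key condition, that no two tuples with the same course ID have overlapping validity intervals, is easy to check directly. A small Python sketch, assuming half-open [start, end) intervals, which is my own convention rather than anything mandated by the lecture:

```python
def is_temporal_key(rows):
    """rows: (course_id, title, start, end) with half-open [start, end).
    course_id is a temporal primary key iff intervals for the same id
    never overlap, so any snapshot has at most one tuple per id."""
    by_id = {}
    for cid, _, start, end in rows:
        by_id.setdefault(cid, []).append((start, end))
    for intervals in by_id.values():
        intervals.sort()
        for (s1, e1), (s2, e2) in zip(intervals, intervals[1:]):
            if s2 < e1:  # next interval starts before the previous one ends
                return False
    return True

rows = [('CS101', 'Intro to Programming', 1995, 2004),
        ('CS101', 'Intro to Computing', 2004, 9999)]
print(is_temporal_key(rows))  # True: the two titles never hold simultaneously
```

With a constraint like this enforced, the transcript problem described next goes away: the join can pick the title whose interval contains the semester the student actually took the course.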
So, if an old student comes back and says, please give me a transcript, I would generate a transcript with a completely different course name, which has nothing to do with what the student actually took. This is a big mess; it is a serious, real problem. So what we ended up doing is denormalizing: we create transcripts and store them. A stored transcript will never change; it does not depend on the active database, it is a separate version frozen forever. So we denormalize to deal with historical data, while for current data we store normalized relations. But if the database had supported temporal constraints, we could have avoided this and just had one instance of this schema. There are other issues too, such as foreign key constraints: should they refer to the current time, or to some other point in time? There are open issues there. So I just want to highlight that temporal issues are important, and with that I will close this section.