Okay, and a very good morning to all of you. Today we are going to cover database design in terms of relational normalization, having finished ER modeling. Today's coverage looks at relations which were created either by people arbitrarily writing down a schema, or by first going through the ER design process and then checking for issues. However, a lot of the theory we cover today can be used whether you start from ER models or from bare relations; it is theory which is required regardless. It takes a little work to understand this theory and figure out what is going on. Things may not be obvious the first time around, but it is certainly useful, even in the real world. Some of the examples we use may be a little abstract, and I have heard a few people complain: why don't you use real-world examples? We do have real-world examples, but you should note that real-world examples are more complex, so we tend to use simpler ones while demonstrating concepts, so that the concepts are easier to understand, and we separately use real-world examples as motivation. So there is some work involved in going through this, but it is worth it.

So let's get going. First of all, let's look at the motivation for normalization theory by looking at bad designs. The best way to learn how not to do something is to see what went wrong, study it, and then fix it, and in the process of studying what went wrong we try to generalize our observations. A person who treats database design as an art would look at a particular design and say: there is a problem here, and here is how to fix it. The next time you have a new design, you again go to this person, who stares at it for a while and says: oh, here is a problem, this is how to fix it. But that has not left you any clearer about what went wrong and how to fix it in general. A scientist will look at the same thing and say: here is how to fix this particular problem, but also here is how to generalize beyond this one special case.

In fact, a lot of research, whether in databases or any other systems area, tends to happen this way. People in industry or academia who have been building things realize there is a problem which seems more general, and rather than fixing it with a little smart hacking on one case and moving on, they look at the problem in more generality and see how to improve things overall so that other areas can benefit. That is a key aspect of how research gets done. A lot of what we study today was an area of intense research long back, in the 70s, soon after the relational model came up. People understood that there are issues in database design and came up with a whole body of theory to deal with them, and we are going to look at a few pieces of it. In fact, the body of work on the theory of normalization is enormous; it is way more than we could cover even in a full-fledged course. So we are just going to scratch the surface, covering the parts which are the most useful.

So, coming back, here is a design which I think any of us would recognize as a bad design. What is this design?
There is an ID, name, salary, department name; that looks familiar, those are the attributes of instructor. But we have also combined it with the department information, so that for each instructor we get the building in which their department is located and the budget of that department. It looks like maybe this design will make it easier to find out which building a person is located in, and, if you are interested in the budget, to look it up quickly. There may indeed be some performance benefit in some situations, but it comes at a cost, and in this case the cost is repetition of information.

So what is the repetition here? Clearly the same department name occurs multiple times, and we know that the department name identifies the building in which the department is located, regardless of which particular instructor we are talking about; the association between department and building does not depend on the instructor. In this case the department name uniquely determines the building, and it uniquely determines the budget. So if you look here, every time the same department occurs, say Physics here and again down here, the building, Watson, is repeated, and the budget, 70000, is repeated.

You may ask: what is the big deal in repeating data? It was a big deal once, when disks were very expensive, but today a 1-terabyte disk is cheap, and for most organizations, say IIT Bombay, the databases fit comfortably in a few gigabytes. Even including the photos of our students and employees, maybe 10 gigabytes; what is the big deal? Wasted space is almost irrelevant today. So what is the real problem? The real problem is updates. You may start off with a consistent relation, where all occurrences of Physics are associated with the same building and budget, but next year the budget of the Physics department goes up, so now you have to update it. The problem is that a programmer may see some other relation which contains the department name and the budget, the regular department relation, and update the budget over there. Now what about this relation here, which has the same information? Oops, they forgot to update it. The programmer did not realize that the information was replicated in this relation. Even within this one relation, the programmer might have written an update which updates one row but not another. So within the same relation you may have inconsistency. The moment you have repetition, there is scope for inconsistency. If you do not have repetition, then when an update happens we know it updates the only copy of the data, and therefore it is safe.
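To make the update anomaly concrete, here is a minimal sketch in Python (rows trimmed to the relevant attributes; the names and numbers are illustrative):

```python
# The combined inst_dept schema: department facts repeat once per instructor.
inst_dept = [
    {"ID": "22222", "name": "Einstein", "dept_name": "Physics", "building": "Watson", "budget": 70000},
    {"ID": "33456", "name": "Gold",     "dept_name": "Physics", "building": "Watson", "budget": 70000},
]

# Next year the Physics budget changes, but the update touches only one row:
for row in inst_dept:
    if row["dept_name"] == "Physics":
        row["budget"] = 80000
        break  # oops -- the second Physics row still carries the old budget

# The relation is now internally inconsistent: two budgets for one department.
print({row["budget"] for row in inst_dept if row["dept_name"] == "Physics"})
# prints both budgets, e.g. {70000, 80000}
```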
Now, redundancy is actually useful in certain situations. For example, almost all data storage systems today keep at least two copies of the same data on different disks, and the reason for this redundancy is that if one disk fails, the data is still safe on the other disk. So redundancy by itself may not be a bad thing, but you have to keep the redundant copies in sync. If the database system or the underlying disk subsystem is in charge of keeping the copies in sync, life is not so bad; it is not the job of the programmer, and the database or file system developer took care to make sure things are consistent. But if each programmer has to deal with this, then there is a problem; that is what causes real trouble with redundancy. So we saw the redundancy here.

Now here is another example, just like the previous one, where we combined two schemas, but it turns out here there really is no problem. What are the two schemas? We had a section entity set, which got course ID from its identifying entity set; it is a weak entity, and when you convert it to a relation its primary key is course ID, section ID, semester, year. (Actually this slide is a bit off: section ID alone does not uniquely identify a section, so to uniquely identify a section you end up needing all of these attributes. Course ID, section ID, semester and year together identify the section, and building and room number are where it meets.) That entity set also had a relationship with classroom, which basically says where the section meets. That section-to-classroom relationship, if you recall, did not have any attributes, so the only attributes of the relation we create from it are the primary keys of the two sides: the primary key for classroom is building and room number, and the primary key for section is the set of four attributes.

Now if we combine these two relations, what we get is the original section relation we looked at in the first place, before we got into ER modeling, which has course ID, section ID, semester, year, building and room number. So here too we have combined two schemas, but here there really is no problem; in fact, in our university schema we have actually combined them. So what is the difference? Why did we run into a problem in one case but not in the other? That is exactly what we seek to figure out, by trying to understand what the constraints on the data are, and we will see that in a moment.

Now, going the other way, you may say: we have a certain set of relations; what if we break them into smaller pieces? Basically, when you have a schema, you can combine relations or break them up, or first combine them and then break them up in a different way; there are many things you could do. We have seen a case where combining two relations causes redundancy and repetition, and a case where combining relations causes no problem. Now, how about the other way: if we take a relation we are given and break it into parts, is that going to cause a problem or not? Breaking a relation into parts is called decomposition. Most of the time we will look at binary decomposition, where we split a relation into two parts; in general, a decomposition may split a relation into multiple parts. The parts will generally have overlapping attributes, because otherwise the parts cannot be joined back in any meaningful way.

So now take this combined instructor and department schema. If you look at this relation, it has attributes department name, building and budget, and we know that department name uniquely identifies the building as well as the budget. So we have the property that, in that schema, if a department name repeats, it must have the same building and budget in each repetition.
In other words, in any legal instance of that relation, the department name uniquely identifies building and budget, and this is expressed as the functional dependency listed here. Now, this particular relation, as we saw before, has repetition, and in fact the original schemas could be obtained by simply decomposing it. So supposing a designer somehow started with this combined schema: what we would like is to figure out that we can decompose it into the other two schemas, and to show that that particular decomposition will not cause any problem.

So what problems could arise? Certain decompositions lose information, and they are called lossy decompositions. Here is an example of a lossy decomposition. This time we have an employee relation with ID, name, street, city and salary, and suppose we decompose it into two relations: one has ID and name, the other has name, street, city and salary. Is there anything wrong with this particular decomposition? Assuming names can occur multiple times, meaning multiple employees can have the same name, this decomposition is bad. Why is that? Here is a simple example: there are two different employees named Kim, with two different IDs, and their street, city and salary can of course differ. So this is the relation we started off with, the employee relation. Now suppose we decompose it into the two parts. What are the tuples in the two relations? If this was the original set of tuples, the decomposed relations would have these: one relation has (57766, Kim) and (98776, Kim); the other has Kim on Main street in Perryridge and Kim on North street in Hampton, with the associated salaries. So this is the state after decomposition.

Now, what is the common attribute here? It is name. So if we take these decomposed relations and join them back, it will be a natural join on the shared attribute, name. What do we get back? We have a slight problem: the ID 57766 Kim gets paired with both of these rows, and similarly 98776 Kim is also paired with both rows. So what has happened? We started with two tuples; when we joined back, we got four. Why do we call that lossy? It seems to be gaining tuples. Sure, it gained tuples, but what was lost was information. And what information have we lost? The information that 57766 lived on Main street in Perryridge whereas 98776 lived on North street in Hampton. If you look here, we do not know which address and salary 57766 is associated with; it appears with both, and similarly 98776 appears with both. Which is the correct one? We have no way to figure it out from this relation state. So a lossy decomposition loses information, even though it is easy to show that it can only gain tuples: every tuple which was in the original relation will exist after joining back, but there may be extra tuples. So we want to make sure that if we decompose a relation, the decomposition is lossless.
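Here is a minimal sketch of that lossy round trip in Python, with tuples as dicts (the salaries are made up; the rest follows the example):

```python
# Original employee relation: two different employees who share a name.
employee = [
    {"ID": "57766", "name": "Kim", "street": "Main",  "city": "Perryridge", "salary": 75000},
    {"ID": "98776", "name": "Kim", "street": "North", "city": "Hampton",    "salary": 67000},
]

# Decompose into (ID, name) and (name, street, city, salary).
r1 = [{k: t[k] for k in ("ID", "name")} for t in employee]
r2 = [{k: t[k] for k in ("name", "street", "city", "salary")} for t in employee]

# Natural join back on "name": every Kim in r1 pairs with every Kim in r2.
joined = [{**a, **b} for a in r1 for b in r2 if a["name"] == b["name"]]
print(len(employee), len(joined))  # 2 4 -- two spurious tuples appeared
```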
Here is another small example of a relation and its decomposition. In this case we had R(A, B, C), which we decompose into (A, B) and (B, C). If we join back, we get the tuples (alpha, 1, A) and (beta, 2, B), which is in fact the original relation. So on this particular instance, this particular decomposition is lossless. But can we generalize? Can we say that given any relation R we can decompose it like this? The answer is of course no: you can easily have instances of R where the decomposition would be lossy. So what we want is not losslessness for one particular instance; we want to be sure that any legal instance of a relation can be decomposed in a lossless fashion. The way we argue about the general case is to look at constraints on relations. A constraint is a property that a relation should satisfy: we say that a relation is legal if it satisfies those properties, and we assert that in the real world a particular relation satisfies those properties, so they are constraints on that relation. Then we can show that, under these constraints, a particular decomposition will be lossless, and we will see in a little bit how to do this.

If you observe, I am jumping a bit fast over several topics without getting into the details. I am going to get into the details, but my goal in the first part of the talk is to give an overview of normalization and the different things that go into normalization theory, so you get an overview before the details. The reason is that the details are fairly long and take time to understand, and there is a problem if you do not understand why you are doing it. Most of us as teachers kind of accept that if something is hard it could still be useful, because somebody else says it is useful. Unfortunately, this generation of students does not believe that; they do not believe us, and that can be good and it can be bad. Good, because some of the things we say are surely wrong: each generation comes up with new theories which show that the previous generation did things maybe not completely wrongly, but not in exactly the right way, and improves on how to do things; that is required. On the other hand, if we try to stuff a whole bunch of theory down their throats, many of them rebel and do not bother to pay attention. That is why we have put a whole bunch of motivating material up front before getting into details. The second reason is that certain courses, probably not this gathering, which are not for engineering degrees, may not want to give their students all the details; they want a high-level overview. In that case the first part of the chapter is enough to give an overview of the issues, and the details can be skipped.

Now, coming back: we mentioned early on in this course something about first normal form. We said that a domain is atomic if it cannot be broken up. More precisely, a domain is atomic if its elements are considered to be indivisible units. Note this word: considered. I am not saying they are indivisible units, because anything you are given can be divided up in some way or the other, and we will see how. The point is that in the database design and in the application you do not break values up in arbitrary ways, which can cause trouble. So what is a non-atomic domain? A set of names would certainly be a non-atomic domain. And what is a domain? We should be a little careful here: a domain is the set of values that an attribute can take, and a value of the domain is one particular value. What we want is that the values in the domain are atomic, that they are not broken up.
So if I have an attribute name, the domain of name is the set of all possible names, but an individual name is a single name and will not be broken up. If instead you treat name as a composite attribute with first name and last name, then it would be broken up, and it would not be atomic. Another example: an identification number for a course, like CS101. Is it atomic? You can actually take the first two letters and figure out that this course is from the computer science department. You can pull out the first digit, see that it starts with one, and guess that it is most probably a first-year course. These are all things which people have used in creating the identifier. However, what we do not want is for the database or the application to be doing such things, and there are good reasons for this. If the application starts decomposing the identifier into parts, then the only way to link up a course with its department is to first pull out the first two characters and then link it up. This causes problems: when I write a query, instead of directly joining a course with its department, I have to pull out the first two characters and equate them with the department code in the department relation. That makes writing queries more difficult, it makes building indices on these relations more difficult; everything gets complicated. So any such relationship should be explicit, in the form of attributes, not buried in the middle of an existing attribute. That is why we want values to be treated in an atomic fashion, as the sketch below illustrates.

A relation schema is in first normal form if the domains of all attributes are atomic. Non-atomic domains can complicate storage. One of the reasons that first normal form was stressed from the very beginning of relational database theory is that the previous generation of databases, the hierarchical and network data models, actually encouraged people to create sets of things. So you could say: here is the student, and the student has a set of courses they are registered for, and it could go further down: each of these courses may have a set of teachers, and so forth. It turned out that these kinds of set-valued structures made writing queries rather complicated. The queries tended to be what are called navigational queries: first follow this pointer, then go here, then do something, then come back up and do something else. So the queries were rather complicated to write, which is not a good thing. When we take such structures and convert them to first normal form, querying in general actually becomes easier, although there are certainly some queries which become more complicated. So we are going to assume first normal form. However, people did realize that certain applications would like set-valued attributes, and they went back and added these features in object-relational databases.
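Going back to the CS101 point for a moment, here is a sketch of the query problem, written as SQL strings (the table and column names are hypothetical): with the department buried inside course_id, every join has to dissect the string, and an ordinary index on the join column no longer helps.

```python
# Department buried inside the course identifier: the join must parse the string.
buried_join = """
    SELECT c.course_id, d.budget
    FROM course c JOIN department d
         ON substr(c.course_id, 1, 2) = d.dept_code
"""

# Relationship made explicit as its own attribute: a plain, indexable join.
explicit_join = """
    SELECT c.course_id, d.budget
    FROM course c JOIN department d
         ON c.dept_name = d.dept_name
"""
```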
Allowing non-atomic domains is like the goto construct, if you have heard of it. The goto was the basic way of writing loops and if-then-else; originally it is an assembly-level construct which was mapped into programming languages. When people started using it in high-level programs, the programs were often very hard to understand; they were called spaghetti code, because there were gotos all over the place and nobody could figure out what a particular piece of code was doing. So at that point there was a lot of noise about structured programming, and programmers were told: do not use goto; use for loops, while loops, if-then-else and so on. That made a lot of sense, and today it seems natural to us, but one generation of programmers had to be forced to give up the goto statement. On the other hand, there are actually a few cases where a goto is useful. So there were languages which completely banned the goto statement, but other languages took a pragmatic view and said: sometimes we need goto, so we will keep it in the language, but tell people not to use it unless they have to. I would say the same about non-atomic domains. Today's databases, including PostgreSQL, Oracle and others, do support set-valued attributes. However, use them with care: use them only if you know there is a good reason for not staying in first normal form.

As I said, atomicity is a property of how a value is used. In our database we use values like CS101; would we say the domain is not atomic because the value can be broken up? The answer is no. As long as we don't actually break it up, we just treat it as a string; we do not interpret it in the database or the application. A human might well interpret it, but that is the human's business, not part of the database design. In particular, if I want to associate a course with a department, the course table has a department name attribute; I do not pull out the first two characters of the course ID and use that.

As another example, at one time IIT Bombay used a roll number which had eight characters, and every pair of characters in it had a meaning; in fact, people used to extract two digits from a roll number to find out which department a person was in. This caused a serious problem. There was another digit which said whether an M.Tech student was a teaching assistant, or what we call self-sponsored, meaning they did not have a scholarship, or in certain other categories of scholarship; each of these was encoded in one particular digit. Programs would extract that digit from the roll number to figure out something about the person, and this caused a serious problem: the moment a person changed, let's say from non-scholarship to scholarship status, their roll number changed. Not only did that digit change; it turned out that if you just changed that digit, the remaining part might clash with an existing roll number, so the whole roll number could change, several digits could change. This caused havoc: I would be teaching a course and see a particular roll list at the beginning of the semester, and when it came time to submit grades at the end of the semester, certain roll numbers had vanished and certain roll numbers had mysteriously appeared. What do I do? I look at the name and match it. If there happened to be two students named, say, Sanjay Jain, both of whom changed status, I would be in the soup; I would not know which was which. So that is a very bad scheme, where the primary key itself changed because digits in the primary key were interpreted.
So I and others who understood database systems had to fight against an administration which insisted on encoding meaning in digits, and eventually we sort of won the battle. It is not completely won: when a student joins IIT, those digits are still set, but they are no longer interpreted by programs. Which means, of course, that humans still interpret those digits and think this person is in this program when actually the person is in some other program. It was a bad idea; we should not have done such things with the roll number, the primary key. In particular, we should not interpret pieces of a key and then regret it when we are forced to change a primary key because of that interpretation. So that was a very practical issue with first normal form, and it does come up many times, so you have to watch out for it.

Now, coming back, our goal in the major part of today's lecture is to establish the theory which lets us decide what is good and what is bad, and we will be in a situation where we start with relations which are too big and then decompose them. It turns out that life is not so easy if you started off with a set of relations which had accidentally been decomposed into smaller pieces: the theory does not tell you anything about how to undo a decomposition that was already done. So in some sense the theory is biased towards starting with relations which have not been decomposed at all, and then decomposing them; that is how we are going to study it. But occasionally the opposite problem arises. In fact, if you go back, the theory started off with the concept of what is called a universal relation, where they said: let's have one single table representing everything in the world. Of course that is a crazy way of doing things; nobody in their right mind would actually store all data in a single table. But the theoreticians said: let's go to that extreme, start with a single table which contains everything of interest, see what is wrong with that table, what repetition occurs, and then partition it. What that does is ensure that you never start off with something which is already decomposed into pieces. Unfortunately, even writing down this single mega-relation with all attributes is ridiculously confusing for programmers. So in real life we cannot do that; we end up with some design we started off with, and then we decompose, and there is no guarantee that the initial design was not already decomposed in a wrong way. What do I mean by decomposed in a wrong way? Well, we saw what a lossy decomposition is. If your original schema, before you normalized, was already decomposed in a lossy fashion, then you have problems: certain information cannot be represented in the database.

So now, coming back, we want to know when a relation is in a good form, meaning it does not have certain kinds of redundancy, and that is based on a theory of functional dependencies. We briefly and intuitively saw what a functional dependency is; we are going to see it formally now. There is also something called multivalued dependencies, and some more details are in the book; we will briefly mention them at the end. In fact there are other kinds of constraints too which theoreticians have looked at, with normalization theories which take those constraints into account.
In practice, though, these are the two most commonly used forms, and functional dependencies are by far the most commonly used, so we will focus our efforts on those. As I said, a functional dependency is a constraint on legal relations: it is a constraint that every legal relation must satisfy, and if a relation does not satisfy the constraint, it is illegal; it should not have occurred. It requires that the value for a certain set of attributes uniquely determines the value for another set of attributes, and it is in fact a generalization of the notion of superkey; we will see how.

We are going to use the capital letter R to denote the schema of a relation; we refer to the relation itself using a small r. What do we mean by the schema here? Earlier we said that a schema in SQL includes the relation name, types and other things; here we just treat the schema as the set of attributes. The constraints are also there, but they are separate from the schema. Now suppose we are given a relation r with a set of attributes R constituting its schema, and we use alpha and beta to denote subsets of the full set of attributes of the relation. Notice that we say subset-of-or-equal here, because alpha or beta might include all the attributes of R; it need not be a strict subset. We say that the functional dependency alpha → beta holds if and only if, for any legal relation r(R) — this notation, small r followed by capital R in parentheses, means that the relation r has the schema given by the attribute set R, and a legal relation is anything which satisfies the constraints — whenever any two tuples t1 and t2 agree on the attributes alpha, they must also agree on beta. That is, if there exist two tuples t1 and t2 in the relation such that their alpha values are the same, then their beta values must also be the same.

Here is an example on a relation r(A, B). If you notice, there are two tuples with the same value for A but different values for B. So we can be sure that this instance does not satisfy A → B. How about the other way: on this instance, does B functionally determine A? It does, kind of trivially, because there are no two tuples with the same B; the definition still holds, so B → A is satisfied on this instance. Note that this is for this particular instance, whereas we want to talk not just about a particular instance but about all legal instances; that is where functional dependencies are really useful.
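The definition translates almost directly into code. Here is a minimal sketch (a helper of my own, with instance values made up to match the slide's pattern): it reports whether a given instance satisfies alpha → beta by looking for two tuples that agree on alpha but disagree on beta.

```python
def satisfies_fd(rows, alpha, beta):
    """Check whether this instance satisfies alpha -> beta:
    any two tuples agreeing on alpha must agree on beta."""
    seen = {}
    for t in rows:
        key = tuple(t[a] for a in alpha)
        val = tuple(t[b] for b in beta)
        if key in seen and seen[key] != val:
            return False  # two tuples agree on alpha but differ on beta
        seen[key] = val
    return True

# An instance like the slide's r(A, B): A repeats with different B values.
r = [{"A": 1, "B": 4}, {"A": 1, "B": 5}, {"A": 3, "B": 7}]
print(satisfies_fd(r, ["A"], ["B"]))  # False: A does not determine B here
print(satisfies_fd(r, ["B"], ["A"]))  # True (trivially: no B value repeats)
```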
Now we can express the notion of superkey in terms of a functional dependency: a set of attributes K is a superkey for relation schema R (where R is the set of all attributes) if and only if K functionally determines R. That is, if two tuples agree on the attributes in K, then all the remaining attributes must be the same. It turns out this is not exactly the case when we have multisets in SQL: when you declare something as a primary key, duplicates cannot occur, whereas a multiset allows duplicates. However, in a multiset a functional dependency may still hold: K may still functionally determine R, that is, if two tuples in the multiset agree on K, they agree on all the remaining attributes too. That is fine, but it does not mean that K is a primary key in the multiset case, because there can be two copies of the same tuple. The normalization theory we are focusing on generally assumes that relations are sets; it does not generally deal with multisets, although you can carry out the same arguments in the multiset context. To keep life simple, we will also focus on sets. It turns out that most database schemas we create in SQL do have primary keys defined, and therefore those relations are sets, not multisets. So for schema design it is perfectly fine to deal with sets and not worry about multisets, even though SQL allows multisets.

We can also characterize candidate keys: K is a candidate key for R if and only if, first, K functionally determines R, so it is a superkey; and furthermore, no alpha which is a strict subset of K (note that it is a strict subset here, not subset-or-equal) satisfies alpha → R. In other words, K is minimal: you cannot remove anything from K while retaining its property of being a superkey. So we can express superkeys and candidate keys in terms of functional dependencies, but functional dependencies are more powerful than superkeys. Take the combined instructor-department relation we saw before; it has ID, name, salary, department name, building and budget. What is a superkey for this? ID is a superkey. Department name is not a superkey, because there are many tuples with the same department name. So there is no way, using superkeys, to express the property that department name uniquely identifies building and budget, but using functional dependencies we can. On this relation we say that department name functionally determines building, and in fact it also functionally determines budget; and we know that ID functionally determines name, salary and department name. We can even figure out that ID functionally determines building. But we are not going to assert that department name functionally determines salary, because that does not make sense: there can be two people in the same department with different salaries.

As I said, we sometimes use functional dependencies to check whether a particular relation instance satisfies them; that is the notion of satisfies. But the real use is as a constraint on the set of legal relations, in which case we say that F holds on a relation schema R if all legal relations on R satisfy the functional dependencies in the set F. Again, note the notation: we use capital F to denote a set of functional dependencies. A schema may have multiple functional dependencies, and F in general is a set of them. A particular instance may happen to satisfy a functional dependency, for example name may functionally determine ID on some instance, but there may be another instance where two people have the same name, and that instance will not satisfy it.
Therefore we cannot say that name → ID holds; name → ID does not hold on that relation schema. Now, there is a class of functional dependencies which are trivial, because they hold always. In particular, take a functional dependency alpha → beta where beta is a subset of alpha: clearly, if two tuples agree on alpha, and beta is a subset of alpha, they agree on beta also. It is obvious; it holds regardless of the schema; it will always hold. Such dependencies are called trivial; they do not add any new information. So (ID, name) → ID, or (ID, name) → name: no big deal; they are trivial.

Now some more notation. We will deal not only with the functional dependencies as given; from the given dependencies we can figure out that others must hold. As a very simple example, if A functionally determines B and B functionally determines C, then we can infer that A functionally determines C. How can we infer this? Think it through: if two tuples agree on A, then they agree on B; they have the same B value. Then, because of the second functional dependency, B → C, these two tuples also have the same C value. In other words, we have just proved that if two tuples have the same A value, they also have the same C value; so by the definition of functional dependency, A → C holds. We can infer such things and prove them from first principles based on the definition. We will see how to do this more systematically, but we can already define the set of all functional dependencies logically implied by F: we call it the closure of F, and we use the notation F+. So F is a set of functional dependencies, and F+ denotes its closure, all functional dependencies that can be inferred from F. Of course F itself is included, but there are potentially others, so F+ is clearly a superset of F.

Having introduced the notion of functional dependency more formally, we can define certain normal forms. First normal form we have already seen. People also defined something called second normal form, which is not of interest anymore, so we are skipping it. Then people defined third normal form, which we will actually see later. But it turns out that a more intuitive normal form, which is also more useful, is Boyce-Codd normal form (BCNF), and it is defined as follows. A relation schema R is in BCNF with respect to a set F of functional dependencies if, for all functional dependencies in F+ — that is, not only those in F, but everything in F+ — of the form alpha → beta, where alpha and beta are subsets of R (in general, functional dependencies may involve attributes of other relations too, not just the given relation R; we take F+ overall and consider those dependencies where alpha and beta are both contained in R, so that the dependency is on attributes of R alone), at least one of the following two holds: either it is a trivial dependency, that is, beta is a subset of alpha; or alpha is a superkey for R. Alpha being a superkey means that no two tuples can have the same value for alpha; if this holds, we can be sure that beta is not going to get unnecessarily replicated.
If you could have two tuples with the same alpha, they would have to have the same beta, and that is repetition of information; whereas if alpha is a superkey, that kind of repetition cannot occur. So, coming back to the combined instructor-department schema: the functional dependency department name → (building, budget) shows that it violates BCNF. Why? Is the dependency trivial? No: beta, which is (building, budget), is certainly not a subset of department name. Is alpha, department name, a superkey for R? No, it is not: we know there can be many instructors with the same department name. So, given this functional dependency, this schema does not satisfy BCNF.

So far so good: we saw intuitively that there is a problem, we formalized the constraints in terms of functional dependencies, and using the theory of functional dependencies we showed that, with constraints which we expect to hold in the real world, there will be repetition of information; the schema does not satisfy Boyce-Codd normal form. What do you do now, given that you found a problem? The answer is: you decompose. The solution to all problems in normalization theory is to decompose. So we will decompose R into parts. In particular, if we found a functional dependency alpha → beta which shows that BCNF is violated, we decompose R into two schemas: one contains alpha union beta, the other contains R minus (beta minus alpha). What is this beta minus alpha? The second schema should retain all attributes of alpha; what you remove from R are those attributes of beta which are not in alpha. This handles the case where a functional dependency repeats attributes on the left and the right; the theory holds even in that case. In our example there is no such repetition, so beta minus alpha is just (building, budget). The two schemas we get are alpha union beta, which is (department name, building, budget), and the schema where we remove building and budget, which is (ID, name, salary, department name). Note that alpha is common between the two, so when we join, the join will be on alpha. Note also that after this decomposition we have actually got back our original schema, so in this example we know things are okay.

In general, though, we have to be careful: how do we know that decomposing like this is not lossy, that it is lossless? In fact it has been shown — this is coming up a little later — that whenever you decompose on the basis of a functional dependency like this, the decomposition is guaranteed to be lossless. So this kind of decomposition is safe, and it ensures that the repetition is removed.
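Here is a minimal sketch of that single decomposition step in Python, with attribute sets as frozensets (the helper name is mine):

```python
def bcnf_decompose_step(R, alpha, beta):
    """Given a BCNF-violating FD alpha -> beta on schema R,
    split R into (alpha union beta) and (R - (beta - alpha))."""
    R1 = alpha | beta           # alpha union beta
    R2 = R - (beta - alpha)     # keep alpha so the two parts can be joined
    return R1, R2

R     = frozenset({"ID", "name", "salary", "dept_name", "building", "budget"})
alpha = frozenset({"dept_name"})
beta  = frozenset({"building", "budget"})

R1, R2 = bcnf_decompose_step(R, alpha, beta)
print(sorted(R1))  # ['budget', 'building', 'dept_name']   -> department
print(sorted(R2))  # ['ID', 'dept_name', 'name', 'salary'] -> instructor
```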
So far so good. If we decompose, we get rid of repetition, but avoiding repetition is not the only job of a database system. The database system also has to enforce integrity constraints. If you declared a functional dependency as an integrity constraint, then it ought to be the job of the database system to make sure that whenever an update happens, the functional dependency remains satisfied. As a matter of fact, SQL does not have support for functional dependencies other than superkey definitions, but the theory is still useful in general. The point is, if the database supported the notion of functional dependencies in full form — and SQL does partially, although not completely — then when you do an update, the database must enforce those constraints. SQL does enforce superkey declarations: primary key and unique declarations.

Now, suppose you have taken a particular relation and decomposed it. If you look at the individual relations and check whether they satisfy whatever functional dependencies hold on their attributes, that may not be enough: it may turn out that, even though each individual relation satisfies its local functional dependencies, an update causes a violation of a functional dependency that spans two or more relations. What we would like is to ensure that the overall set of functional dependencies is satisfied, while checking only the functional dependencies that hold on individual relations, because those are easier to check. In other words, we would like any functional dependency whose attributes span two relations to be satisfied automatically, as a consequence of satisfying whatever holds locally. We test locally and show that, if the local dependencies hold, the entire set of functional dependencies holds. That is what we would like; unfortunately, it is not what we always get.

We say that a decomposition is dependency preserving if, after the decomposition, checking only the functional dependencies that hold on the individual relations suffices to show that all the dependencies we started off with will hold. But not all decompositions are dependency preserving. In fact, we will see an example later showing that for certain relations there is no way to decompose into BCNF while ensuring dependency preservation, and that is the motivation for the normal form called third normal form. There is a trade-off. BCNF ensures no repetition, but certain functional dependencies then become more expensive to enforce, because they span relations. Alternatively, somebody may take the other side of the trade-off and say: I will keep those dependencies cheap to enforce, checking them within a single relation, and pay the price of allowing some repetition. Those are the trade-offs: either you reduce the cost of enforcement and allow repetition, or you disallow repetition but pay a higher price for enforcement. That trade-off is the trade-off between BCNF and 3NF.

So what is third normal form? The idea is to relax BCNF a little bit, just enough to guarantee the following: for any relation schema with any set of functional dependencies, there is a dependency-preserving decomposition. The relaxation is a very small change. Here is the definition: a relation schema is in third normal form if, for all alpha → beta in F+, at least one of these holds. Either alpha → beta is trivial — identical to BCNF — or alpha is a superkey — also identical to BCNF — or the third alternative, which is new:
each attribute A in beta minus alpha is contained in a candidate key for R. R can have many candidate keys, and each such attribute may be contained in a different candidate key, but that is okay; 3NF allows it. Now, this seems very unintuitive. Why on earth would you allow this? How does it matter whether an attribute in beta minus alpha is contained in a candidate key or not? It is not obvious why this works, but if you allow this much, then we can ensure dependency preservation, and we will see how. So third normal form is a minimal relaxation of BCNF to ensure dependency preservation. It is obvious that if a relation is in BCNF, then every functional dependency satisfies one of the first two conditions, so everything in BCNF is automatically in 3NF; but the converse does not hold. There are schemas in 3NF which are not in BCNF.

To summarize, the goal of normalization is to take a relation with a set of functional dependencies and decompose it appropriately. Now, who on earth gives you those functional dependencies? That is part of the design process, before you start on normalization: you have to look at what you are modeling and figure out what dependencies hold. Is there a scientific way of figuring that out? Not quite. It is a matter of what the rules of the business are, or what is intuitively obvious. In most cases you will find that a particular entity has a primary key, and that primary key must functionally determine every other attribute of the entity; that is the most common case. But there are other cases, and we will see some as we go along, where you really have to look at the enterprise being modeled and figure out what should be a functional dependency and what should not.

For example, if you recall, we said that a student takes a course in a semester, so the takes relation has multiple attributes: ID, course ID, section ID, semester, year, grade. If you just start from the ER design and convert to a set of tables, the primary key for takes would appear to be the student ID together with everything that identifies the section: ID, course ID, section ID, semester and year. But then we look more carefully at the enterprise being modeled, and we realize: wait a minute, it does not make sense for a student to be allowed to take two sections of the same course at the same time. Why would a student take two sections of the same course in different semesters? Maybe the student failed the first time and has to take it again. But you cannot take the course in parallel: in the same semester and year, you cannot take two different sections. Therefore, we would expect section ID to be uniquely determined by the remaining attributes. In other words, all of these attributes including section ID certainly form a superkey, but even if you dropped section ID it would still be a superkey, because you cannot have two tuples with the same person, course, semester and year in the takes table. Therefore section ID can be removed, and the minimal key is ID, course ID, semester and year, and this is what we chose as the primary key, even though starting from the ER model we would have included section ID in the primary key.
But if you look more closely you realize that it should not be there, because of an extra constraint which we did not quite capture in the ER model, and then we modify the design. This kind of analysis of the application is required to understand what the functional dependencies are. They are not obvious. In this case it is very easy not to notice the constraint at all, until somebody points out that you did not enforce it. And when does that happen? When a student enrolls twice for the same course, somebody finds out and says: hey, you should not have done that. The student says: the system did not prevent me from doing it. That is when you realize you did not model the constraint, and if you go back, you may end up changing the design a little because of it.

On the other hand, you can also make the mistake of enforcing constraints which perhaps you should not have enforced. What do I mean by that? The constraint may be correct: you do expect it to hold on the schema now, but it may not be a rule which holds forever. For example, maybe currently in the university each section is taught by only one instructor. Consider the teaches relation: ID, course ID, section ID, semester, year. Starting from the ER model, we would have decided that all of these form the primary key. But if the university has a constraint that a particular section can be taught by only one instructor — you cannot have two instructors teaching a section — then the primary key would not include ID; it would be just the remaining attributes. If the university changes its mind, you can always go back and add ID to the primary key, and things are okay.

However, suppose that instead of keeping the teaches relation separate, you actually folded it into section: I take section, with course ID, section ID, semester and year, and I throw the instructor ID in as well, along with the other attributes. That is, I took the relationship between a section and the instructor who teaches it, said each section has only one instructor, and used that as a functional dependency in my design. By the way, a many-to-one relationship is itself an instance of a functional dependency: if the teaches relationship from section to instructor is many to one, that is the same as saying that the primary key of the many side functionally determines the primary key of the one side; it is really the same thing. So if I take this functional dependency and use it to create a schema where the instructor ID is folded into section, then we have a problem. If tomorrow the university changes its policy and says two people can teach the same section, then you say: oops, what do I do now? I have written all my code assuming this. It is a massive change: going back and creating a teaches relation, which we removed when we folded it in, can be a major problem. Therefore, when you do normalization, do not use functional dependencies which you think may change, because that will prevent you from introducing the change without changing the whole schema.
So do not use such dependencies for the initial design, but you may still enforce them on the schema you create. For example, we keep the teaches relation but enforce the functional dependency — in this case it is actually a primary key constraint, namely that ID is not part of the primary key. We need not have created that relation, we could have folded it in, but we don't; we enforce the constraint, and tomorrow, if things change, we remove the constraint. That is easy. This is part of the art of database design: you have to check which functional dependencies will hold now and also in the future, versus those where, if you ask somebody in the administration, they will say yes, in our university this holds — a lot of staff just look at the current rules and treat them as God-given — and later they will say: oops, we told you that, but we changed our mind, and now the rule no longer holds. Database design must be done using rules which will continue to hold.

So, coming back: we are going to normalize so that each relation schema is in a good normal form, the decomposition is lossless, and preferably the decomposition is dependency preserving. For that we need the theory of functional dependencies, which we will study first before coming back to normalization. We have a set of functional dependencies, and, as I said, given a set of functional dependencies we can infer other functional dependencies. The closure of F denotes the set of all functional dependencies logically implied by the dependencies in F; that is F+. How do we find what other dependencies are implied by a given set? There are in fact three ways of doing this. One is completely from first principles: for every dependency you argue directly from the definition, the way we argued why, given A → B and B → C, A → C should hold. That is not at all systematic; you may miss things, you may make mistakes. Then there are two systematic ways. One is to use inference rules. There are many possible inference rules, but three of them were identified as fundamental, in the sense that they are correct, they are minimal, and yet they are complete. What does complete mean? Let us first see the rules, and then I will tell you what completeness means.

The Armstrong axioms are three rules. The first rule says: given any set of attributes beta which is a subset of alpha, we can infer that alpha functionally determines beta. This generates exactly the set of all trivial functional dependencies, and the rule is called reflexivity. The second rule says: given alpha → beta, if you add some set of attributes, denoted gamma, to both sides, you get gamma alpha → gamma beta; that is called augmentation. It should be easy to see why: if two tuples agree on gamma and alpha, then, because they agree on alpha, they must also agree on beta, by the given functional dependency.
So it is fairly easy to show that the augmentation rule is correct: if alpha → beta holds, then clearly gamma alpha → gamma beta also holds. The last rule is the transitivity rule, which we saw by example: if alpha → beta and beta → gamma, then alpha → gamma. Note that alpha, beta and gamma here can each be a set of attributes or a single attribute. I gave you a quick argument for why A → B and B → C implies A → C; the same argument holds here. So it is very easy to show that these three axioms are correct. What is harder to show, but actually holds, is completeness: if you apply these axioms repeatedly and make all possible inferences, then every functional dependency that can be inferred by any logical means can be inferred just by applying these axioms over and over until nothing new can be inferred. That is the notion of completeness; the axioms are sound and complete.

Here is a small example of the use of these axioms. We have a relation schema with attributes A, B, C, G, H, I and a given set of functional dependencies: A → B, A → C, CG → H, CG → I, and B → H. Let us infer a few new ones using the Armstrong axioms. Here is one: given A → B and B → H, we can apply transitivity to infer A → H. Here is one more, which is a little harder: AG → I. How do we infer this? Is there something simple we can apply directly? It turns out there is not, but we can do the following: we have A → C; we use augmentation on it to get AG → CG. Now, given AG → CG and CG → I, transitivity lets us infer AG → I. So this is an example where we needed two separate rules to make an inference; in general you may use multiple rules, or the same rule multiple times, to make more complex inferences, and you keep going until nothing more can be derived.

So how do you compute F+? You start with the set F containing only what is given. The first rule is kind of trivial: you can apply it and generate all trivial dependencies right away; that is a huge number — with n attributes, on the order of 2^n trivial dependencies are generated straight away, so F immediately blows up. On that set you repeatedly apply augmentation and transitivity until nothing more can be derived; when nothing new can be derived, you are done, and that is how you compute F+. Generally speaking, you can write a program to compute F+, but computing F+ manually is very tedious, because there are just so many functional dependencies; a thousand dependencies is nothing for a computer, but for a human it is really painful. So if we want to do things manually, we want to avoid generating F+, and we will see ways to do normalization without actually computing F+.
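One standard trick of that kind, which the quizzes below use informally, is to compute the closure of a set of attributes rather than the closure of F: which attributes does a given set determine? A minimal sketch, representing each FD as a pair of frozensets:

```python
def attribute_closure(attrs, fds):
    """Compute attrs+ under fds: all attributes functionally determined
    by attrs. fds is a list of (lhs, rhs) pairs of frozensets."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the whole left side is already determined, add the right side.
            if lhs <= closure and not rhs <= closure:
                closure |= rhs
                changed = True
    return closure

# The slide's example: R(A, B, C, G, H, I) with the five given FDs.
F = [(frozenset("A"),  frozenset("B")),
     (frozenset("A"),  frozenset("C")),
     (frozenset("CG"), frozenset("H")),
     (frozenset("CG"), frozenset("I")),
     (frozenset("B"),  frozenset("H"))]

print(sorted(attribute_closure({"A", "G"}, F)))
# ['A', 'B', 'C', 'G', 'H', 'I'] -- so AG -> I holds, matching the inference above
```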
Note that the conditions for BCNF and 3NF were stated in terms of F+, but we will see tricks which allow us to avoid computing F+ as far as possible. Now here is a quick quiz question; let us read it. Given the above dependencies, consider the dependency AB → B. The options are: given the above, we cannot infer AB → B; we can infer it using transitivity; we can infer it using reflexivity; or we can infer it using augmentation. To refresh your memory: reflexivity says that if beta is a subset of alpha, then alpha → beta; augmentation says alpha → beta implies gamma alpha → gamma beta; and transitivity we have already seen. So, which is the correct answer? B is a subset of AB, so this is inferred using the reflexivity rule; the right answer is option 3, or C.

Now here is one more quiz. Before the quiz, though, here are a few more rules, called the union, decomposition and pseudotransitivity rules. We call these additional rules because they can be inferred from the original set of rules; however, if you are manually computing a closure, you can use these rules directly and skip steps. What are these rules? The first one, union, says: if alpha → beta and alpha → gamma, then alpha → beta gamma. Arguing this from first principles is very easy, and showing it using Armstrong's axioms is not that hard. The second one is decomposition, which is the exact opposite: if alpha → beta gamma, then alpha → beta and alpha → gamma. It should be obvious: if alpha uniquely determines beta and gamma together, then it must also uniquely determine beta, and uniquely determine gamma; there is no doubt about it. Finally, pseudotransitivity says: given alpha → beta and gamma beta → delta, then alpha gamma → delta. If you look at the second inference we made on the previous slide, AG → I, which we did using A → C and CG → I: that was exactly an example of pseudotransitivity. We did it in two steps, but with this rule we can do it directly in one step. All of these can be derived from Armstrong's axioms.

Now let me explain the quiz question. Given a schema R with attributes A, B, C, D and dependencies A → B and B → C, which of the following is a candidate key for R: A, AC, AD, or ABD? How do you check whether something is a candidate key? First we check whether it is a superkey. Is A a superkey? What can we infer given A? We can certainly infer A → B, so B is included, and then C is included. But notice that there is no functional dependency with D on the right-hand side. (This was purposely thrown in to confuse you a little.) So it should be clear that the closure of A contains A, B and C, but will never contain D, because D is not on the right-hand side of any functional dependency. Therefore any correct answer must include D to make it a superkey. The remaining candidates are AD and ABD. Is ABD a superkey? Yes: given ABD, and B → C which is given, we get A, B, C, D. So it is a superkey. Is it minimal? It is not, because we already have A → B.
So we can drop B, and what we get is option 3, which is AD. What is the set of things determined by AD? We know A → B and B → C, so B and C are also determined by AD; the closure of AD is A, B, C, D, and therefore AD is a superkey. It is also minimal, because if you drop A or D, there is not much you can infer; you cannot infer the whole schema. So the answer is option 3. And this is a good time to break.
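As a follow-up, the quiz answer can be checked mechanically with the attribute_closure helper sketched earlier:

```python
# Schema R(A, B, C, D) with A -> B and B -> C, as in the quiz.
F = [(frozenset("A"), frozenset("B")),
     (frozenset("B"), frozenset("C"))]
R = {"A", "B", "C", "D"}

for candidate in ({"A"}, {"A", "D"}, {"A", "B", "D"}):
    closure = attribute_closure(candidate, F)
    print(sorted(candidate), closure >= R)
# ['A'] False          -- D is never on a right-hand side, so it is missing
# ['A', 'D'] True      -- a superkey, and minimal: the candidate key
# ['A', 'B', 'D'] True -- a superkey, but not minimal, since A -> B
```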