 So, today we are going to work on relational database design focusing on functional dependencies and normalization and also touching a little bit upon other kinds of dependencies, multivalued dependencies in particular we will touch upon briefly. We will also discuss the practical aspects of database design. We will cover the formal aspects through functional dependencies, but also some of the practical aspects will be touched upon later in the talk. So, the first part of today's session is fairly theoretical. Along the way I would be very happy to take questions. People usually find normalization somewhat confusing area and when I say people I include myself, it is fairly deep theory. Also people find it questionable area in the sense that if you do ER modeling properly, the question is why do you even need to go through all this normalization theory? Does it matter? Does it have any impact? And the answer to that question is partial yes. If you do your ER modeling perfectly, most likely you will not have very much to do later on when you do normalization. However, not everyone does ER modeling perfectly. There are likely to be glitches and this is what normalization theory will catch. Now, normalization theory started well before the ER modeling even showed up as a topic. ER modeling came quite late actually, quite late in the sense of maybe 7 or 8 years after the initial papers on the relational model. So, during this time and in fact later also it is not as if ER modeling caught on overnight and completely replaced everything else. It did catch on, but other approaches continued to be used. In particular, the theoreticians approach to this whole problem of normalization was to start with what is called the universal relation. That is think of all the attributes of all the things that we may want to model. Pretend there is a single mega relation containing everything, all the attributes. Of course, the attributes have to have unique names. So, that two different things do not have the same names. So, for example, if we think of courses and courses have an ID and we think of people, people have an ID, course ID and people ID should have different names. So, you start off with a mega relation where each attribute corresponds to some attribute of a real world entity could be physical, could be logical and there is one mega relation which is called the universal relation. And then you start from there and apply the principles of the theory to break it down into something which does not have problems. The first step today is to understand what are the problems that normalization theory even tries to solve. So, let us look at a few of these things. This may be a repetition for those of you who are already familiar with normalization theory, but I think it is good to refresh the minds of everybody before going deep into this theory. So, first of all, our ER schema, the university schema was developed using an ER model and then normalization theory was applied to make sure everything was ok. So, the designs that we have used so far are actually good designs. There is nothing really wrong with them. There are alternate designs which are also good much like when you write an SQL query, there are many ways of writing the query and you cannot always say that one way is better than the other. They are just alternative way, but there are also bad ways of doing things. And to show this, we are going to do something unnatural. We are going to take this schemas that we have and munch them around to create other schemas which illustrate what kind of problems can arise. Sometimes it is obvious, if you look at it, you would say why are you doing this? You are mixing together attributes of two different entities and that intuition is usually correct and it is formalized a little bit more using functional dependency theory. So, let us start by taking our instructor and department relation and for some reason if we combine the attributes into a single relation. So, what are the attributes there? If we combine instructor and department, we have for instructor ID name, salvi, department name and from department, we have department name building budget. Note that department name is the same thing in both the relations. It is not two different things. It is the name of the department. Therefore, it only appears one time in this particular combined schema. If you use the universal relation model, if you had instructors, you had departments, you just put together all the attributes. This is what you would get. These are the attributes of instructor and these are attributes of department. Now, what kind of problems arise? Supposing we assume that a department has a unique building and a unique budget. Note two different budgets for the same department and no two different buildings for the same department and we also assume this is not essential, but if we assume that an instructor has only one department. So, these things that these pieces of information are what helps us in the design. They are what helps show that there is a problem with this design. If any combination is possible, if you say that Einstein is in physics department, he is in building Watson and the budget for Einstein's physics department is 70000. But if you go down further, gold is also in the physics department, but the budget for gold could be something else, which does not really make sense in this particular example. But there are other examples where there is no such dependency. Here, we said that department determines the building and the budget, but if it does not, if the department plus the instructor together determine the building, then this design we would get is different. In fact, that is what happens in our CS department here. We were in one building. There was a school of information technology here in another building and then we merged and the two departments became one, but there was not enough space in this building. So, we continued to occupy another building. For other reasons, we also have offices in a third building and soon we are going to get a new building, which will temporarily cause the number of offices to go up to four buildings before it will come down to two when we vacate two other buildings. So, the functional dependency department to building, this is called a functional dependency, which says that department has a unique building does not actually hold for our CS department here at IIT Bombay. But supposing it does hold, you can see that there is repetition of information. Once you say department name is physics and the building is Watson. If physics appears again, the building had better be Watson. Otherwise, there is a mistake in your database. Similarly, if you say the budget is dependent only on the department, then the moment you say that physics has a budget 70,000, if there is another row here with physics and a different budget, there is a problem. So, the functional dependency says that the department here uniquely determines budget. So, the same department appears again anywhere in this relation, then the budget value had better be the same. But the problem with this is, with this particular design is A, it repeats the information. B, it allows for errors. It allows for physics to be associated with 70,000 here and down here, physics may have a budget of 80,000 and it will allow you to update the database and land up in this state. And then you do not know what is the right one. You know something was updated. You know whether this copy of physics was updated and this is the latest or this one was updated and it is the latest. You do not know. So, you have a problem. So, that was a case where a combined schema had a problem. Now, here are two other schemas which we actually got during the ER modeling phase. So, in the ER modeling phase, we had a section week entity and in order to create a relation for the week entity, we had to take the primary key of the identifying entity which is the course. So, we got course ID. Then we had section ID semester year. So, this was the relation we got. We also created a relation for the relationship sec class between section and classroom. Now, what are the attributes of sec class? It had the, this was a mistake. I am sorry, I should have updated this. I saw this mistake during the coordinator's workshop and did not fix it. I will fix it and put the updated slides later. The mistake here is that section class should have the primary key of section which is all of these attributes plus the primary key of the, sorry, the relationship is between a section and a classroom. So, the classroom has a primary key which is building and room number. So, sec class actually has all of these attributes. So, if we combine those two relations into one, this combined relation has all the following attributes, all the attributes of section and all the attributes of section class. In fact, in this particular case, section did not have any extra attributes. Section class would have been a superset and it would be this basically. Now, when we do this combination, there is really no repetition of information. Why? What do we mean by saying there is no repetition of information? So, is any attribute determined solely by something else? Is the room number determined by building? No, it is not. The building has many room numbers, but you know, it is not, the room number is not determined by building. Similarly, building is not determined by rule number and so forth. What about this combination? Course ID, section ID, semester year and then building and room number. So, can you have two tuples with the same value of course ID, section ID, semester year? If you could, then building and room number should be the same. So, if you could have two tuples, that would be repetition of information. But, if you remember our primary key constrained on section, the primary key was course ID, semester year. So, these four form a primary key. So, no two tuples can have exactly the same value for these four attributes. If they did, then their building and room number had better be the same because again, we are assuming in our real world model, a particular course section will meet an only one classroom. It will not meet in two different classrooms. So, under that assumption, these four attributes together form a primary key. If not, there would be, you know, two classrooms associated with the section and then, the primary key would have to include all the attributes here. So, that is for combining schema. So, if you combine schema, sometimes you have errors cropping up. But, there is another option which is to decompose schema. Supposing we started with this bad schema in step. Here we are, this is, we are calling this in step, which has instructor attributes plus department attributes. Now, what we said is that this schema is bad. But, supposing we decompose this schema into two things. On what basis do we do this decomposition? Now, we know that if we actually broke it up into instructor and department, the schema is good. But, that is because we started with that. If we just gave you a relation like this, on what basis do you break it up into multiple relations? And the basis is what is called a functional dependency, which basically says, take the intuition here. If there were a schema department name building budget, then the department name would be a candidate key. In other words, department name uniquely determines the budget and the building. That is a property which constraint we are assuming on the real world. If the real world represents this constraint, then we can make this statement. And based on that statement, we formalize this statement incidentally by saying there is a functional dependency from department name to building budget. So, the right arrow here says that the left hand side determines the right hand side uniquely. If you had two different tuples with the same value of department name, they better have the same value of building and budget, because the department has a unique building and budget. So, that is the idea. Now, given this dependency in this combined relation in step which we saw, department name is unfortunately not a candidate key. What this means is, we are allowed to have two different tuples with the same value of department name. If you did that, immediately building and budget are going to get repeated. So, this functional dependency shows us what is wrong with that original schema. The fact that there is such a dependency and at the same time department name is not a candidate key. In other words, it is not unique. It can appear more than once in that relation, then there is a problem. And how do you solve the problem? You decompose. So, the decomposition is what we are already familiar with, instructor and department. So, this decomposition was good. On the other hand, there can be bad decompositions which lead to problems. So, let us take this relation, slightly extended version of employee. In our version of employee, we had ID, name and salary. Let us throw in two extra piece of information, street and city. It does not really matter too much, but let us add this. Now, let us take this particular decomposition. One relation is employee one ID name. The other relation is name, street, city, salary. So, when we decompose, we are breaking things up, but we do not want to lose information when we decompose. And in fact, this particular example, if we assume that names can repeat. So, what is our dependency here? An employee ID is unique. And the employee ID determines the name, the street, city and salary. Given a particular employee ID, these are, there cannot be two different values for any of these attributes. They are uniquely determined. However, we are not saying anything about name. There could be duplicate names. That is realistic actually. So, this particular decomposition is actually what is called a lossy decomposition. It loses information. So, let us show this by means of an example here. So, here is an original relation. There are many rows. Let us focus on these two rows in the middle here. So, there are two employees with different IDs. One is 5776. The other one is 98776. Now, both of them are named Kim. Why Kim? Well, in Korea, apparently, half the people have a surname Kim. So, it is a very common name. People are likely to have the same name. Now, here we have just stored one name. But even if you stored first name, last name, in Korea, there would surely be a lot of clashes given that there are many, many, many people named Kim. In India, we have equivalent with Gupta and Shah and Desai and many other common surnames. There are many people with exactly the same name. I think in the down south, Srinivasan is a very common name. We had Suresh was a very common name. So, in my class, we had in engineering, we had, I think, maybe five Ramkumars, two of whom had the same initials. So, clearly clashes are possible. Now, those two Kims live in different places. So, their street and city may be different. Their salaries may be different. Now, this was the decomposition we decided on. ID name in one relation and name street, city, saloon, another. So, how do we create this decomposition? We take these two tuples and project them on ID name to get this. So, the contents of ID name include these two rows, 5, 7, 7, 6 Kim and the 9, 8, 7, 7, 6 Kim also over there. And similarly, if we project on name street, city, saloon, we get these two tuples. So far, so good. Now, if you want to combine this information and get the information back, we want to get the ID name street, city, saloon of these employees after the decomposition, how do we combine relations? We do a join. In particular here, because we did a decomposition, a natural join is the natural way to combine the information back. So, we do a natural join. What is a natural join? It makes sure that common attribute names have the same value. In this case, name is the common attribute. So, we match tuples, which have the same value for name. Now, unfortunately, both these tuples have the same name. Both these tuples also have the same name. So, what happens is this first tuple matches the first tuple here, which was intended. But the first tuple also matches the second tuple here and we get this extra tuple down here. And similarly, the second tuple matches both those. And what have we got back? We have got four tuples where we had two. Now, you might think, hey, there are more tuples. So, we actually have more information. But in fact, that is wrong. Just because there is, there are more tuples, does not mean the information is more. What has happened is we have lost the information that this particular ID lives in this street, this city and this salve. If you look here, this particular ID is associated with two different addresses and two different salve. So, we have lost the information that this ID is associated with actually just this one street city salve. So, although we have gained extra tuples, we have lost information. We did not lose tuples, we lost information. That is why this decomposition is called a lossy decomposition. In general, whenever you decompose and join back, you can only gain tuples. Any tuple that was there in the first place would be regenerated. The only effect is you get extra tuples. So, lossy in this context means you get extra tuples, but you lose information. So, that has been. Now, here is another small example where we are simplifying things. And in fact, we are going to use small examples like this in this chapter. Why? Because if you keep long attribute names, the examples become long, the text overflows into multiple lines. So, we are going to rename attributes as A, B, C, D, E, F, G and so forth. We are going to abstract it away. So, if you have a real world thing, we can just assign letters to the attribute names and then work with those letters. We are going to work on relation schema. So, the schema in the schema, the relation name will be, well actually the schema is the set of list of attributes in the relation. We will use capital letters. We sometimes use the notation R equal to A, B, C. Sometimes we just say R of A, B, C, meaning the schema R has the attribute A, B, C. Now, that is the schema. This is a particular instance of that relation. So, here we have alpha 1 A, beta 2 B as the two tuples. And if you project it on to A, B and B, C, what do we get? Alpha 1, beta 2 is the first one and on B, C you get 1 A and 2 B, so far so good. Now, if you join these back, what happens? You get again it is a natural join, the common attribute being B. So, you get alpha 1 matched with 1 A, because B is repeated, it is only shown once. You get alpha 1 A. Similarly, beta 2 B is the next tuple. So, what we got is exactly the same relation back. So, in this case the decomposition is lossless. But, can we say that this is a particular instance of R? So, supposing we take a particular state of the employee database as it is today and we do this decomposition and join it back and say everything is fine. Is that good enough? Well, let us come back here. In this case, the problem was shown because of two employees with the same name. But, supposing in our database, there were no two employees with the same name at the current point in time. This may change tomorrow. You may get another new employee with exactly the same name. For a long time, I was the only Sudarshan in the faculty at IIT Bombay. I mean, there were other Sudarshan before me, but when I joined there was no other Sudarshan. And a few years back, another Sudarshan joined, which of course cost a fair amount of havoc with letters meant for him or calls meant for me going to him or to me. There was a lot of mix up for a while until people realize there are two Sudarshan. Anyway, coming back, at this point it is fine, but the moment you get two rows with the same B, the decomposition is not lossless. So, what we want is some constraints which will be guaranteed to hold. It is not just that they hold on the state of the database now, but that they will hold on any valid state of the database. So, that is something which we are going to go into a lot of detail in just a few minutes from now. Now, before we get into that, there are some other things which we need to address. So, using this notion of functional dependency and certain other dependencies, database theoreticians starting from Codd himself. As I told you, Codd was the father of the relational data model in some sense. There were others before him who had proposed it, similar things, but he put it together, came up with a whole theory including the theory of functional dependencies and so forth. A lot of the stuff which we are going to study today was pioneered by Codd, thus added on to it later. So, now they defined a series of normal forms. They say that a relation is in a normal form if it satisfies certain property. And the normal forms were defined starting with some very simple properties which you absolutely want relations to satisfy. And then some more stringent properties which occasionally you might allow things to violate for efficiency reasons for some other reasons. So, the first normal form, this is the basic principle which Codd said that he wanted domains to be atomic. What does it mean for a domain to be? First of all, what is a domain? A domain is a set of values that a particular attribute can take. And we said that a domain is atomic if all the elements in that domain or all the specific values which are which an attribute is allowed to take are indivisible. So, what are examples of things which are atomic? Normally, you would think that a salary value is atomic. You cannot break a salary number into parts. Now, you can break a salary overall into parts, basic D A and so on. But if you keep a single number for the salary, you cannot break that up. Similarly, if you have maybe a city where the person resides, you cannot break the city up. Well of course, cities have been broken up, states have been broken up, countries are being broken up. But that aside in the database, you cannot break the name of the city into parts. On the other hand, what are things which are not atomic? Name in particular a set of names. So, supposing we had employee record and we keep a set of dependence as one of the attributes. So, it is a multi-valued attribute. So, if you sort a set of names of children, then it is indeed divisible. Each name in that set is a separate thing. It refers to a different child. So, it is not atomic. What about composite attributes? We had name which consists of first name, last name, middle initial and so forth. Is that atomic? Actually, it is not. It can be broken up into parts. So, in the relational model, when you create actual relations, you break it up. But this is actually not such a big deal. Composite attributes, you can just pretend, you know, it is sometimes useful to refer to them as one and sometimes useful to refer to them as part. So, we can think of having individual attributes, first name, last name, middle initial and then name is something which is a convenient way to refer to all three attributes put together. So, composite attributes are not really a problem in this context. But there is another more insidious problem which people often do not realize. If I have a course name like CS 101, if I can break it into CS and 101, then it is not really atomic. Why? What do I mean by if I can break it? Obviously, I can take the first two characters in the last three characters and I am breaking it. But the question is, is this break up meaningful? And in this case, if you say that CS refers to computer science department, then pulling out the first two characters of the course gives you back meaningful information. So, in some sense, this course ID is not atomic. It can be broken up. However, we do use course IDs like this and the point to note is even if it can be broken up as long as you are careful about not breaking it up and using the broken up parts in your application and in your queries and so forth, then it is ok. So, we treat it as inducible although human might break it up. In the database, we do not pull out two characters and do stuff with it. Now, that was not always the case in IIT Bombay. A while ago in our early databases, people said why store department separately with course ID? We will just pull out the first two characters of course ID and then interpret it as a department code. This is not really a good idea because at some point, we wanted to add course IDs which were three characters and then the code which broke it into two characters stopped working and there is a bit of a mess there. So, this is one example of why you should not be breaking it up. Now, if you had stored a relation separately which links a course with the department, just look up that relation. What is the big deal? Just one look up on that relation will give you the department of a particular course code. So, do not do string manipulation on it. Now, what is the problems with non-atomic values? Why was called against non-atomic values? There were several reasons. First of all, it complicates storage, it complicates querying, it complicates query optimization and so on. So, one of the reasons a relational model succeeded was because it was very, very simple. Because it was very, very simple, creating systems to store relational data was relatively easy. Because the thing was very simple, it was relatively easy to come up with query languages for it and because the query languages were also relatively simple, it was possible to build optimizers which could take a query which was written in the query language and then figure out how to do it efficiently. If this had not been the case, then query optimizers would have been terrible and people would not have used SQL. They would have gone back to writing code directly on some underlying API and not used SQL and then lost all the benefits of SQL. Now, this is actually, it is a really, really big deal. It is not obvious, but this is a big deal. You keep things simple, then you can get other benefits and this has been proven multiple times. So, the first time around that this was proven was when people said, but why should we stop with simple relations? Why not add features to relations? Let us add such value attributes. Let us add objects. Let us add inheritance and so forth. So, they added a lot of stuff. They created object-oriented databases. They added object-oriented features to SQL databases, but it did not take off. One of the major reasons it did not take off was once you added all this complexity, it was actually very easy to write queries which the optimizer really could not execute very efficiently. So, you write a query which seems like a logical query, something which should run reasonably efficiently, but because the language is complex, the optimizers could not do everything that they ought to do, could take a lot of effort to make them powerful enough, which was not done initially at least. So, the query would end up running very slowly and people will say, this system sucks. I do not want to use it. It is very, very slow. Let me stick with relational databases. So, many of these things, people put a huge amount of effort building them and then they really flopped at the marketplace because practitioners who tried using them found the performance was terrible. This history repeated itself with XML databases when they were first launched. Again, the performance was terrible. People tried it out and said, no, no, we are not going to use this. We will stick with relational databases. Later on, companies put a lot of effort into XML storage and databases and actually made them work quite well. However, by then people had burned their fingers and did not want to go back to it. So, it was too little, too late in some sense. So, the simplicity of the relational model played a huge role in its success and we should not forget this. So, that was with respect to query languages, with respect to storage also. If you have non-atomic values, how do you store it physically? Well, there are ways of doing it. In fact, if you think about it, we saw multivalued attributes in the ER model and we saw exactly how to convert it to relational model. We simply broke it up into a separate relation and stored multiple tuples. And this is eminently doable even if you start with the multivalued attribute in the relational model. You can break it up into a separate relation and store it. All that chord said was, do this directly and expose the underlying schema. So, we are going to assume that all relations are in first normal form. Now, here is another example. I talked of course, IDs, but a similar example repeated itself in IIT Bombay, where students are given roll number, which they do not look exactly like this. But essentially, somewhere in the roll number, there is initially a year followed by a department code. It is not CAC, it is some number followed by which program you are in MTech, BTech followed by if you are an MTech, are you supported by a TAship or are you finding yourself or something else and then a serial number within that. So, this was some old habits that our academic office had and they created a big fat roll number of 8 digits. And guess what happens? People do change branch. People do change MTech students who joined a self-finance. Sometimes, there are teaching us, then seats unfilled sometimes and we give it to them and they become teaching us. And guess what academic office did? They went and changed the roll numbers. Fantastic. So, at the beginning of the course, I have a list of roll numbers, I conduct exams, I give marks based on those roll numbers. And at the end of the semester, the roll number has changed. The academic office would give me a list of roll numbers for which I should give grades. And there are roll numbers here who never took the first few exams. Now, what on earth do I do? How do I know the mapping? It is madness. I mean, we complain loudly to academic office and eventually, we force them to stop changing roll numbers at least. But they did not give up the practice of encoding all these same things in the roll number, which is actually very bad in spite of me as a database person going and telling them, look, this is bad database design. They refused to listen. Even today, we have exactly the same scheme there. The only thing that they do not do is change the roll numbers in the middle. And then, we have all kinds of problems with people looking at a roll number and saying, oh, this person is in EE department, but actually this person is in computer science department because they had a branch change. Now, in the database, it is not a big deal. Just go do a look up. Why do you have to do all this kind of stuff? Anyway, so it is a very bad idea to encode these kinds of things, particularly in primary key because changing the primary key involves all kinds of updates. If you have to change a primary key, there are many things referring it. In the example I told you, the marks which I have for various exam are not in the database, but they are actually foreign keys referencing the roll number. And if you change the roll number underneath me, I am in trouble. The update to the roll number should be cascaded to all things that refer to that roll number. So, it becomes a potentially expensive update. So, this also brings up a general principle for choosing primary keys. You really do not want to choose primary keys using attributes which might change. The attributes might be sufficient to uniquely identify an individual, but if it is going to change, do not choose it as a primary key. For example, name and address would probably uniquely identify an individual. It is unlikely there are going to be two people of the same name and the same address. So, you could choose that as a primary key, but names change, addresses change and then it has cascading effects on all things that refer to it. So, as part of database design, choose the primary key wisely using things that are not going to change. So, coming back up to a theoretical side, we want a theory which does the following. It decides whether a particular relation is in good form. We saw the first normal form. That is a very basic thing. There are many more normal forms coming up. And if it is not a normal form, decompose it into a set of relations such that each relation is in a good form. So, I want to break a given relation into pieces. Again, this reflects the history where people said, let us start with a single relation containing all the attributes of interest from everything that we want to model and then break it up. So, we start with that mega relation and then break it up if you follow that theory as is. However, we generally do not follow that theory in that direction. We end up doing air modeling and land up with a set of small relations, but we can still apply the theory on the smaller relations which we get. So, our theory is based on functional dependency. So, I am going to spend a little bit of time on what are functional dependencies. Functional dependency is one kind of constraint on the set of legal relations. What do we mean? We mean that any legal relation, legal meaning it is allowed in the real world. Any legal relation must satisfy some properties and these are the constraints. And functional dependency is one kind of such constraint. There are many, many other constraints. We have already seen primary and foreign key. That turns out primary keys are really a special case of functional dependencies as we will see. And the functional dependency requires that the values for a certain set of attributes determines the value for another set of attributes uniquely and it generalizes the notion of a key. So, let us take an example. If I have a set of attributes capital R, we will use capital letters to denote sets of attributes and we will use Greek letters, alpha, beta and so on to denote subsets of these. Whenever I say subset, I generally mean subset of or equal to. So, as a special case alpha could be equal to R, beta could be equal to R. Now, we say that given a relation schema R, we will say the functional dependency alpha determines beta holds, holds is the key term here. If and only if for any legal relations R on R. So, what is this notation? R denotes a relation and instance on this schema capital R, any legal relation, anything which is legal in the real world. If any legal relation, whenever any two tuples in that relation have agree on alpha meaning they have the same value for alpha, then they also agree on beta that is they have the same value of beta. In other words, if you have two tuples T 1 and T 2, if T 1 alpha equal to T 2 alpha that is the projection of T 1 and alpha is exactly the same as the projection of T 2 and alpha, then the projection of T 1 on beta must be equal to the projection of T 2 and beta. Now, take this very small instance R A B with A being 1, 1, 3 and B correspondingly 4, 5, 7. Does A functionally determine B hold? It does not because we have these two tuples which have the same value for A, but they do not have the same value for B. Therefore, A determines B does not hold. Now, on this instance B determines A does hold, there are no two things to the same B. So, trivially you know this condition is never invoked and B determines A holds on this particular instance. That does not mean that it will hold on all legal relations, all we are saying is on this particular example instance the functional dependency holds. Now, in general what we want is functional dependencies on schemas which means that it holds on all legal relations. So, that is what we are going to focus on. Although we can say what functional dependencies hold on a particular instance that is usually irrelevant to us. So, we are from now on when we say a functional dependency holds, we are only going to think about or refer to functional dependencies that hold on the schema. We do not care about anything that holds on a particular instance, but may not hold on another instance. So, now we can define super keys easily using functional dependencies. K is a super key for relation schema R if and only if K function determines R. So, if two tuples have the same value of K, they must agree on all attributes of the schema. In other words K uniquely determines a tuple in R. Now, candidate keys can also be defined formally using functional dependency. So, we can say K is a candidate key for R if and only if first of all K functionally determines R that determines all the attributes in the schema. And moreover for no alpha subset of K, thus alpha functionally determine R. If it did then alpha would also be a super key and then K would not be minimal because there is subset of K which is also a super key. And therefore, if K is not minimal it cannot be a candidate key. In other words what this is saying is K is a candidate key. If it is a super key and it is minimal no subset of it is a super key. And what about primary keys? Well it is basically one of the candidate keys is chosen. How do we choose it? The theory does not define it, but as I told you we should choose things which will not change. Now, we saw in depth earlier and we saw some intuitive things about it which said department determines building and budget. We can formalize it using functional dependencies. So, for example, here we would have the dependency building sorry department name determines building that in the real world we are assuming a department has a unique building. We would also have department name determines budget it is not shown here, but it would hold. You would also say that ID determines building. Why? Is that obvious? Well we know that department determines building. We also have assumed in our model that an employee is in exactly one department. Therefore, we can say ID determines department. If ID determines department and department determines the building then you know given a particular individual they cannot be in two buildings. So, we can infer that ID determines building. So, what I have just shown you is an intuitive notion that if I am given two functional dependencies I can actually infer another functional dependency and we will see this formally coming up. However, we would not expect the following dependency to hold department name determines salve. Why? Because there may be two different people in the same department with two different salve. So, just because two records have the same department name in this in depth relation we cannot say they have the same salve. So, let me finish this slide and then we will have some time for questions after this. So, what is the use of functional dependencies? As I said we can test relations to see if they are legal and we will say that a relation are satisfies F if R is legal under a set of functional dependencies F. So, this by relation here we could be referring to relation instances, but the part which we are really going to use is that F holds on R if all legal relations on the schema satisfy this set of functional dependencies. So, this is formally there is a distinction between instant satisfying and holds, but since we really do not care about instances going forward sometimes we are loose with terminology. We will say satisfies or holds regardless of what we say we mean that holds really that is it is true of all legal relations plus bullet here we have already seen. So, I think this is a good point to take some questions before we go into the theory of functional dependencies. I am seeing a few questions on chat. Let me take a few of these questions and then I will come to live questions. One of the questions says first normal form says values must be atomic then how come normalization supports multi-valued attributes. Normalization does not support multi-valued attributes. The whole idea is you start with a relation in the first normal form and then apply this theory because a functional dependencies and so on do not really apply to multi-valued attributes. There is a notion of multi-valued dependency which we will look at later, but the formal theory of multi-valued dependencies is also done in terms of relations which are already in the first normal form. So, the basic assumption is that we have already got stuff to first normal form. How do we do it? We know how to turn multi-valued attributes into some flatten it, but if you want a universal relation model you actually just flatten it into one relation with all the attributes. So, from there we start the normalization process. The next question says the domains of all attributes atomic. Now, while designing tables in some real world application the data type is where care and thus allows us to store more than one telephone number in a single set thus atomicity is lost. Can we say that such a table is in first normal form? That is a good question. Once you have a wire care you can put anything in there, you can put comma separated lists and create sets and so on. The relational model is not able to do anything about that. However, if you see the operations which are there in the relational model there is no basic operation to split an attribute into parts. Now, an actual language like SQL for practical reasons allows you to do such splits, but the model itself does not allow you to do anything with those kinds of splitting. So, if you start interpreting the values your relation is not in first normal form. If you do not interpret it and break it up then it is in first normal form. So, in some sense being in first normal form is a property of how you use it not just of the raw data which is stored there. The last question I am taking off chat says can we combine two schemas if they have different sizes of tuples. What I think you mean is that the number of attributes in each tuple is different. Yeah, it does not really matter. We can combine the schemas. How does it matter? Now, let us take a few live questions. Sushila Danchan, Maharashtra, please go ahead. What is the key value and what is the key column and non key column in functional difference? So, the question is what is the key column and what is the non key column. So, a key is often used by many authors to mean a candidate key. A super key is something which is a super which uniquely determines and many books and papers when they say key they mean a candidate key. So, we just define what is a candidate key. It is a super key which is minimal. No sub part of it is a super key. So, that is what a key is and we can define a key as we saw in terms of the functional dependency. We have a CCOM engineering college West Bengal. CCOM, do you have any question? What is spurious tuple? So, I am not sure. We do not use that term in a book and some of these terms have different meanings depending on who uses it. Spurious generally means something which is not real. It comes in as an artifact of something else. So, for example, if you have decomposed incorrectly, if you join and then get some extra tuples in the result, that could be considered as spurious tuple in the join after a bad decomposition. That should not have been there, but because you made a mistake in decomposition, you get extra tuples which are spurious tuples. That is probably the context in which you saw it. Let us move on to theme college Maharashtra Boisar, theme college, go ahead please if you have a question. Sir, actually in first normal bomb, we do not require the multiple values. So, here we are giving the primary key. What do we do exactly the super key? So, I think the confusion here when we said multiple values, what we mean is multiple values within a single attribute. I think the confusion here is when I say multiple values, you may be thinking that you cannot have a super key having two attributes. No, that is not the point at all. The point of multiple values is in a single attribute, if you store a set, then there is a problem. It is not in first normal form, but there is no problem having a key which has three different attributes. So, of course, each attribute has its own value. So, there are three values, one per attribute. That is fine. That is not an issue at all. Does that answer your question? Sir, one more question. What do you buy determinant in functional dependency? So, there are again different ways in which this term determining is used. So, for example, you may look at the real world and determine, figure out that a particular functional dependency holds. You can also use the term in the sense that if you have a functional dependency, a determines b. You can say that value of a determines the value of b, meaning that if I am given a value of a, there cannot be two different values for b, they can only be one value of b. So, it is used in that sense also. So, maybe what I am seeing is, people are asking a lot of questions about things they want to know and I am going to cover those. So, maybe I should just continue with my next few slides and those will answer many of these questions. So, what we are going to do now is look at theory of functional dependencies to some extent. We are going to go deeper into it later. So, first of all, we would say that a functional dependency is trivial if it would be satisfied by all instance, all possible instances of a relation. Let us say that there is no constraint at all on the relation. Any value at all is possible. In spite of that, a class of functional dependencies called trivial functional dependencies will always be satisfied because the right hand side is a subset of the left hand side. So, this is an example. ID comma name determines name, functionally determines name. This is kind of obvious. If you take two tuples in any possible instance, if they have the same value for ID and name, obviously they must have the same value for ID, I mean it is trivial. Similarly, name determines name because if two tuples have the same value for name, of course, they will have the same value for name, it is obvious. So, in general we will say that the functional dependency alpha determines beta is trivial if beta is subset of or equal to alpha. So, why do we even care about this? Well, when we do various things, we will exclude trivial dependencies because in some sense they do not matter. They are obviously going to be true always. So, we will exclude them.