In fact, I am not going to follow the sequence of the slides that he has given; instead, looking at the interaction that you had, I will try to indicate a couple of practical examples: one of what we call denormalization, which is often required, and the other on the issues of including temporal aspects in your database design. There are slides; I do not know whether they are in the abbreviated version or not.

Let me very quickly mention the more important practical aspect, which is the third normal form. As a matter of fact, you will recall, because this is what I have seen in most courses on databases, we actually teach first normal form, second normal form, third normal form, and BCNF is often taught as something that comes after third normal form. That is often the sequence of teaching. Technically, what Sudarshan has emphasized is right: BCNF is the core normal form, and third normal form is actually a diluted version of it. I do not know whether that gets emphasized in our teaching or not. Ideally, we should get BCNF; if we cannot get BCNF, we settle for 3NF. Some of these slides indicate the situations under which we might be required to do this.

Why is third normal form important? I will skip these slides, because they essentially say that third normal form is a weaker form of BCNF, and to the extent that it is weaker, they indicate how it is weaker. Basically, if the BCNF decomposition is not dependency preserving and checking dependencies across relations is an efficiency problem, then you fall back to 3NF. A lossless join can still be guaranteed if the schema is in third normal form, and dependencies are preserved, but you might permit some redundancy in the process. So, here is an example of the third normal form. But rather than going through these slides, some of which will be covered during the tutorial problems, I will indicate one observation, which I think Prof. Sudarshan has also made. Canonical cover is an important theoretical concept. There are algorithms which will determine the canonical cover, and you can actually go through them: you compute the canonical cover and then drive the 3NF decomposition algorithm from it.

What I am going to very briefly describe is the practical way of doing things, which is actually to start with a good model, a rudimentary ER model. If your ER model is well designed, then invariably the schema that you get by translating those entities and associations will be in 3NF, and that is what is often done in practice. I have seen hundreds upon hundreds of cases of real-life database design. In hardly 5 to 10 percent of those cases have the designers bothered to actually apply the functional dependency algorithms and cross-check rigorously. Hardly 5 to 10 percent. That is the unfortunate consequence of theory not being practiced rigorously. However, most designs in real life do derive from a good model, so the emphasis is on modeling: taking care to identify all entities properly, and particularly to identify what looks like an attribute of an entity but should be an independent entity. A clear example in the case of the schema that we are discussing is department. You might inadvertently consider department to be an attribute of a teacher, for example, because the teacher belongs to a department.
But if you look at the fact that a department has a building, a department has a budget, and so on, then at the modeling stage itself you would recognize that department is a separate entity. Now, if you do that, you will have a model in which, say, this is my teacher entity with an ID and everything else, and somewhere here, when I come to department, I recognize that department is an entity. So I remove it and make department a separate entity. If I make department a separate entity, then I have a department ID, a department building, a department budget, and so on, and then I can set up a relationship between the two.

You will notice that this achieves what we were earlier trying to do through the mechanism of decomposition. Decomposition theoretically assumes that you have first assembled all the attributes of the schema into a single relation, and then you try to decompose it into multiple relations. In real life, that does not happen. In real life, you start with relations which represent different entity sets and different association sets, and when you assemble them into your schema, you will invariably find that you get a decent schema.

Take, for example, a student and a course relation. When you model course independently, when you model student independently, and when you do not involve a teacher anywhere here, then the basic entity course is properly modeled. There is no issue of an update anomaly, as you mentioned; you can update any field, because every non-key attribute is fully and non-transitively dependent upon the key attribute that you have identified, which is the crux of the third normal form, what we call non-transitive dependence. So if we have that, you will find that the table you get, call it takes or registers or whatever you name it, will have essentially the primary key from here, say the roll number, and the primary key from here, say the course ID. Now, when you get this, this relation is in third normal form, this relation is in third normal form, and this relation is in third normal form. This is the way people practically arrive at their schema. Very rarely do people test these things formally.

Now, what is our emphasis? Why do we want to do this? There are two aspects, according to me. This entire chapter on decomposition is important from a theoretical perspective; it is fundamental. People must understand why we are doing all this and, if we do not do it, what it implies. What does it imply if I cannot preserve a dependency? There could be significant consequences in actual operations. When I update things, when I insert things, when I make changes in my database, will they preserve the properties that were originally conceived to be valid for that entire data? It is from that point of view that you must do this.

Secondly, you would expect that at least some of your students who join organizations such as Infosys, TCS, and so on, when they do database design, are aware that in addition to the mundane mapping of your ER model into a schema, something extra should be done, something extra should be tested. You will invariably find such testing being done in the top-notch organizations when they design a schema. TCS will do that, Infosys will do that, Wipro will do that, HCL will do that; but there are hundreds upon hundreds of other organizations, and many of our students are currently doing good jobs in them. Should we not expect them, A, to be fundamentally aware of what these things are?
B, to be aware of what the algorithms are, in case they wish to apply them? And C, with that knowledge, perhaps some people will proactively do it, and some others will do it because the organization demands it. That is the real-life story.

It is in this context that, rather than the closure of the set of functional dependencies, the attribute closure appears to me to be the important point, because you can apply that logic practically, examining smaller relations much more easily, rather than finding the cover. There are, by the way, database design tools where you can input functional dependencies and determine the canonical cover and so on. When you have 10 or 20 attributes, that is trivial, but when you have 200 attributes, it is very difficult. In general, in real life, in large applications, you will always begin with a set of different relations. You will never have a single relation containing all 10,000 attributes that the corporate world defines. So you will have several independent relations, and it is possible to apply these algorithms to those independent relations. My suggestion is: emphasize to the students, indicate to them the algorithm for the functional dependency closure as well as the attribute closure, and suggest that, using the attribute closure, they can practically and quickly determine whether something is preserved or not. I will show a small sketch of the attribute-closure algorithm at the end of this discussion.

So, I will just say this: these things will be covered in a tutorial problem. You have a 3NF decomposition problem in the tutorial? Yes? Okay, so we will take it up there.

This is an important slide. The goal for relational database design is ideally BCNF, a lossless join, and dependency preservation. If we cannot achieve all of these, then we may either give up dependency preservation or accept some redundancy due to the use of 3NF, and the real world goes for the second option: I will use 3NF, I still preserve dependencies, but I do not mind some redundancy.

Now, the dilemma is that SQL per se does not have any constructs which will guarantee any one of these things. SQL faithfully implements whatever constraints you tell it to implement; the constraints that you define will be enforced. That is another reason why, when I intervened earlier, I said that one should be very careful in defining constraints at the SQL data definition level. This is in contrast to what I said on the first day, that we must implement all constraints as database definition constraints and not implement them through PL/SQL or something else, because that is a feature SQL now provides. However, which constraints you decide to define at the database level should be chosen carefully. All constraints which are not only valid today for the organization, but which are likely to remain valid for a long time to come, should definitely be implemented as database constraint definitions. For all other constraints, there should be a considered view on what to take.

In fact, in many of the course projects, which are team projects, we ask the students to do something extra. Whenever they take a design decision, in the final report only the decision would normally appear: this is my schema, and so on. What we ask them is: what alternatives did you consider, and what is the reason for rejecting those alternatives? That is where the application of mind happens. It is not that you will get, in one shot, just the final result which is submitted.
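Since I am recommending the attribute closure as the practical tool, here is a minimal sketch of the attribute-closure algorithm in Python. The relation and the functional dependencies in it are illustrative assumptions based on the teacher/department discussion above, not taken from any particular slide.

```python
# A minimal sketch of the attribute-closure algorithm.
# The attribute names and FDs below are illustrative assumptions.

def attribute_closure(attrs, fds):
    """Compute the closure of a set of attributes under a set of FDs.

    attrs: iterable of attribute names, e.g. {"ID"}
    fds:   list of (lhs, rhs) pairs, each a set of attribute names
    """
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the closure already contains the whole left-hand side,
            # every right-hand-side attribute is functionally determined.
            if lhs <= closure and not rhs <= closure:
                closure |= rhs
                changed = True
    return closure

# Example FDs: ID -> name, dept_name and dept_name -> building, budget
fds = [
    ({"ID"}, {"name", "dept_name"}),
    ({"dept_name"}, {"building", "budget"}),
]

print(attribute_closure({"ID"}, fds))
# {'ID', 'name', 'dept_name', 'building', 'budget'} -- ID is a key
print(attribute_closure({"dept_name"}, fds))
# {'dept_name', 'building', 'budget'} -- not a key, so in a combined
# relation the FD dept_name -> building, budget would violate BCNF
```

The same function doubles as a key test: an attribute set is a superkey exactly when its closure covers all the attributes of the relation, which is the quick, relation-by-relation check I am suggesting students apply.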
Coming back to the course projects: that thinking process should be captured, and we should advise our students and our colleague teachers accordingly. What are the alternatives they considered? For example, even for the schema, what are the functional dependencies they actually wrote down? Did they examine whether each functional dependency is preserved or not? Many times you will find that students, particularly those in a hurry to submit something, will either not apply their mind to these alternatives or, even if they do, will not incorporate it in the report. I think it is important for us to emphasize that they do both, because that is where the thinking happens. If you push them to think, and ask them to capture that thinking in a report, it will help.

Now, the teachers' workshop is only of 10 days' duration; we technically call it two weeks because ISTE recognition of a two-week workshop is useful for the teachers. However, in 10 days it may or may not be possible for them to do all of this. That is why we will form teams right from the first day, and this time we will advise the teachers that they will have to do a sort of project, part of which they do within the workshop and part of which they do later. I would like you to emphasize this when you deal with the teams of teachers who come to your place, because if the teachers understand the importance of thinking of alternatives, writing them down, and reporting them, then it is likely that they will insist that the same be done by their students later. I think that is what is important.

So, very quickly, let me just mention: I am going to skip multi-valued dependencies. When an ER model is carefully designed, identifying entities correctly, the tables generated from the ER diagram should not need further normalization. When you translate an ER model into a schema, you will not necessarily get BCNF, but invariably, almost without exception, you will get third normal form if your basic identification of entities is correct. And the key point in the design process to tell people is: please look at each attribute of an entity that you have identified and imagine whether that attribute itself could be a separate entity. Given a set of attributes, when you later say that this attribute depends only on this subset and that one depends only on that subset, what are we really saying? Effectively, we are saying that, by mistake, two different entities have got mixed up in one table. Looking at it while modeling is a much easier and much better process. That emphasis, again, is not indicated in the slides. Perhaps these examples in the decomposition chapter could be used to construct examples for good ER modeling itself: how, for example, a department may mistakenly be written as an attribute and then needs to be identified as a separate entity. That will result in a better process.

There is one more thing here. Temporal data is extremely hard to handle, and this is one example I wanted to give you: how to handle temporal data. Temporal means time: with time, things change, but we are often unable to record what the information was at this time or that time, because usually when you update a table, only the current information is present. Take a simple case, say our faculty positions. Let us say there is an HR database, and my current position and my current salary are maintained in a table.
I update it: my current salary is so and so. What about the past record? When the Government of India says that such and such pay commission will apply from a given date, that date is usually 5 or 6 years in the past. Now, if my database has no information about what my salary was 6 years ago, there is no way you can apply it. So what usually happens, even if you have a good database, is that people go back to their paper files and say, oh, this was the salary. This happens because we are unable to maintain data in a temporal fashion. There are temporal databases, which we are not discussing, but it is important to tell our teachers, and through them our students, that temporal information is extremely important.

I will give you a practical example of a situation where temporal information is required. Let us take insurance policies, say general insurance. A factory has been insured against fire, or a car has been insured against theft, or whatever. For everything insured, there is a peril: it could be fire, accident, and so on. There is a sum insured, that is, the amount which you will get in case the peril occurs, and there is a premium amount that you pay. Let us say this insurance policy, number Pn, is valid from date D1 to date D2, which is typically one year.

Three months later, a fellow comes and says: look, I had insured my car only against third-party claims, but now I want comprehensive insurance, so that even if there is an accident I am covered. I have already paid this much premium; whatever the difference in premium is, I will pay now, so please upgrade my policy. Or let us say a factory was insured against fire, and at a certain point in time the owner wants to insure it against flood as well, which is another clause. Now a very curious situation arises. There is a date D1′ here, between D1 and D2, and he wants this policy Pn to be modified to Pn′. If I had only a single current database, I would update the policy terms, calculate the new premium, collect the difference, and say this policy is valid up to D2 with the new terms. Exactly three days later, a claim comes from that fellow saying that such and such event happened between D1 and D1′: my factory was flooded, please pay me. If I do not have the information about D1′, then I will never be able to know that this fellow is cheating; the flooding actually occurred here, before he was covered for flooding. This is a temporal incident.

Consequently, what insurance companies all over the world, including in India, do is maintain versions. Each version is effectively a separate policy, Pn′, Pn″, and so on, and this is maintained through what are known as endorsements. There is a policy with certain terms; on a certain date you say, I want this additional cover, and then an endorsement is written. This endorsement effectively creates the equivalent of a new policy which is effective from D1′ to D2 only. Another endorsement may come after two months; that endorsement may include the previous one or may nullify some of it, and that will have to be maintained. Now, if a claim comes against the insurance saying, give me this money, you examine the date of the event and determine which of these versions applies to that date.
Then you look at only those conditions which are printed, or rather stored, in that version, and examine whether the claim is justified against those rules. If there are ten endorsements in a policy's life, there will be ten versions that have to be maintained, and all of them have to be maintained as part of the current database, because at any time a claim may come and you want to locate the version for that date. This is one way of maintaining temporal information.

Invariably, in many practical situations such as teachers' salaries or any HR system, you have what is known as a current-information table and a history table, for practically each relation in the schema where changes need to be preserved over time. Typically, an employee's information will have to be preserved in time. Even a course: take a course which was of six credits, and the CPI was calculated as per six credits; or, in most university norms, let us say it was a 100-mark paper. The university changes that norm and says this paper will now be of 75 marks. The name of the subject is the same, the syllabus is the same, everything is the same; only the marks are now 75. That is fine: the students who take that exam from this year onwards will be evaluated out of 75. But then one gentleman from, say, Jharsuguda comes and says: three years ago I had failed, and now I am reappearing for this course. That fellow has to be given marks out of 100. This is a non-trivial business requirement. How do you handle it? You preserve such information in a history table. It is not uncommon, if you have, let us say, a 100-table schema, to find that at least 30 of those tables have an additional 30 history tables.

So, for example, an employee's salary changes. The change in salary is applied to the current table, but you insert a history record which says: the old salary was this much, valid from this date. The to-date entry is initially empty; whenever the salary changes again, the to-date is filled in and a new entry is made for the new salary, the new position, whatever. Now, there is a question of how many fields you provide for changes. You do not provide separate fields per change; you just record the information with a from-date and a to-date saying this row is valid from this date to this date. Even if, let us say, the designation changes at some other time and the salary does not change, you will have the same kind of row, where the designation changes, with its own from-date and to-date.

Obviously, for an employee, if, let us say, five of these attributes can change, typically the salary will change annually, the designation will change 3, 4, 5 times, and there could be other rules which change; every time a change occurs, one tuple goes in. And what should be the key for this tuple? Obviously the EmpID will be there, but EmpID alone is not adequate, because the same EmpID will have multiple rows here. So people artificially add a serial number or sequence number: EmpID plus serial number. If there are 20 rows, it is EmpID-1, EmpID-2, up to EmpID-20, alongside the employee's current record. It is not difficult to write SQL queries for this: given an employee code, with one SQL query you extract the information about that employee from the current table; with another SQL query you extract all the 20 or 50 history rows; and all these rows together represent the information about that employee.
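As a sketch of how this current-table/history-table pattern might look in practice, here is a minimal example using Python's built-in sqlite3 module. All table names, column names, and dates are illustrative assumptions, not from any real HR schema.

```python
# A minimal sketch of the current-table / history-table pattern.
# All names and values here are illustrative assumptions.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE employee_current (
    emp_id      INTEGER PRIMARY KEY,
    designation TEXT,
    salary      NUMERIC,
    valid_from  TEXT               -- when the current state took effect
);
CREATE TABLE employee_history (
    emp_id      INTEGER,
    seq_no      INTEGER,           -- artificial sequence number
    designation TEXT,
    salary      NUMERIC,
    valid_from  TEXT,              -- this row was valid from ...
    valid_to    TEXT,              -- ... up to (but excluding) this date
    PRIMARY KEY (emp_id, seq_no)   -- emp_id alone is not an adequate key
);
""")

def change_salary(emp_id, new_salary, change_date):
    """Close off the old state into history, then update the current row."""
    desig, salary, since = con.execute(
        "SELECT designation, salary, valid_from FROM employee_current "
        "WHERE emp_id = ?", (emp_id,)).fetchone()
    seq = con.execute(
        "SELECT COALESCE(MAX(seq_no), 0) + 1 FROM employee_history "
        "WHERE emp_id = ?", (emp_id,)).fetchone()[0]
    con.execute("INSERT INTO employee_history VALUES (?, ?, ?, ?, ?, ?)",
                (emp_id, seq, desig, salary, since, change_date))
    con.execute("UPDATE employee_current SET salary = ?, valid_from = ? "
                "WHERE emp_id = ?", (new_salary, change_date, emp_id))
    con.commit()

def salary_on(emp_id, date):
    """What was this employee's salary on a given date?"""
    row = con.execute(
        "SELECT salary FROM employee_history "
        "WHERE emp_id = ? AND valid_from <= ? AND ? < valid_to",
        (emp_id, date, date)).fetchone()
    if row is None:  # no history row covers that date: use the current row
        row = con.execute(
            "SELECT salary FROM employee_current WHERE emp_id = ?",
            (emp_id,)).fetchone()
    return row[0]

con.execute("INSERT INTO employee_current VALUES "
            "(1, 'Lecturer', 50000, '2010-01-01')")
change_salary(1, 60000, '2013-01-01')
print(salary_on(1, '2012-06-15'))  # 50000, from the history table
print(salary_on(1, '2014-06-15'))  # 60000, from the current table
```

The point of the sketch is the pair of queries at the end: the current row answers questions about today, while the from/to interval in the history table answers questions about any past date.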
So, this is a minimal way in which temporal information has to be taken care of in real life. Unfortunately, conventional database theory, because at the beginning it emphasized preserving current information, does not handle this. That is what update means: remove the present information and put something else in its place. It does not take care of the temporal preservation of information. This has to be done explicitly.

Let me now very quickly describe the denormalization requirement, by giving you an example which we encountered in a large data warehouse designed for SEBI. SEBI is the Securities and Exchange Board of India, and it supervises stock trading operations across the country. There are two major stock exchanges: the Bombay Stock Exchange and the National Stock Exchange, BSE and NSE. How do you buy and sell securities? All of you are familiar with stocks, but most people like us do not speculate, do not do day trading; we buy a stock sometimes, we sell a stock sometimes, the price goes up, and so on. In real life, that is not what happens. People will keep buying and then keep selling. At the end of the day they might have exchanged zero money and zero net stock, but they would have done maybe 100 crore worth of buying and selling in the process. And you cannot distinguish day trading from regular trading from the nature of the input that the stock exchange receives.

What the exchange receives are orders. An order will mention, let us say, the L&T scrip, and whether I want to buy or to sell. There will be my ID: I am placing an order for buying or selling. I will mention a quantity and I will mention a price. If I am making an offer for buying, and somebody else is making an offer for selling, and the two match, it results in an actual transaction. If not, the order remains. After some time I can put in a fresh order; I can put in another order; I can put in quantities of 10, 20, 30 at different prices. Thousands upon thousands of such orders are put into the exchanges every day, and in fact it is not just thousands: it is not uncommon to have something like 10 million orders a day.

Some of them result in an actual transaction, and that is recorded as a buy-and-sell record: who bought, who sold, the quantity, and the agreed-upon price. That is one record that gets inserted. Obviously, there is a separate table for preserving the orders and a separate table for maintaining these transactions.

What is SEBI's problem? SEBI is typically interested in investigating whether people are doing artificial transactions and causing the price to rise in the market. All this information is available to everyone, and if a lot of people are buying, buying, buying, people will say, oh my God, the price will go up, let me also buy; and that way the price actually goes up. The original fellows who engineered this artificial activity walk out after selling whatever they hold at the higher price. They make a killing, and the other fellows are left in the lurch. There are about 20 different ways of doing such manipulation, which are well known and identified in the stock market itself. One is called circular trading.
So, I sell to you, you sell to me; I sell to you, you sell to me. What is happening? We are increasing the volume of the stock being traded, so suddenly people believe there is a lot of activity in an important share, and the market price goes up. We are buying and selling at roughly the same price, so we two are not affected; we are friends. But somewhere in the afternoon we will both decide to sell off, make our money, and get out. This is considered malpractice, and SEBI would like to catch us.

How will it catch us? We are placing orders; the actual execution may or may not happen, but when it happens, if it happens between me and him consistently, that is what has to be caught. How will I do that? I will do that by taking a join of the orders table and the transactions table. There are additional complications: I want the cumulative quantity that has been ordered by someone for buying, or for selling, up to the time when the actual purchase or sale occurred. There is a time ID in the orders table, which is most important, and there is a time ID in the transactions table as well. How do I calculate the cumulative quantity of orders placed? Up to the point where the actual transfer occurred, I have to add up this quantity, this quantity, this quantity, from one particular individual, and there will be millions of rows, up to the time that individual's name appears in an actual purchase or sale. That is the cumulative quantity. Essentially, we require a join of these two tables; without joining, we cannot have that.

Well, easy: a join is a standard thing that we do. The only problem is that I may have, let us say, 100 million rows here and about 10 million rows there, and that is for one day. Circular trading has to be observed over a period of time, and that period has to be a moving window of six months. How do you do that? You see the tough real-life problem. Efficiency is of fundamental concern, because I want to catch these culprits as early as possible, preferably in real time, preferably by tomorrow morning. It will not help if my join takes about 36 hours to execute. And we are not talking of small PCs and netbooks; we are talking of very large machines, and still the data is very large. The typical data warehouse data for one year is of the order of four terabytes, and they have to preserve data for about seven years, because when investigations start, the question is always: what happened back then? That is how scams get caught. They have one of the most wonderful surveillance systems in the world today; I know, because I was fortunate to lead that effort from the technical side.

But here is the question. When you do a benchmark, if these queries have to be run very frequently by different investigating teams, then you obviously cannot say that the whole machine will be used by just one person doing this query, because simultaneously something else is happening. How do you handle this? The correct way is to recognize that I cannot spend time joining these tables. They are in proper third normal form, in fact BCNF schemas, individually; there is no dependency between them, they are independent relations. The way you denormalize is to recognize that every time an actual buy or sell occurs, you can construct two artificial order records, one for the buy and one for the sell, but you flag them as being the actual culmination into a transaction.
So, what you do is you take this transaction record and break it into two records, a buy order and a sell order, and you put both of these back into the orders table. You already have 100 million such orders, because orders keep being placed, and only sometimes does a match occur. Now, whenever a transaction has happened, I insert, with the same time ID, a record for the party who placed the buy order and actually bought, with that quantity, and similarly for the seller. And I add an additional flag, say actual-transaction. All the ordinary order rows will have false in this flag; for each transaction there will be two records which say true.

Now, if I simply access this table in time ID sequence and pose the query, what was the cumulative amount ordered for buying by this party up to the point when he actually bought, I do not have to do any join. My query is on a single table. In fact, a large number of such queries for detecting fraudulent behavior can be done on a single table.

But what about update? Fortunately, in a data warehouse you do not update, because things which have happened in real life are being recorded. In real life, you may have to provide for updates. Consequently, what you do is keep the orders table and the transactions table, and create a view; while loading the data into the warehouse, you materialize that view. So effectively you have a third table, which is as big, but which preserves this combined information. Alternately, you say: why do all this extra work? All the other information that I want from the orders table is still available; if I look at that table for other purposes while ignoring the records which have the actual-transaction flag set, then I have only the pure order information. So I am augmenting the orders table with records which are not really dummy orders but are actual transactions.

I thought of giving this example because it enormously increases efficiency. Just to give you a benchmark result: when we were doing the benchmark tests for the worst cases and such, the typical execution time when we attempted the joins was of the order of 36 minutes. When we made this change, and that change of course requires that the data for all seven years be loaded in this fashion, which may take two days, although on a daily basis it is easy, since each day you take the transactions table and push the flagged records in as preprocessing; after having done that, the same query was of the order of 4.5 seconds. I hope you appreciate the difference between 4.5 seconds and 36 minutes. That is what you can get by denormalization.

This is probably an extreme example. Let me see if I can get SEBI's permission to write this down and give it as one of the additional examples we could give to our teachers of how useful denormalization can be. It is a drastic example, but in the normal case also, you do not run your normal applications on very large machines, so there are realistic constraints even there. Consider a situation where a certain query takes of the order of several minutes versus of the order of several seconds; if that query has to be run very frequently, then it is better to denormalize. This is one example of denormalization.
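To make the trick concrete, here is a minimal sketch, again with Python's sqlite3, of folding the transactions back into the orders table as flagged rows. All table names, column names, and numbers are illustrative assumptions; this is not SEBI's actual schema or data.

```python
# A minimal sketch of the denormalization described above: trades are
# folded back into the orders table as flagged rows, so cumulative
# order quantities can be computed without a join.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE orders (
    time_id   INTEGER,
    trader_id TEXT,
    side      TEXT,     -- 'B' for buy, 'S' for sell
    qty       INTEGER,
    price     NUMERIC,
    is_trade  INTEGER   -- 0 = plain order, 1 = actual transaction
)""")

# Plain orders placed by trader 'X' ...
con.executemany("INSERT INTO orders VALUES (?, ?, ?, ?, ?, 0)", [
    (1, "X", "B", 10, 100.0),
    (2, "X", "B", 20, 101.0),
    (3, "X", "B", 30, 102.0),
])

# ... and an actual trade at time 4: it is broken into a flagged buy
# record and a flagged sell record, pushed into the same table.
con.executemany("INSERT INTO orders VALUES (?, ?, ?, ?, ?, 1)", [
    (4, "X", "B", 15, 102.0),   # X actually bought
    (4, "Y", "S", 15, 102.0),   # Y actually sold
])

# Cumulative quantity ordered by X for buying, up to the moment the
# actual purchase occurred -- a single-table scan, no join needed.
cum = con.execute("""
    SELECT SUM(qty) FROM orders
    WHERE trader_id = 'X' AND side = 'B' AND is_trade = 0
      AND time_id <= (SELECT MIN(time_id) FROM orders
                      WHERE trader_id = 'X' AND side = 'B'
                        AND is_trade = 1)
""").fetchone()[0]
print(cum)  # 60 -- the 10 + 20 + 30 placed before the trade
```

Note that the cumulative-quantity query scans only the one table; the join between a 100-million-row orders table and a 10-million-row transactions table, which was the expensive step, has disappeared into the one-time preprocessing done at load time.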
I think I will stop here.

[Participant:] Excuse me, professor, I have a question regarding this SEBI example. This is very big data, and as you said, these are facts about the past, so the only kind of operation that is possible is mostly the read operation. Are they trying to look at some other data model which will facilitate such read-heavy workloads?

[Answer:] The data model is being developed, but the physical data model and the logical data model are two different things here. The logical data model is what you usually see in a data warehouse; the physical data model is implemented with such denormalization in order to help performance. This is not quite in tune with the kind of data modeling we are discussing here, because by and large, data warehouse modeling is a different thing altogether; but I thought this was a very good example of how much efficiency can be gained by denormalization. And yes, you are right: it is only for read-only queries, and no update is possible.

[Participant:] I ask because I think Google and other people are using a kind of unstructured way of storing data. [Answer:] In fact, at one point people thought that database research had reached saturation, and now people find that data management keeps throwing up newer problems. This is a field which is alive. There is a conference called the Very Large Data Bases conference, which, you will remember I mentioned, we conducted here in 1996. Now, what counts as "very large" has been changing. The first disk pack IBM introduced, if you remember, had a fantastic capacity of 7.25 megabytes: a big, huge disk pack, in fact with 10 platters, for 7.25 megabytes; while your thumb drives today have 4 GB, 8 GB, 16 GB, easily. So "very large" was once megabytes; then it became gigabytes; then terabytes; today it is petabytes. The processing power has also increased, and yet the problems at the cutting edge of functionality always demand more. As an added attraction, back-end databases are increasingly being used by web front ends, and therefore the users, who used to number 10, 20, 100, or 10,000, are now a crore. So, life is fun. I think we will conclude with that.