So, we will look at the notion of normal forms. We are not going to cover the normalization theory per se; that is covered in the fifth chapter of Sudarshan's book and you can read that up. What I will try to do is give you the basic concepts of functional dependencies, and show how, even if you start with some arbitrary collection of data in tables, you can examine the schema of those tables and arrive at the right design in a normalized form. That is what we shall look at.

Then we will move over to a different topic, where we will discuss modern multi-tier applications. Currently, what you have been doing in the labs is connecting from a front-end PC to a back-end database. Traditionally you call that a two-tier architecture, but a better architecture is the three-tier architecture, and in general the multi-tier architecture; we will see what the issues there are. More specifically, we will look at the necessity of integrating multiple applications, not just within the organization but across organizations. This has given rise to a more recent standard, still emerging though a lot of it is already well established, called service oriented architecture. We shall see what exactly service oriented architecture is, because our endeavor is that the new applications we develop should generally comply with SOA. Finally, we will look at performance modeling: what the basic issues of performance are. We will look at a little theory there; in fact the theory contains a result called Little's Law, which we shall look at, and we shall also look at the notion of benchmarking. Much of that material comes courtesy of Dr. Rajesh Mansharamani of Tata Consultancy Services; we are using his slides, which he has agreed to release in the open source.

To begin with, in this session we will look at software design. The specific topics are a very brief and quick review of system analysis. We had discussed the ER model in the first session; we had not discussed data flow diagrams, which are the other component of analysis. The only important point that we take for today's discussion from the data flow diagrams is the notion of stores, but it is useful to look at data flow diagrams in general. We will then concentrate on table design, where we will introduce the notion of normalization.

This is the review of an SRS document; I think in the first session I had given you a glimpse of the document. The document would generally have an introduction, then functional specifications, then data models and an associated data dictionary. The data models could, for example, be an ER model or an object model, and there would be a data flow model. The ER model captures the static characteristics of data and their interrelations. The data flow model defines the dynamics: how the data moves across the organization and how it is transformed in the process. Additionally, there are user interface requirements and requirements of interfaces to other systems. There are procedures and workflows which need to be explicitly identified. For the entire software, acceptance criteria must be defined right at the time of system requirement specification, so that if you hand over that activity to someone else, whether an internal group or an outside group, that group knows exactly how the software will be accepted after it is developed.
We generally do not do these kinds of things internally, but I would suggest that this should be done. For example, there should be a user acceptance test which is done by a group other than the software development group. It is very important to keep that partition.

We had looked at the ER model, but let us look at the flow model: how the data flows. As information moves through the system, it gets transformed. We will again go back to the example that I had cited of course registration. At the time of registration each one of you would fill up a form, and that form is the entry point of information. But as the information comes in, if you have designed the internal storage tables using the ER model, for example, you would transform that data. You will not keep the form as it is, or even if you keep it, it will be transient storage. You will essentially transform it, and the persistent storage that you have will be your tables in the database. However, when the information moves further, for example in the analysis of the performance of one individual student where you calculate the cumulative performance index at the end of all grading, that report has to be submitted to the student in a form different from what appears inside your database. In that output, which will be a grade report, you must print the student's name, the roll number, the other particulars, not only the subject code but also the name of the subject, the grade obtained, the total cumulative performance, and so on. None of this information is available in a single table in your database, so there will be another transformation. The point is that as the data moves, it comes in, gets transformed, gets stored, gets processed, gets further transformed, may get further stored, and ultimately goes out to the users. Modeling this flow is an important part of system analysis.

A transformation of any kind represents a system function. A system function means there is some application logic that gets applied: how do you extract this information, what do you multiply it with, how do you add this and that. That is the specification of your application logic or algorithm. Correspondingly, any transformation which describes a system function will ultimately get implemented as a program module. We now distinguish between data and program modules. Identification of program modules, and of the specifications for the functionality each program module should have, is the job of the analysis of the data flow. Such flow and the attendant transformations are captured in the model. There are multiple models again; data flow modeling, or data flow diagram modeling, is one of the ways. The data flow diagram is a diagrammatic tool and is therefore easy to understand, at least for simple applications, where it can be read comprehensively. As the application becomes more complex, the data flows will be many more. Consequently, the data flow diagrams themselves are organized in a hierarchical model, as we shall see.

Here is an example of a data flow. Consider participants for a continuing education program like this. Here the participant will fill up a registration form. It could be any student, comparable to the earlier example that we gave; let us consider this to be an example of actual students interacting with an institution. The first thing that the validation process should do is check whether the student has paid the fees for the semester. Otherwise you cannot permit that student to register.
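As a small aside on the grade-report transformation described above, here is a minimal SQL sketch of the kind of join such a report ends up performing. All table and column names (student, registration, course, roll_no, and so on) are hypothetical, invented for illustration; they do not come from the lecture's actual schema.

```sql
-- Hypothetical sketch: a grade report pulls together data that lives
-- in several different tables, illustrating the "transformation" of
-- stored data into an output form.
SELECT s.roll_no,
       s.name,
       c.course_code,
       c.course_name,
       r.grade,
       s.cpi
FROM   student      s
JOIN   registration r ON r.roll_no     = s.roll_no
JOIN   course       c ON c.course_code = r.course_code
WHERE  s.roll_no = 5012;   -- one student's report
```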
Notice that the registration information is quite independent of the payment of fees. The payment of fees could have been done quite independently, as a different operation. But if it is not done, you have got to cross-check. Please note that this is actually a control specification. You remember we said there are data specifications, processing specifications and control specifications. This is a control specification: unless this is done, do not do that. It needs to be captured somewhere in your analysis.

What you see here are called data stores. There is supposedly an account master, and there is an enrollment master. I am using terminology familiar from COBOL programming, but these could very well be database tables. As a matter of fact, whatever names you give while identifying such stores, ultimately you will transfer them in some appropriate form into database tables. You will notice that this arrow indicates that what flows here is a registration form. What flows here is an error list, in case the processing finds the registration invalid. Of course, this would go to another process, which I have just temporarily called Pnext. There will be another piece of processing logic there; it could involve printing that list, giving it to some clerk, and so on. You will recall that we need to transform the data to give a roll list to a faculty member, so that will be another output that comes out.

You will notice that each of these arrows has a label marking it. These two arrows, however, do not have labels, although we can interpret from the context what information is flowing. For example, here I might be sending the roll number to the account master and getting just a yes or no answer, and here I might be storing all the registration information. But nothing can be left to interpretation in a formal tool. Therefore it is necessary to have such labels associated with each arrow, and as a matter of fact with each symbol. This is a very important thing with whichever modeling tool you use in life: if there is any diagram, there cannot be any component of the diagram which does not have a label, and the report is not complete unless that label is expanded in full detail somewhere. A lacuna left at this stage could cause tremendous problems later in the development.

Here are the basic DFD symbols. An external entity: let us say a donor to an institution is an external entity. A data flow: an account statement. The nomenclature used here actually comes from a system analysis for capturing pledges by alumni, the actual donations given, and so on. So there will be a donor as an entity. There will be an account statement, which may represent a data flow going out of the system to the original donor. There could be a process which could, say, check expenses: whether the expenses under a particular project for which donations have been received are within the limits, how they have been incurred, whether they are required to be cross-checked and validated. All the programming logic will come there. Pledge info is a store. A pledge means a commitment to give some donation in the future. What kind of things will you keep there? You will probably keep the donor's name, some address, the amount pledged, and the modus operandi of the pledge, meaning whether he says "I will give all of this money next month" or "I will give it over the next three years, one-third, one-third, one-third."
Please note that this does not therefore map directly to a table in your database. This only amounts to capturing what information you need to store, because you are doing this at the time of analysis. Many times we get confused, anticipate a table in our final application, and therefore try to see whatever we put as a store in analysis as a table. That is incorrect. During analysis you are just supposed to capture the actual information needs of the organization. For example, if you have no computerized system, you will go through the registers and files and other things that you have, you will just note how things are currently captured, and that is the way you will indicate that storage. It is this, in fact such storage, which is of relevance to our discussion today.

An external entity is a source or consumer of information; that is obvious. An arrow represents flow of data. A process describes a transformation. Although we are not going to discuss this any further, subsequently, when you design and develop applications, it is this process specification which will get converted into some kind of application logic. So this represents the algorithm which gets data from, or puts data to, the arrows, and it contains the processing specification, or p-spec as it is called. Last but not least is the store, which stores the data. I will restate that the store as depicted in this model has nothing to do directly with our tables in the design. Incidentally, some stores may get automatically mapped as tables, but that is not the idea here. A store just says what information is stored, in what form.

What is important here is to have an associated data dictionary. The data dictionary is an extremely important concept. Without a data dictionary, any diagrammatic model is completely useless for the real world. The diagram is good for conveying in a nutshell the basic scope of things to a human reader, but the details which are required for subsequent program development will come from the data dictionary. So, rule number one: each symbol must have an associated label. It is surprising that even in professional organizations, particularly smaller ones, sometimes the data flow diagrams which are submitted do not have labels. I have seen that myself, and I have asked people to redo the work. The reason that happens is that the person who draws the diagram knows every arrow by heart and therefore assumes things. So, as usual, anything that is written by one individual or group must be validated by somebody else. When you do that for actual program code you call it a structured walkthrough; for other documents you call it a review. The review must be done by somebody other than the author.

Data dictionary entries must define each label in terms of its metadata; whatever the metadata is, that should be defined. A process narrative must accompany every process block that you put there. It is not sufficient to say "validate"; exactly what the logic for validation is has to be spelt out. This matters in many existing institutions where the internal process documentation is not strong. We have seen, for example, that the first-time computerization in many organizations actually resulted in the first documentation of the processes that were followed and the rules that were applied. Unfortunately, these were not documented in English, because at that time software engineering was not well understood; people simply wrote programs.
You would know how many of LIC's operational and functional rules are actually available only in the form of COBOL logic and not in the form of an English document. This is bad, but there is nobody to blame, because there was no such discipline then. Ideally, a software requirement specification document will contain a complete description of the processing logic requirements, because that is what gets implemented. So for people like us, the task would now be to construct the reverse documentation, and if we do not do it during the migration, it will never get done. When we migrate, we might retain the COBOL code for application logic, but it is important to reverse-construct this kind of processing specification.

Here is an example entry: a registration form, which may contain roll number, name, course, audit or credit, timetable slot. Just as you describe attributes in greater detail, here too: these are not attributes, they are pieces of information; this is metadata. But you must describe them to the same depth, because that is what will ultimately result in the correct description of your final model or database. Validation information: the input to the store is the roll number; what is returned from the store is name, CPI, fee payment status, whatever.

There is usually a context diagram, which is called the level zero DFD. Remember, I mentioned that DFDs can be in multiple hierarchical levels. That is because when the system is complex, it is not possible to have a single data flow diagram depicting everything; such a diagram would span a whole wall, and you have A4-size pages. So you first start with the hierarchically topmost level, which is called the level zero DFD. This is also called the fundamental system model: a single bubble with incoming and outgoing arrows, mainly showing the external entities involved. There is hardly any processing logic here; it is a system diagram. A functional partitioning of this diagram will then be done to derive several level one DFDs, and each of the level one DFDs will be expanded into level two DFDs, and so on.

Here is a context diagram example for what we call our Centre for Distance Engineering Education Programme, CDEEP. As I said, we conduct these programs and beam those courses live through satellite to various centers. So we have remote centers, and we need information about those remote centers: where each is located, how many students could be there, what courses are offered at that place, what labs they have. Who are the participants in any course? We need the basic information about participants, the courses offered by our center, the information about all courses, when they are offered, and so on. Any prerequisites for a course? Who are the faculty members associated with the teaching of these courses? There could be faculty members who are simply registered with CDEEP but have not yet offered a course, so that could be an independent entity in itself.

"Sir, while drawing DFDs I want to know what should be the level of abstraction: 0, 1, 2, 3, 4, how much?" The numbers do not matter. The criterion is this: ultimately you should be able to design your software without asking any question of any human being. So if some figures are to be rounded off for the calculation of some accounting, that must be stated in the specification somewhere. It is a good question, because these levels of detail are often not captured, and this is particularly seen in in-house teams, because in-house teams have a natural expertise in the domain.
So a whole lot of things which are assumed are just in the minds of people. In a professional setting that is not permitted. So the level of detail is this: always apply the thumb rule that, given the details you have captured during analysis, you put on a programmer's hat or a designer's hat and ask, can I program everything from the information written here without talking to anyone? If the answer is no, more information is required. By the way, a typical SRS document, even for a simple system, is commonly about a 1000-page document. For a complex system it is not uncommon to have six or seven volumes, each of 1000 pages. There is no shortcut to it. By the way, in software development more English has to be written than programs. English or Telugu or Tamil, whatever, it doesn't matter, but it is required.

A good point here: what this shows is merely a context diagram; it is just the beginning. You will then refine and expand this into level 1 DFDs and add more detailed processes. Flow continuity is necessary: if a process at one level showed two inputs and one output, any further refinement must carry exactly those inputs and outputs; a refinement cannot suddenly go back and get some more. Additional flows may emerge within the refinement, but that continuity needs to be maintained. So an expansion must have the same I/O.

Anyway, we digress; we will go back to our normalization topic quickly. But since I am describing the nature of analysis, for the sake of completeness let me also mention that apart from the processing specs, the data flow, the application functionality, and the entity model which captures the static relationships between data and the data description, you also need to describe user interfaces. Modeling interactive screens, for example, which will capture and validate input and do query-response retrieval. Modeling reports: what should the format be on screen, on the printer? Again, as COBOL programmers you would know that a very large and significant portion of your time is spent in writing reports or writing queries. Functionality-wise, the more important functionality might be the complicated application logic for processing transactions, but the amount of time one has to spend is far more on reports and queries. Then there could be some special interfaces, like a PDA interface, or a hardware-software interface. You know, a truck goes onto the weighbridge and automatically the weight is taken: there is some hardware, a spring-loaded weighbridge, and the information comes out not just as a pointer which you can read, but digitally captured and automatically transmitted. These could be special interfaces.

Screen and form layouts need to be very carefully done. Source and target stores must be defined. At this stage you will define the target store or the source store in the context of analysis. Ultimately these stores will get mapped into multiple tables, or multiple stores may get mapped into a single table. And you need to carry this connection between the forms and reports, through the stores, to the ultimate tables, because that is where the data will have to come from. There are operational considerations, which is what we call user-friendliness. People wrongly describe an operating system as user-friendly or not user-friendly; that is the biggest nonsense. An operating system should be programmer-friendly, and the end user should never see an operating system.
The end user should see a screen of your application, and then it does not matter what operating system is used. Keystroke minimization is an important criterion: the number of keys that the end user has to press should be minimized.

The information system design, then, consists of the following (that was the analysis part). The design part will consist of data design, software architecture design, and interface design. You will define a user manual at this stage. When do you write the user manual for your programs? Typically after you have written the program. Not accepted, absolutely not accepted. In fact, unless there exists a user manual, a single line of code cannot be written. So you have to do a lot of unlearning for professional practice. The reason is very simple: I want to implement what I perceive the user requires, not the other way round, where what I have implemented is what the user gets. Those are fundamentally different positions. So: interface design and the user manual, design of transactions and workflows, and design of tests. There are two levels at which these tests are designed, by the way. One is the tests designed by the developers themselves; these are executed by internal developers. There is a completely different kind of test, designed by end users or on behalf of end users, for accepting your software. That activity must not be done by the development group; it must be done by what we call the UAT group, the user acceptance testing group.

We now come to the design of data, which is the purpose of this discussion today: mapping stores into SQL tables. That is what we wish to do. First of all, the design should avoid having redundant information, for obvious reasons: if you have redundant information, then any time the information changes you have to change it at two places, three places, four places. Secondly, the schema must permit the relevant information of our application to be represented properly. This seems like a truism. Isn't that natural? Why am I stating it? Because while designing the tables, if I am not careful, it might so happen that I am not able to properly represent my organizational information. We shall see an example. The purpose of the normalization theory, in fact, is to ensure that by its application you solve both these problems: you ensure that the design avoids storage redundancy, and you ensure that all relevant information is properly represented. To illustrate that something could go wrong, we take the view that we take all the organization-wide information and store it in a single table. Theoretically it is possible: one file contains everything, or one table contains everything. That it is not good is what we need to ascertain, and then decide what would be good. That is where we use the normalization theory.

While we discuss this normalization theory today, let me also point out that you might need to design additional tables. These could be transient tables, transaction-storing tables, or tables storing bulk files which come from some place or need to be sent out. The conventional file system is not dead; it will never be dead. Conventional files need to be used at least at the interfaces of your database, at any external interface. So how do you design those files for bulk input-output? How do you design tables for logs? How do you design tables for audit trails? These are all important issues.
These are typically not covered in the domain data description part. The domain data description part relates to tables which capture the domain-related information; this is workflow-related or process-related information, and it needs to be identified during analysis.

There are many normal forms. You can read the book to study the fourth normal form, the fifth normal form and so on. The normal form, by the way, is a property of a table: a table is in one of the normal forms. Whether you like it or not, a table will always have an associated normal form; the characteristics of the table define which normal form it is in. So it is important to look at the definitions of the normal forms. The first normal form requires that all attributes should have atomic values, that means single values; multi-valued attributes are ruled out. Since we have already discussed and ruled out the use of multi-valued attributes, almost any table that we design will necessarily be in first normal form. It is necessary but not a very vital requirement, as we shall see; the vital requirements come later. The second normal form definition states that all non-key attributes of a table should be fully functionally dependent on the primary key. What is functional dependency? What is full functional dependency? And why should the non-key attributes of a table be fully functionally dependent on the primary key? These are the questions that we need to answer. There is a third normal form which extends the requirement: the non-key attributes must be fully functionally dependent on the primary key non-transitively. We shall see what non-transitive means, and you will be able to do that non-transitive dependence check to get a table into third normal form. There is also the Boyce-Codd normal form, BCNF as it is called, which requires the non-key attributes to be fully functionally dependent not only on the primary key but on candidate keys as well. We are not going to discuss BCNF, or the fourth and fifth and sixth normal forms. In general, if you get all the tables in your schema into third normal form, you can more or less guarantee that it is a good design. So you need not go beyond third normal form ordinarily.

We will now define functional dependencies using the notion that we put all our data in a single table. Do you recognize this table? Do you recall the join operation that we had discussed? This is approximately what would have come out of that join. Since we saw that we are required to do a join in order to get meaningful information, why not say that whatever comes out of that join is the table in which I store the data? Look at the convenience: you can search whatever way you want by putting an index on any one of these attributes, and you can get all the information. So what is wrong in keeping my information in such a form? Let us look at the way information is maintained here. It has roll number, name, hostel, room number, cumulative performance index or CPI, then course code, course grade, course name, course credits: practically all the information that we had seen is required. So what is wrong with this? We notice a few things. Do you notice redundancy here? Name, for example: if Sunita is registered for six courses, her name will appear six times. The course name again is repeated, the course code is repeated.
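As a minimal sketch, the single "everything" table under discussion might look like this in SQL. All column names and types here are assumptions for illustration, following the attribute names used in the lecture; they are not the actual schema from the slides.

```sql
-- Hypothetical universal table: one row per (student, course) pair,
-- carrying all student facts and all course facts alongside the grade.
CREATE TABLE reg_all (
    s_roll   INTEGER,       -- roll number
    s_name   VARCHAR(50),   -- student name (repeats once per course)
    s_hostel VARCHAR(10),   -- hostel (repeats once per course)
    s_room   VARCHAR(10),   -- room number (repeats once per course)
    s_cpi    DECIMAL(4,2),  -- cumulative performance index (repeats)
    c_code   VARCHAR(10),   -- course code
    c_name   VARCHAR(50),   -- course name (repeats once per registrant)
    c_credit INTEGER,       -- course credits (repeats once per registrant)
    grade    CHAR(2),       -- the only fact about the pair itself
    PRIMARY KEY (s_roll, c_code)
);
```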
I mean, the course code of course will have to repeat in the registration data, but the course name is getting repeated, the credits are being repeated. If there is large information like an address, it will get repeated a number of times. What is wrong with that? Space, first of all. But suppose I say disk is so cheap that I can just go to Lamington Road and buy more. There are still processing problems, for example. We shall see what the problems with redundancy are. There are also other problems which we cannot see so clearly; we shall see those when we connect the notion of functional dependency to some observations about operations on this table.

So first let us look at the notion of functional dependency. These are all attributes. What is the key for this table? SRoll and CCode; so SRoll plus CCode is the primary key. These are individual attributes. First of all, functional dependency is defined in terms of an attribute being functionally dependent on another attribute. When you write A → B, it means B is functionally dependent on A, or alternately, A determines B. What it means is that whenever some value appears for A in your table and a certain value appears for B, then it is guaranteed that whenever the same value for A appears again, the same value for B will appear; no other value can appear at all. So they are essentially tied together: B is dependent on A. That is the definition. For example, let us go back to the previous slide. Wherever roll number 5012 appears, the name Sunita appears. It does not matter how many times 5012 is repeated; the name will only be Sunita. There could be another Sunita with some other roll number, so the roll number is not dependent on the name. But whenever this roll number occurs, the name will always be Sunita; therefore the name Sunita is dependent on this roll number. So SName is functionally dependent on SRoll. The same thing you can say about the course name, and the same about the credits. For example, there could be many courses with eight credits, but what is guaranteed is that wherever CS413 appears, the credits will be the same. So credit is dependent on course code. If course names are unique, the credits could also be dependent on the course name; in that case the course name would become something like a candidate key. It is not a candidate key here, because here the primary key is of course SRoll plus CCode.

"Sir, here the primary key is SRoll plus CCode." Yes. "The dependency says the value of B should depend on A. So if A is a combination of attributes, should it depend on the combination?" No, no, no. Functional dependency talks about the dependency of one attribute on any other attribute or on a set of attributes. The primary key relation has not yet been brought into the picture; what we have defined is just the notion of functional dependency. No, no, we are not talking about normal forms yet. Functional dependency can be defined between any two attributes, or between any two groups of attributes. What we are doing is just understanding what a functional dependency is; we will connect it up with the normal forms. It is possible, for example, that something could be functionally dependent on a group of attributes. For example, this name, SName, is fully functionally dependent on SRoll and is also fully functionally dependent on SRoll plus CCode. That statement is incorrect insofar as I am using the word "fully."
SName is functionally dependent on SRoll, correct. SName is functionally dependent on SRoll and CCode, correct. But SName is fully functionally dependent only on SRoll; it is not fully functionally dependent on SRoll plus CCode; it is the other way round from what I said. So what is emerging out of this discussion is this: if I have a combination of attributes, and there is a certain attribute which depends on this combination, whether it fully depends on it or not is determined by looking at the individual components of the combination. If the attribute is functionally dependent on any proper subset of the combination, then it is not fully functionally dependent on the original combination, because it is dependent on just the subset. This is the distinction between full functional dependency and mere functional dependency.

We capture it here. For example, SHostel, SRoom and SCPI are dependent on SRoll; that is obvious. CName and CCredit are dependent on CCode. Grade is dependent on the composite key SRoll plus CCode. Notice that grade is neither functionally dependent on SRoll alone nor functionally dependent on CCode alone. So we say grade is fully functionally dependent on SRoll plus CCode, because it is not dependent on any part alone. However, SCPI or CName is not fully functionally dependent on this combination, because each depends on one of the two parts as well. This is the crux of full functional dependency. Now you will realize, as we discuss further, why full functional dependency on the primary key is necessary.

We shall see that in a moment; let us go back two slides. This is the table in which we have all the information kept. The primary key of this table is SRoll plus CCode. So, barring those two attributes, each of the other attributes is a non-key attribute. Now we come back to the normal form definitions. In the first normal form there is no dependency notion. In the second normal form we say that every non-key attribute should be fully functionally dependent on the primary key. In this particular table we see that it is not so. In fact, the only non-key attribute which is fully functionally dependent on the primary key is grade; grade is the only one. SHostel, SRoom, SName, SCPI are fully functionally dependent only on SRoll. (If an attribute is dependent on a single attribute, it is always fully functionally dependent on it.) Similarly, CCredit, CName, etc. are dependent only on CCode. So they are not fully functionally dependent on the primary key. What this means is that this table is not in second normal form. What the impact of the table not being in second normal form is, we shall now see; we have already mentioned one part. Let us go to the next slide. All non-key attributes should be fully functionally dependent on the primary key; many non-key attributes in our table are not; therefore the table is not in second normal form. What we want to know is: what is the problem if this criterion is not satisfied? Can you guess what the problems could be? We already talked about redundancy, so redundant data storage is a problem. But is storage the only problem? With that alone, I can say disk is not very costly, so let it be. The real problem is in processing: any time a redundant value is updated, I have to update it at multiple places.
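As a hedged practical aside: given data already sitting in such a universal table, a suspected dependency like SRoll → SName can at least be tested against the stored rows. The query below uses the hypothetical reg_all schema from the earlier sketch; it returns a row for every roll number that violates the dependency. Keep in mind that an empty result only shows the current snapshot is consistent with the dependency; a functional dependency is really a semantic rule about the domain, not a property of one data set.

```sql
-- Counterexample finder for the dependency s_roll -> s_name:
-- any roll number paired with more than one distinct name violates it.
SELECT s_roll,
       COUNT(DISTINCT s_name) AS names_seen
FROM   reg_all
GROUP  BY s_roll
HAVING COUNT(DISTINCT s_name) > 1;
```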
So if Sunita changes her name, it will have to be updated at many places. If she changes her hostel room, it will have to be updated at many places. If a course's credits change, or the course name changes, any one of those: anywhere there is redundancy, the same thing. So if Sunita changes her hostel, hostel 10 will have to be updated at multiple places. If the name of course CS634, say Information Systems, is changed to something else, then since a course may be registered for by 300 students, I will have to update 300 records in my file or in my table. That means a lot of donkey work. If this were the only reason, I would say, yes, it is important, but it does not really seem to harm me. After all, if by mistake my table design is not in second normal form, I may end up doing some more tedious updating, but it is not critical for my business; my business should still run. Is that the case? Is there any other possible problem with this design?

This brings us to the point: avoidance of redundancy is of course a critical requirement, but it is not the most critical requirement. So what could be critical? What could happen with this design other than redundancy? After all, something will happen to the information in the table only when we add, update or delete. Consider a deletion. Say Muthukrishnan is registered for a single course, HS412, and he drops the course. He is registered for only one course, and let us say he has some serious family problem; he goes to the director of the institute and says, this semester I want leave, I have to take care of my family. He is given leave, of course, and he is supposed to come back next semester. What will happen when he drops this course? This record will be deleted. When I delete this record, the information about Muthukrishnan is also lost permanently, because there is no other table. Is that acceptable? No. Can I still maintain Muthukrishnan's information without the course registration? The answer, unfortunately, is no, because course code and roll number together are the primary key. If he deregisters, at most you might think of putting some dummy value for the course; if you do that, it becomes an invalid primary key, and that record cannot be kept in this table. So this record will have to be deleted.

Now, you will recall that I had mentioned that an important requirement of the design is that all the pertinent information of the organization must be properly represented. If I am your customer, even if my policy has matured, you might delete me from the policy information; should you delete me from your customer information? Today you don't maintain customer information: you have no information about people who don't have policies. Think of it from the perspective of marketing. How many potential customers do you have in India? 100 crore. You have no information about them. Even when you have information about someone like me who has a policy, the day my policy matures, that information goes. Am I not a prospective client to you? Just as: is Muthukrishnan not a valid student of the organization? When next semester comes and he comes back, would it be alright to ask him, "Who are you?" He will say, "My roll number is 89004039," and the system says, "I have no record; you don't exist." So it is not possible for the organization to store fundamentally important information properly in such a design. Is this related to the normal form? We can see that, yes, it is.
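Here is a sketch of this deletion anomaly against the same hypothetical reg_all table. The roll number 5023 for Muthukrishnan is an invented value for illustration.

```sql
-- Muthukrishnan (say roll 5023) is registered only for HS412.
-- Dropping that single registration deletes the only row carrying
-- his name, hostel, room and CPI.
DELETE FROM reg_all
WHERE  s_roll = 5023
  AND  c_code = 'HS412';

-- Afterwards this query finds nothing: the student has vanished
-- from the database entirely, not just from the course.
SELECT s_name, s_hostel, s_cpi
FROM   reg_all
WHERE  s_roll = 5023;
```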
If we notice why this is happening: all the information related to Muthukrishnan comes from those non-key attributes, his name, hostel number and so on, which were not fully functionally dependent on the whole primary key; they were dependent only on the roll number. So when this record got deleted, he got deleted entirely. So perhaps if our table had been in second normal form, this would not have happened. Let's see whether this is true or not.

Since it is now obvious to us that this table is not in second normal form, we must decompose it. Normalization theory, by the way, is not merely for theoretical analysis; it also provides a way of design. If a table is not in a normal form, you can decompose it into multiple tables such that each of the resulting tables is in the normal form. How do we decompose? Let's say we don't know about Muthu or the semantics of the information; we only know the functional dependencies. What are the functional dependencies here? The name, hostel, room number and CPI are all functionally dependent on one part of the key, SRoll. Course name and course credits are dependent on another part of the key, CCode. And grade is fully functionally dependent on the composite. Now we separate things out. Separating out is not just taking things out; separation, in the case of design, consists of projecting the table. You know the project operation. So I do a project operation on this table, and I project only those columns which contain attributes fully functionally dependent on a part of the key, along with that key. That becomes one table. I do the same thing with the other part of the key. Then the attributes which are fully functionally dependent on the whole primary key I project into a third table. Now you can see that each of the resulting tables is in second normal form. Project SRoll and the attributes dependent on it: that becomes a table with primary key SRoll. Project CCode and the attributes dependent on it: the primary key will be CCode. And project SRoll, CCode and the dependent attribute grade: the primary key will be SRoll plus CCode.

This decomposition, you will recall, is actually the table design you would have got had you followed the ER model: an entity student, an entity course, and the relationship registered-for. These are exactly the tables that came out of mapping those entities into the table design. But that is incidental. In fact, we only note that if we use the ER model as a starting point, we are likely to get tables in second normal form when we map the entity sets and relationship sets into tables.

A lossless decomposition must be ensured, which means we must be able to get back the original information through joins. If I have no ER model and I am following this process, what it means is: I have these three tables, and projection means I have projected the data from the original table into these three tables. Now, if I join these three tables together appropriately, I must be able to get the original table back; that is called a lossless join. Many of you may feel this is obvious. But "lossless" here does not just mean not dropping data from tables. If the decomposition is not properly done and you join the decomposed tables, you sometimes get extra information which was not there in the original table. Such a design, producing spurious extra information, is also called a lossy design.
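Here is a sketch of that decomposition-by-projection in SQL, again using the hypothetical reg_all schema. Note that the CREATE TABLE ... AS SELECT syntax varies slightly across database systems; this is just one common form. The final join should reproduce exactly the rows of reg_all, nothing missing and nothing spurious, which is the lossless-join property just described.

```sql
-- Each projection keeps a part of the key together with the
-- attributes fully functionally dependent on it.
CREATE TABLE student AS
    SELECT DISTINCT s_roll, s_name, s_hostel, s_room, s_cpi
    FROM reg_all;

CREATE TABLE course AS
    SELECT DISTINCT c_code, c_name, c_credit
    FROM reg_all;

CREATE TABLE registration AS
    SELECT DISTINCT s_roll, c_code, grade
    FROM reg_all;

-- Lossless-join check: joining the projections back on the key
-- attributes should yield exactly the original rows of reg_all.
SELECT s.s_roll, s.s_name, s.s_hostel, s.s_room, s.s_cpi,
       c.c_code, c.c_name, c.c_credit, r.grade
FROM   student      s
JOIN   registration r ON r.s_roll = s.s_roll
JOIN   course       c ON c.c_code = r.c_code;
```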
Please note that loss of information does not necessarily mean dropping some information; adding spurious information also constitutes loss of information. The lossless join is to be interpreted in those terms.

In general, then, if we use the ER model and derive our table definitions from that model, we will often get our tables in third normal form. But we must cross-check: each table that you design should be examined in this way. In general, when you do a system analysis, you will be identifying stores, identifying entities, and making a data dictionary which contains all your metadata. You will agree that this metadata will contain all the attributes you have. Ideally, before proceeding further, during analysis you should identify functional dependencies across the whole corporation, and should have, for each attribute, a clear entry which says not only what that attribute is, but what it is functionally dependent on. Is it dependent on one attribute, two, four, five? Clearly, you must focus on functional dependencies involving key attributes. If you have followed an entity model, then you will have identified primary keys, candidate keys, etc. It is important to try to identify such functional dependencies at that point. And you can do that fairly easily, even if you do not know the domain, by looking at the typical values stored against those entities: you can generally figure it out, you can ask questions, you can think it through and apply it.

We will now construct an example of transitive dependency, and we will show why third normal form is essential; we will do that as an exercise here. As I mentioned, there is the Boyce-Codd normal form, or BCNF, which insists on, or is defined in terms of, dependence on candidate keys. There are many more interesting aspects of the normalization theory. If you read the corresponding chapter of Sudarshan's book, you will also find algorithms to determine the normal forms; for example, there is an attribute closure algorithm. Basically, if you have a list of functional dependencies, there are tools to which you feed those dependencies, and they give you alternatives for a third normal form design. If you are starting afresh, that could be one way of doing things. It is not ordinarily done, because functional dependency, sadly, has not been a concept rigorously used during the analysis phase, and therefore that option is not available during design. But if you want to re-engineer a COBOL system, for example, and convert it into a complete database system, you have to examine all the variables in your COBOL programs which represent fields of your files (those are the stores), and then the functional dependencies of those fields, not only on the primary keys of the files you currently have, but on any other primary keys, components of primary keys elsewhere, or composite primary keys. That is a fairly large exercise; unless you do it, you will not be able to guarantee that your final tables are in third normal form. So this is an exercise that should be undertaken.

Now, here is the reference of a book, by the way: Software Engineering, A Practitioner's Approach. Remember, I mentioned the book by Roger Pressman. This is the website, so you can possibly access the slides and other material.
Just as most of us treat Sudarshan's book as the Bible on databases, most software engineering teachers, students and professionals treat this as the Bible of software engineering. So it would be a very, very useful book to have; it describes the entire software engineering process end to end, including of course design and so on.

We will now go back to constructing an example. Let us say I am designing a table for the materials management system of an organization, where I have various parts or sub-assemblies, with codes for them. These parts are supplied by suppliers who are at various places. Then there are quantities on hand available, and so on. I had originally started with a table which had part code, supplier code, user factory code, whatever, everything. And I figured out that that jumble is not in second normal form. So I have decomposed everything, and one of my tables coming out of such projections has supplier code as the primary key. This supplier supplies a certain item; there is a city; and there is a rebate applicable for that particular city, depending upon the distance of that city from my operational location. So, for example, if somebody is in, let us say, Delhi, I might expect a large rebate, because Delhi is a very far-off place, and unless he offers me a rebate I will not be able to purchase from him, because there will be heavy transport costs. So this rebate may be as much as 25%. Say there is somebody in Pune, and the rebate may be only 5%. So this is supplier X, this is supplier Y, this is supplier Z. Z is also from Pune; he supplies some other part. For the moment we assume that each supplier supplies only one part. So X supplies item i1, Y supplies i2, Z supplies i3. And of course, this supplier Z will also give me a 5% rebate. These rebates I have standardized depending upon the distance from my place to those cities. Let us say I am in Mumbai or some such place; I expect these rebates, and everybody who gives me these rebates, I will do business with; otherwise I will not.

If supplier code is the primary key, what are the functional dependencies? Supplier code functionally determines item; is that okay? In fact, it is the primary key, so it also determines city, and it also determines rebate. There is no problem in that: wherever a particular supplier code appears, the same rebate appears. In fact, the key for this could very well be supplier code plus item, because multiple items may be supplied by one supplier. In that case I will rewrite this to mean SupCode plus ItemCode determines rebate, and SupCode plus ItemCode determines city. Is that okay? So if this is my primary key, then city and rebate are both fully functionally dependent on it. There is no problem; this is in second normal form.

What I meant by transitive dependency is this: rebate is also functionally dependent on city. Is that correct? Yes, because I have decided this rebate figure based on the city, on the distance of the city. Now look at the same kind of problem that you had before. First of all, redundancy, which is not so relevant here. But what is important is: suppose for some reason X goes out of business. Then the rebate for Delhi, that information, is lost. Some materials manager, when he identifies a new vendor in Delhi to supply something, will have to figure out what the rebate for Delhi should be. Now he might settle for only 4%, or he might demand 10%.
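As a sketch, the supplier example could be laid out like this in SQL; the names, types and rows are all assumed for illustration.

```sql
-- Primary key: (sup_code, item_code). City and rebate are both fully
-- functionally dependent on the key -- but rebate is ALSO determined
-- by city alone, which is the transitive dependency in question.
CREATE TABLE supplier_item (
    sup_code  VARCHAR(5),
    item_code VARCHAR(5),
    city      VARCHAR(20),
    rebate    DECIMAL(4,1),   -- percent, decided per city
    PRIMARY KEY (sup_code, item_code)
);

INSERT INTO supplier_item VALUES ('X', 'i1', 'Delhi', 25.0);
INSERT INTO supplier_item VALUES ('Y', 'i2', 'Pune',   5.0);
INSERT INTO supplier_item VALUES ('Z', 'i3', 'Pune',   5.0);
-- If supplier X is deleted, the fact "Delhi => 25%" disappears with him.
```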
Whereas the standard practice of the organization was this figure. So notice that because of this transitive dependency, I have a similar problem of not being able to properly represent all the relevant information. This is happening because rebate is fully functionally dependent on supplier code plus item, and is also functionally dependent on city, which in turn is dependent on supplier code plus item. Since the dependency chains like that, it is the dependency we refer to as a transitive dependency. Oh sorry, you can't see this; note the word. So rebate is transitively functionally dependent on supplier code plus item. And I don't like this transitive dependency, because it is creating this problem.

So what is the answer? Simple: project it out. But you can't just project city and rebate out of this; you have to retain something here. So, in the transitive dependency, you first identify the individual dependencies of the components that you are concerned with. City and rebate must therefore form a separate table. You agree that when I put city and rebate in a table like this, it solves the problem of losing information: Delhi will always remain here with 25, Pune will remain here with 5. In fact, I can add Madras, Jharsuguda, Guwahati: for all the cities I can define all the rebates here. And whenever a particular supplier comes in, the supplier-item information is related only to the city. So there are two projections that I do. One: I take the supplier code, the item, and that particular attribute through which somebody else is transitively dependent, and project them here. Two: I take that attribute, and the attribute which transitively depends on it, into a separate table. You agree that these two tables solve the problem in exactly the same way that the second normalization solved the earlier problem. This, then, is the importance of transitive dependencies.

Please note that these examples look very simple because there are very few attributes here. In practice you may have to contend with a total of something like 2000 or 2500 attributes, maybe more, and you never know which attribute is related to which in what way. So identifying functional dependencies is not a trivial exercise at all. But you can also see that it is an important exercise; without it you will not be able to guarantee that your design is good. There are tools; I do not know whether LIC has access to them. Does Rational have anything to do with normalization or functional dependencies? We don't know; you may cross-check that. There were tools, specifically for re-engineering, which could take the functional dependencies as input and prepare what you might call different sets, which are like your table designs: these attributes together, those attributes together, and so on. Anyway, we'll stop here, as I said.
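To close, here is a hedged sketch of the two projections just described, using the same hypothetical names. The point to notice is that the Delhi rebate now survives supplier X going out of business.

```sql
-- Projection 1: the supplier keeps only the city it sits in.
CREATE TABLE supplier_city (
    sup_code  VARCHAR(5),
    item_code VARCHAR(5),
    city      VARCHAR(20),
    PRIMARY KEY (sup_code, item_code)
);

-- Projection 2: the city-to-rebate rule lives in its own table.
CREATE TABLE city_rebate (
    city   VARCHAR(20) PRIMARY KEY,
    rebate DECIMAL(4,1)
);

INSERT INTO city_rebate VALUES ('Delhi', 25.0);
INSERT INTO city_rebate VALUES ('Pune',   5.0);

-- Even if supplier X goes out of business...
DELETE FROM supplier_city WHERE sup_code = 'X';

-- ...the organization's standard rebate for Delhi is still on record.
SELECT rebate FROM city_rebate WHERE city = 'Delhi';   -- still 25.0
```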