Dear students, I hope all of you are fine. Today we are in lecture number 23 of the Database Management Systems course; you know the course code is CS403. In lecture number 22, the previous lecture, we were discussing physical database design. This is the stage at which we implement our logical database design. We have got our normalized logical database design, having passed through all the phases: the initial study, then conceptual database design, then logical database design, then normalization. Now is the stage when we are going to implement our design, our database, practically, and at this stage our concern is to include the efficiency factor in it. See, normalized relations give you an accurate, technically sound design, and that is fine. But when you are at the stage of implementing that design, you think once again: now that the practical implementation stage has come, if you want to make the design efficient, in places you may have to deviate from it.

Why? Dear students, I also mentioned in the previous lecture that at the stage we are studying now, you will get different options, and mind it, they are options. No one will ask you to adopt a particular option; it has to be your own decision, your own analysis. Using the requirements and the working of the database, you decide which particular option to adopt. For example, we studied one thing called coding. Coding is an option; no one will ever tell you to adopt coding for this particular field or that particular attribute. It has to be your own decision: if it makes a significant difference to the storage consumed by the database, then rather than storing the attribute as a whole value, you store a code for it. Again, the condition is that the domain of that field should be relatively small. So, for the things you will see in this lecture, please do not get confused about what we are doing: you are studying different options and the advantages of each. At every stage, when you are designing a particular database, you should ask whether you should apply that step in the physical database design, what difference it will make, and in which situation it is worthwhile. Again, this is a decision that you have to make, and with the passage of time, with experience, you will feel very easy in making it; it is not something to be afraid of. But at this stage it is important to know that there are different options you can adopt at different stages.

In the previous lecture, within physical database design, we studied data volume and usage analysis. Secondly, we studied how to design fields, and within that, how to choose and assign data types. In addition, we studied the implementation of different integrity controls. Today we will start with physical records and denormalization. As I said, you have got your logical database design, rather your normalized database design. Now, when you implement it, when you practically define the records in some tool, in some DBMS, these become your physical records. And at this stage you may adopt a concept, a practice, that is denormalization.
It may seem strange that at just the previous stage there was normalization, where you normalized the tables to get a smart, technically sound design, and now we are saying that we are going to denormalize them. Again, mind two things. First, we do it for efficiency purposes, where we feel that things will become efficient by denormalizing them. And secondly, it is again an option; no one is asking you to do it. Now, what does it mean to denormalize, to unnormalize, the tables? Why, and how? Let us see. In logical database design, we group things that are logically related and accessed through the same primary key: the basic approach was that we combined those things in the form of a relation, and those things were normally accessed through the primary key. In physical database design, on the other hand, fields are grouped as they are stored physically and accessed by the DBMS. What is the difference? In logical design, the main concern is the logical link: things accessed through the same primary key are grouped together. Here, you have to store things physically, and you combine them accordingly. In that combination, in some cases, not in most cases, you will deviate from what you designed in the logical database design. In which situations do we deviate from that routine, from that approach, and why? For efficiency purposes: in some cases, for efficiency, we have to denormalize the relations.

What is denormalization? Its definition: the process of transforming normalized relations into unnormalized physical record specifications. You adopt this process during your physical database design. In general, it may decompose one logical relation into separate physical records, combine several into one, or do both. In logical database design, when you created your relations and normalized them, you got some number of relations, say 15, 20, 50, 100, whatever, depending on the size of your system, and you divided things among them according to their functionality and logic. What happens in denormalization? You merge some of them together; when you merge normalized relations, the result is basically in an unnormalized form, so you have denormalized them. Similarly, you may take a single relation from the logical database design and decompose it into more than one physical record. Or there may be a combination of the two: you merge some and split some. This is how the activity of denormalization is carried out. Again, I am telling you repeatedly that this is an option. You are not forced to do it, and you will not be asked to do it; you do it where you feel that efficiency will increase. The measure of efficiency, of data processing efficiency, is not a complex thing. You can simply take it as this: if user queries are being responded to relatively quickly, if they can be processed in less time, you will say the database has become efficient. Even if you have to do more work as a developer, if the response time for the user improves, it means you should take that decision. Now we will see three different situations in which we should consider denormalizing.
The first situation: merge two entity types involved in a one-to-one relationship into one relation. We are talking about entity types from the conceptual database design. You will remember that when we transformed a one-to-one relationship into the logical database design, the participating entity types became relations, and the primary key of one was added into the other as a foreign key; we have already read which key it was. Even if one of the entity types is optional, so that merging can lead to wastage of storage, if the two are accessed together very frequently, then merging might be a wise decision. Consider it again: you had a one-to-one relationship, and in normalized form you made two tables; the primary key of the compulsory side was added into the optional side as a foreign key. Now the suggestion is that, in this situation, you merge the two relations that were formed from the one-to-one relationship into a single relation. What will happen? If, as in most situations, one side is compulsory and the other optional, then after merging, in the cases where the optional side does not exist, you will have empty fields with null values. Null values mean a wastage of storage. But the point is that if there is a lot of joining between the two tables, then combining them is an efficient decision, a good decision. Why? Because when you have accessed one table and you want the related data stored in the other table, you have to perform another read: you go to the second table, search for the related record, access it, and then display the two together. But if you merge them, both sets of fields are there in the same record, in the same table: one read gives you everything, and you do not need to match related attributes across two separate tables. On these two counts, your efficiency in accessing the data increases.

For example, on the slide you are seeing a one-to-one relationship between two entity types: STUDENT and SCHOLARSHIP APPLICATION. The scholarship application side is the optional one here, which means that in the real-life situation, not every student will necessarily apply for the scholarship. And obviously every application must be linked with some student, so that side is the compulsory one. In this situation, suppose you want the student's data and the application data together; for example, the student's campus address along with the application date and qualification. If you implement these as two separate tables, as you made them in normalized form, the problem will be that when you want the student's address, qualification and application date, you will have to go to the student table, read it, pick the data from there, then take the student's ID and go to the other table.
From there you will read the second table, search for the relevant record, and pick the application data. There are techniques that make this search very efficient, and in one-to-one cases the searching becomes quite fast, but still you have to go to the second table and you have to read it. Now, if we merge these two, we make a single table, a single relation from the two. In it we put the student ID as the primary key, along with campus address, application date and qualifications. As for the application ID: if we need it as data, we can store it; but if it was being used only for the identification of the application, then in the merged table you do not need to store it. It is up to you, up to the requirements. In this case, after merging the two relations, you do not need the application ID, so the student ID serves as the primary key and the remaining attributes of the scholarship application come into the same single table.

What is the loss? The loss is that, since not every student applies for the scholarship, when you enter data into this merged table, then for the students who have not applied, the attributes related to the scholarship application will be null. So this is a wastage of storage. Fine, agreed. But see the advantage: when you want to access both together, the student data and the scholarship application data, both are there in the same table. Now, in which circumstances should you make this decision? First, if the number of students who do not apply for the scholarship is very low, that is, most students apply. In that case, your null values will be few. For example, if we have a thousand students stored and 700, 750, 800 of them apply for the scholarship, then most records carry application data and the wasted storage is not that much; there is a loss, but only in the minority of records. In this situation, we should consider merging the two. Secondly, if the student data and the application data are accessed together very frequently, in many applications, then even if the storage wastage is large you can still think of merging them into the same table, because the many programs that access them together become efficient at the expense of storage. Whether that storage cost matters, and how much, is something you can weigh, calculate, and even take the customer's feedback on; in any case, if you find that the two are accessed together in many applications, then go for merging them. Both versions are sketched below. I hope you understood this situation. Let us go ahead.
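As a minimal sketch in SQL: the table and column names follow the slide's example, but the exact definitions, names and types are my own illustration, not the course's official schema.

```sql
-- Normalized design: two relations for the one-to-one relationship.
CREATE TABLE STUDENT (
    ST_ID     CHAR(8)      PRIMARY KEY,
    CAMP_ADDR VARCHAR(100)
);

CREATE TABLE SCH_APP (                -- scholarship application (optional side)
    APP_ID    CHAR(8)      PRIMARY KEY,
    ST_ID     CHAR(8)      UNIQUE REFERENCES STUDENT,  -- FK from compulsory side
    APP_DATE  DATE,
    QUAL      VARCHAR(50)
);

-- Every combined lookup costs a join, a second read:
SELECT s.CAMP_ADDR, a.APP_DATE, a.QUAL
FROM   STUDENT s JOIN SCH_APP a ON a.ST_ID = s.ST_ID;

-- Denormalized design: one physical record per student. APP_DATE and
-- QUAL stay NULL for students who never applied (the storage cost).
CREATE TABLE STUDENT_MERGED (
    ST_ID     CHAR(8)      PRIMARY KEY,
    CAMP_ADDR VARCHAR(100),
    APP_DATE  DATE,          -- NULL if no application
    QUAL      VARCHAR(50)    -- NULL if no application
);

SELECT CAMP_ADDR, APP_DATE, QUAL FROM STUDENT_MERGED;  -- one table, no join
```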
Now let us see the second situation where denormalization is recommended or suggested: a many-to-many binary relationship mapped to three relations, where queries need data from both participating entity types, so that a join of three relations is required, and that is expensive. When we read about query processing at some later stage, we will see the cost of executing different operations. We have read about joins: a join combines related data from two different tables, where the relationship is established on the basis of equal values of a common attribute between the two tables. From the execution point of view, a join is an expensive operation. It takes time; it takes resources. Why? Because, row by row, you take the foreign key value and try to match it against the other table; you have to access the other table and search for the value. That takes time, and for bigger tables it can be quite expensive time-wise. So a general rule is that a join is an expensive operation.

Now suppose we have a many-to-many relationship. As I was reading out the mapping rule for the binary many-to-many case: we transform the participating entity types into relations, and we also implement the relationship itself as a relation; generally, the primary keys of both participating relations together form the primary key of this third relation. Say we have STUDENT and COURSE with a many-to-many relationship between them. We transform them into the STUDENT relation, the COURSE relation, and a relation for the many-to-many relationship between them, let us call it ENROLL. So we have three relations, and suppose we need the student name together with the names of the courses the student is enrolled in. How will this happen? First, you take the student's ID from the STUDENT table. Then you join it with ENROLL and identify the ENROLL records in which that student ID appears; ENROLL, being the relationship table, contains the primary keys of both STUDENT and COURSE. From those ENROLL records you get the course IDs, and then you join each course ID with the COURSE table to get the course name. This means you have to join three tables: STUDENT with ENROLL, and ENROLL with COURSE. The query below sketches this triple-table access.
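As a sketch, assuming hypothetical STUDENT, ENROLL and COURSE tables keyed as just described, the query that the normalized design forces on us looks like this:

```sql
-- Two joins across three relations: STUDENT -> ENROLL -> COURSE.
-- ENROLL(ST_ID, CRS_ID) carries the primary keys of both participants.
SELECT s.ST_NAME, c.CRS_NAME
FROM   STUDENT s
JOIN   ENROLL  e ON e.ST_ID  = s.ST_ID    -- first join
JOIN   COURSE  c ON c.CRS_ID = e.CRS_ID;  -- second join: the extra cost
```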
This situation comes with a recommendation to avoid the triple join. How? The relation created against the relationship is merged with one of the relations created against the participating entity types. That is, the third relation, ENROLL in our case, you merge with one of the other two. The advantage is that, rather than joining across three tables, you now join just once, between two tables, because you no longer have three tables; you have two.

Let us see this with another example. Suppose we have EMPLOYEE and PROJECT, and against the relationship between them we created a third table, WORK, because the situation is that one employee can work on many projects and on one project many employees work. So this was many-to-many, and the WORK table was created against that relationship. As you can see, employee ID and project ID together are the primary key of WORK, and date hired and salary are the descriptive attributes of this relationship. These are our normalized relations. And it is the same situation as before: when you want a certain employee's name, the projects that employee works on and the dates of hiring, you need to combine three relations, with two joins involved.

When we denormalize, the rule is that you merge the WORK relation with one of the two other relations. In this case PROJECT is the smaller relation, so we merge WORK with PROJECT. With that, you are left with two relations: one is your EMPLOYEE and the other is the merged PROJECT, into which the work data has come. What does your merged PROJECT table look like? It has project ID, project name, employee ID, date hired and salary, and its primary key is project ID plus employee ID. Now, if I ask you which normal form this relation is violating, clearly the second normal form. Why? Because the project name is determined by the project ID alone, which is only part of the key. With that violation of second normal form, we can say this is an unnormalized table.

What is the benefit? The benefit is that, when you want to know which project an employee is working on, what the name of that project is and on which date the employee was hired for it, you simply take the employee ID and join EMPLOYEE with the merged PROJECT table, which now contains the project data as well as the hiring details of the employees on that project. So you get everything from this one table: the name of the project, the hiring date, the salary. You have to join two tables rather than three, and that efficiency is of great use. What is the loss? The loss is that all the second normal form anomalies are now there. There is redundancy, because the project name is repeated: if 100 employees are working on a project, the project name is repeated 100 times. The insertion, deletion and update anomalies are there as well. But if you see that in practice updates are infrequent, for example the project name is rarely changed, then when it does change you simply know you have to change it everywhere. One serious consequence you should note is the insertion anomaly: you cannot enter a project into this table unless an employee is assigned to that project. The merged design is sketched below.
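A minimal sketch of the merged design, again with illustrative names and types of my own choosing, and assuming an EMP table with EMP_ID and EMP_NAME:

```sql
-- WORK merged into PROJECT. PROJ_NAME now depends on PROJ_ID alone,
-- only part of the key (PROJ_ID, EMP_ID): a deliberate 2NF violation.
CREATE TABLE PROJECT_MERGED (
    PROJ_ID    CHAR(6),
    PROJ_NAME  VARCHAR(60),      -- repeated once per assigned employee
    EMP_ID     CHAR(6),
    DATE_HIRED DATE,
    SALARY     NUMERIC(10,2),
    PRIMARY KEY (PROJ_ID, EMP_ID)
);

-- One join instead of two:
SELECT e.EMP_NAME, p.PROJ_NAME, p.DATE_HIRED, p.SALARY
FROM   EMP e JOIN PROJECT_MERGED p ON p.EMP_ID = e.EMP_ID;
```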
So, knowing the drawbacks, you then see whether the advantage outweighs them; if the advantage is more and the drawbacks are less, you should go for it. That was the second situation in which we may denormalize. Let us go ahead. Denormalization situation three: reference data. In a one-to-many situation, when the entity type on the one side does not participate in any other relationship, the many-side entity type is appended with the reference data itself rather than with a foreign key. This situation is related to what we read about coding; in fact, it partly reverses the coding decision. And it happens very frequently in a normal logical database design: instead of keeping a separate table on the one side, you repeat the value from the one side and place it on the many side. So in some cases, if your reference table is not involved in any other relationship and its only purpose is to serve as a reference, you can drop that reference table and its ID and put the data directly into the main table.

Take the student and the hobby as an example. We have already seen that if you have a student table, one student can have one hobby, and one hobby can be acquired by many students, so this is a one-to-many relationship. Earlier, when we talked about coding, we said that instead of placing the hobby value in the student table, we would place just its code. Now we reverse this, because we have to balance against the drawback of coding: that we have to access two tables and then join them to find the value. So here we place the value straight away. The loss is that storage is wasted, because full values take more space than codes; the advantage is that, rather than accessing two tables and merging them to get the required information, you get all the data in the same table. That is the situation in which we merge on a one-to-many relationship. Both versions are sketched below.
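Here is a sketch of both versions; the names are hypothetical, with the coded version corresponding to the earlier coding discussion and the denormalized version replacing it:

```sql
-- Coded version: HOBBY is pure reference data, reached by a join.
CREATE TABLE HOBBY (
    HOBBY_CODE CHAR(2)     PRIMARY KEY,
    HOBBY_NAME VARCHAR(30)
);
CREATE TABLE STUDENT (
    ST_ID      CHAR(8)     PRIMARY KEY,
    ST_NAME    VARCHAR(40),
    HOBBY_CODE CHAR(2)     REFERENCES HOBBY
);

-- Denormalized version: the reference value moves into the main table.
-- More storage (the full value repeats per student), but no join.
CREATE TABLE STUDENT_DENORM (
    ST_ID      CHAR(8)     PRIMARY KEY,
    ST_NAME    VARCHAR(40),
    HOBBY_NAME VARCHAR(30)
);
```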
Dear students, we have now studied the different options of denormalization. I will say again: always see what advantage, what benefit you are getting from the denormalization, and also calculate the drawbacks, the extra work and the possible kinds of error that could arise because of it. Then compare the two. If the benefit of the denormalization is larger and its cost, in terms of extra processing, extra storage or added risk, is manageable, fine: you will arrange the extra space, but the extra risk of inconsistency is something you have to manage. Duplication and the partial-dependency anomalies are drawbacks of denormalization, because an unnormalized relation may always introduce inconsistency. Efficiency is a very important factor in any system development, but before it comes accuracy, consistency. You cannot introduce efficiency at the cost of consistency; that is useless. If the system is not working correctly, what good is it being efficient? So always keep this in mind: your system should work correctly first, and after that it should be as efficient as possible. Denormalization is a possibility, you have seen three situations for it, and you should judge in which case to denormalize. Do not say that in every case you will go for denormalization; no, analyze the situation. Let us move on.

Partitioning. Partitioning can also be thought of as a form of denormalization, but where denormalization led you to merge different relations, partitioning splits the same relation into two or more. Here, the duplication factor does not arise. Two types are possible: horizontal partitioning and vertical partitioning. First of all, what is partitioning? Partition, in simple words, means parts: you divide one table into two or more tables, into however many parts you want. Horizontal means you split it row-wise, and vertical means column-wise. In horizontal partitioning, the table is split on the basis of rows: if you have a table of one lakh records, you may partition it into parts of 50,000 and 50,000, or 40,000 and 60,000, or into three or four parts; it is up to you. Horizontal partitioning brings efficiency because it takes less time to process a small table than a large one: whatever operation you perform, a join, a selection, a search, it takes less time on a smaller table. That is the general rule; beyond it, each type of partitioning has its own characteristics, on whose basis you decide which type you should use. So the idea is that it is more efficient to process a table with a smaller number of rows than a large table.

Partitioning also helps in the maintenance of tables, such as security, authorization and backup. Security: suppose that, out of your one lakh records, some users need access only to one group of records and other users only to another group. Rather than placing them all in the same table, you split the table and give each group of users access to the relevant partition. Another point is failure: if you have one large table and it is damaged, you suffer a bigger loss; if it is in small partitions and one partition fails, the loss is at a relatively smaller level. And as for authorization, if you manage user access partition-wise, it becomes easier to manage the authorization.
Then there is backup: you can take backups partition by partition (we will read about backups in detail later), so backups can be taken more efficiently. Partitions can even be placed on different disks to reduce disk contention. As I said, suppose we have one lakh records and fifteen users trying to access them simultaneously: fifteen different users hitting the same disk means the load on that disk, the contention, increases and the response time suffers. To reduce that rush on the disk, you can create partitions and place different partitions on different disks, so that the users' requests, instead of all going to the same disk, are divided across different disks. This is another benefit of partitioning.

Horizontal partitioning comes in different types. I am mentioning some general ones here; the exact support varies from DBMS to DBMS, and when you use a particular tool you will see which types it provides, but these are the ones generally found. First, there is range partitioning, which is straightforward: you fix a range, meaning you place a check, a condition, on the value of some attribute. Horizontal partitioning is always row-wise, so range partitioning means that, for whatever data you have, you put a condition on one attribute: values in this range come into this partition, values in that range come into that partition. Whether you define two partitions, or three or four, is totally up to you. The advantage is that, when you give range partitions, the different rows fall into different partitions and the size of each individual partition is reduced; but at the same time, if you have a query that spans the entire table, one that runs on all one lakh records, there is no problem: whenever you want, you can view the partitions as a single table. The idea is that you create the table and also define its partitions, and if the partitioning is based on ranges, the ranges are defined with it. When you apply a query to the table as a whole, the DBMS will automatically combine the partitions and run it across all of them; when you want a query or an application to touch only a particular partition, you just name that partition.

An example: take the student ID. You say that students whose ID is from 1 to 1,000 come to partition 1, from 1,001 to 2,000 to partition 2, and from 2,001 to 3,000 to partition 3; if your student IDs range from 1 to 3,000, in this way you create three partitions. The DBMS will automatically look at the key value you have designated and place each record into the appropriate partition. Another example can be age: you can say that people in the age group of, say, 15 to 20 are in one partition, people from 21 to 25 in another, and so on. With range partitioning, whenever you enter a record, you do not need to worry about which partition it is going to; you simply enter the values and the DBMS places the record into the appropriate partition automatically, and if you want to query them all, you simply address the table and the query automatically spans all the partitions. A sketch of range partitioning follows. Dear students, I hope range partitioning is now clear to you.
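As a sketch of the student-ID example, using PostgreSQL-style declarative partitioning (the exact syntax varies from DBMS to DBMS, as noted above, and the names are illustrative):

```sql
-- One logical table, three row-wise partitions defined by ID ranges.
CREATE TABLE STUDENT (
    ST_ID   INT,
    ST_NAME VARCHAR(40),
    PRIMARY KEY (ST_ID)
) PARTITION BY RANGE (ST_ID);

-- Upper bounds are exclusive, so these cover 1-1000, 1001-2000, 2001-3000.
CREATE TABLE STUDENT_P1 PARTITION OF STUDENT FOR VALUES FROM (1)    TO (1001);
CREATE TABLE STUDENT_P2 PARTITION OF STUDENT FOR VALUES FROM (1001) TO (2001);
CREATE TABLE STUDENT_P3 PARTITION OF STUDENT FOR VALUES FROM (2001) TO (3001);

-- The DBMS routes each row automatically; you never name a partition.
INSERT INTO STUDENT VALUES (1500, 'Aslam');    -- lands in STUDENT_P2

-- A query on the table spans all partitions (pruned to P2 here).
SELECT * FROM STUDENT WHERE ST_ID = 1500;
```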
In range partitioning, though, there is a likely problem: your partitions may become unbalanced. You have no control over the nature of the data being entered, so you may face the situation that one partition is heavily loaded with a lot of records while the others have few. For example, suppose you define employee IDs from 1 to 30,000 as partition 1, from 30,001 to 60,000 as partition 2, and above 60,000 as partition 3. It may turn out that the flow of incoming data falls in the 30,001 to 60,000 range, so that partition ends up with, say, 25,000 records, while the partition from 1 to 30,000 has perhaps 2,000 and the one above 60,000 even fewer. Because you have no control over the input, one partition gets many records and the others get few. So a possible problem with range partitioning is that it may turn the partitions unbalanced, and then it will not give you overall efficiency: the applications hitting the heavily loaded partition will slow down, while the ones on the partitions with fewer records will answer quickly. That is the drawback; again, you now know both the advantage and the drawback.

Now let us see the other types of partitioning. There is another type of horizontal partitioning that addresses exactly this drawback of range partitioning, that your partitions can become unbalanced in terms of volume, in terms of load: hash partitioning. You might have studied hashing in data structures, but if you have not, the idea is that you take an attribute value and apply an algorithm to it; from that algorithm you get a value, and based on that value you make a placement decision, you decide where things go. The idea of hash partitioning is that, rather than defining ranges, you simply identify the particular attribute on which the table is to be partitioned and declare the partitioning type to be hash rather than range. You do not need to know which hashing algorithm the DBMS uses or how it applies it; the DBMS will apply an algorithm to the attribute value and decide which partition each record should go to. While placing the records into the different partitions, the DBMS also takes care of the balance of the partitions: they will have almost the same number of records. Mind it, do not think they will always be exactly balanced; there can be some difference, but not the kind of huge difference that range partitioning can produce, like 25,000 records in one partition against 1,000 in another. Hash partitioning keeps things roughly balanced, and you get the advantage that performance across the different partitions stays nearly even, instead of becoming unbalanced as is possible in range partitioning.

The third type is list partitioning. In list partitioning, instead of specifying a range as in range partitioning, or leaving the placement to a hash algorithm, you explicitly specify, for every partition, certain values of the attribute: records with these values come into this partition. You use it where the attribute is categorical and you cannot sensibly give a range. For example, take vehicles and their colors: you say red and black go into partition 1, white and grey into partition 2, and blue into partition 3. You have specified a list of values for every partition; that is called list partitioning. Both hash and list partitioning are sketched below.
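Sketches of both, again in PostgreSQL-style syntax; hash partitioning of this kind needs a reasonably recent DBMS, and all names here are illustrative:

```sql
-- Hash partitioning: the DBMS's own hash function spreads the rows
-- across the partitions and keeps them roughly balanced.
CREATE TABLE EMP (
    EMP_ID   INT,
    EMP_NAME VARCHAR(40)
) PARTITION BY HASH (EMP_ID);

CREATE TABLE EMP_P0 PARTITION OF EMP FOR VALUES WITH (MODULUS 3, REMAINDER 0);
CREATE TABLE EMP_P1 PARTITION OF EMP FOR VALUES WITH (MODULUS 3, REMAINDER 1);
CREATE TABLE EMP_P2 PARTITION OF EMP FOR VALUES WITH (MODULUS 3, REMAINDER 2);

-- List partitioning: an explicit list of attribute values per partition.
CREATE TABLE VEHICLE (
    VEH_ID INT,
    COLOR  VARCHAR(10)
) PARTITION BY LIST (COLOR);

CREATE TABLE VEHICLE_P1 PARTITION OF VEHICLE FOR VALUES IN ('red', 'black');
CREATE TABLE VEHICLE_P2 PARTITION OF VEHICLE FOR VALUES IN ('white', 'grey');
CREATE TABLE VEHICLE_P3 PARTITION OF VEHICLE FOR VALUES IN ('blue');
```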
So we have seen three types of partitioning, and some DBMSs also support different combinations of them; when you use a particular tool, you will see which types it supports. But again, the decision is your job. The first question is: should I partition at all? That is the situation when the table size is too large, when it is slow across different applications, and when you have the possibility of dividing it horizontally; if so, you will say the table should be partitioned. A further part of the decision: if I can place different partitions on different disks, that will also control the disk contention. So the first thing to ask is whether to partition at all. After that, you have got the different partition types that we have seen, range, hash, list, or a combination, and you have to look at how the data flows, how it comes in, and then decide. If you know that the data is evenly distributed and you can specify ranges, then, since the flow of data is even, range partitioning will arrange things well. But if, as we discussed, you cannot give sensible ranges, you have got the hash possibility; and where the partitioning criterion is a set of individual values rather than a range, you have got the list possibility. These are all decisions on which the efficiency of your data processing ultimately depends, and they are decisions you have to make as a developer, as a designer.

Dear students, in today's lecture we discussed two topics of physical database design: one is denormalization and the other is horizontal partitioning. In denormalization, you have seen that the normalized relations you obtained earlier are merged, or split, even though the strictly accurate approach is to keep them normalized, for efficiency purposes. The second thing is partitioning: partitioning means that you split the same table, horizontally in the cases we saw; vertical partitioning we will see in the next lecture. And the last thing: all of these are different options, and you have to decide whether you should apply them or not. That is all. Thank you for watching. See you next time. Bye.