 Assalamu alaikum. This is lecture number 22 of the database management system course and the course code is CS403. Dear students, in the previous lecture, we concluded our discussion on the normalization by discussing a comprehensive example of the normalization where we performed the normalization process on iteration. We also studied how to form, how to identify different types of FDs from the given scenarios from the given business rules. Then we used those FDs in the normalization process and we transformed a relation into the third number form and we also checked the BCNF for those relations. After that, we entered into the next phase of the database development process and that was physical database design. We just touched and introduced the topic. Today, we start our discussion on the physical database design. So, let us start today's lecture. As I mentioned in the previous lecture that one of the inputs that we need for the physical database design is the data usage. So, this is a technique that we use for this data statistics. This is called data volume and usage analysis. As the name suggests, this particular analysis shows the volume of data and the nature of use of that data. And basically, this is sort of an aid for the final physical design about the size and the nature of accesses, the number of accesses of data. Let us see what is it. Statistics about the size and usage of data plays a critical role in the data processing efficiency. Yes. And this is the final step of the logical database design or the first step in the physical database design. If you want to associate this with data volume and usage analysis, means that when you transformed your conceptual database design, then you should do the same conceptual data design that was in the ER diagram. With that, you can add these statistics or when you enter this phase in the physical database design, you should refer to the last one and take the conceptual data design and add these statistics with that. So, this will ultimately help you to design your physical database design. The statistics are collected during analysis phase from the user of the system, may not be accurate, give the tentative and relevant figures. We estimate the size of the data. Now, if you are interested in this question, how can you tell that you have not yet designed the database? The design of the database has not yet been implemented. So, how can now we tell the size and usage of the data? The answer is that you gather this information from the users. The real-world system, the people there, they are not required to be 100% accurate and perhaps they cannot be 100% accurate. But the thing is that you should have an idea and it is obvious that you are not interested in this idea, rather the users of the system who are working with the system for quite some time, who are completely familiar with it, they are giving you these figures. So, it will not be completely wrong, obviously. And then there is another thing that you do not need even the exact figures. You should have an idea, one thing. The second thing is that the relevant figure in this, if we have these entity types, then you should have a relative figure of how many are there and how many are there. This will be more heavily populated or there will be more accesses to the other. So, you will get a relative information, if you get that too, then you will have an idea that when you have to store these things physically, then the efficiency of these things, you need to concentrate more on them. Look, for example, if we talk about an exam system, then if you are developing for an institution, you ask them to do the first thing, you ask them to do the student table. You can ask them, how many records will be recorded in it? Now, one of these things is that the current situation of the institution today, for example, they say we have 1200 people at this time. So, you will not only consider the current statistics, current data, but you should look at the growth ratio in it. Because the future needs in this analysis, the future needs to be considered as well. So, the first thing they say is that we have 1200 students, but every year you say that about 100 people pass out and about 150 new people enter. You will see that every year, on average, 50 people are increasing. So, the next few years, from 4 to 5 to 6 years, from 6 to 6 years, you estimate that how much will be added to it. So, you may say that from 6 to 6 years, it will be around 1500 people. So, the goal of this much is that you can record from 1200 to 1500. And first of all, take the average, you suppose that we say that it will be 1400. After that, what you have defined a table, what you have defined a record, when you take the size of the individual attribute, you get the size of one record. So, how many records are you looking at? First of all, it is 1400. What is the size of one record? Let's say that it is 250 bytes, it is 250's size. So, you multiply it by this, then you can estimate the size of the table. You can use this as a way to store it. How will you store it on the medium? How will you store it? What will be the file organization? So, this kind of data information helps you in your final design. Let me show you an example in which you will see how data usage and volume analysis are represented. The diagram in front of you is an ER diagram. And it is your conceptual data-based design. If you look at this, the entity type has two types of parts, manufactured and purchased. Similarly, we have an entity type here, supplier. Supplier has a many-to-many relationship with the purchased part. So, the quotation is shown as associative entity type. So, this is basically a many-to-many relationship between supplier and purchased. If you look at this, the entity types in these rectangles and the values and numbers represent the volume size. Or you can say that they represent data volume. If you look at the part, it is written as 1000. What does it mean? It means that there are tentatively 1,000 records in this part. And if you look at the same thing as I said earlier, if you know what will be the size of a record, then you can easily calculate what would be an approximate size of the part table. When you implement it in the ultimately physical data-based design. If you look at the sub-types, manufactured part and purchased part, the data volume is written as its size. For example, the manufactured part is written as 400 and the purchased part is written as 700. If you are conscious of what your part is, it is 1000. But when you make sub-types, it is 400 and 700. So, if you sum these two, it will be 1100. So, what does this mean? It is 1,000 and 1100. The reason for this is also apparent in this diagram. And what is that? Because the nature of the sub-type relationship is overlapping. Some of your parts will be manufactured and purchased at the same time. So, the tentative number of records, the data volume in your sub-types, the sum of your sub-types can be higher than the tentative number of records. And you can immediately think that if this sub-type relationship is disjoint, then it is certain that the sum of your sub-types, the sum of their data volume can be equal to the maximum. It cannot be more than that. And why did I say that it could be more than that? Because if the relationship of the sub-type relationship is partial, then there could be some super-type instances that are not linked with either of the sub-types. This means that in that case, the value of your sub-type data volume can be more than the sum of the sub-types data volume. But if you are total, then the sum of your sub-types data volume will definitely be equal to the exact value of the sub-type data volume. So, these are some technical things that you should be aware of. But you should be aware of them. And this information, this data, you should deduct from it. You should take it out from there. It is all there in the diagram. You have to see it this way. Similarly, if you look at the supplier, the data volume in the supplier is written as 50. This means that the maximum is approximately 50 records. Similarly, if you look at the quotation that you have, we have 2,500 of the value that we have written. Why have we written 2,500 here? So, look at the one-to-many relationship that you have between the supplier and the quotation. So, sometimes you do this notation that you have said that the many relationships that we have, you show an approximate upper limit. Now, if you look here, with the figure of many, the difference is that it will be related to the maximum of 50 quotations. Or you can say that one supplier will give the maximum of 50 quotations. It is not the right thing to say. In fact, in one supplier, we will store the maximum of 50 quotations. Before it increases to 50, we either remove it or take it back up. So, it means that the maximum of the related quotations will be 50. On the same basis, the data volume of the quotations is 2,500. This is the data volume. Now, the second thing is the usage analysis. You have different types in this, which ultimately will be converted to the files and how their usage will be. You can see that the arrows that are coming from the outside are showing you the usage. And the arrows drawn here are shown in the dashed line. They are not solid lines. If we take a look at the example, it says 200 with the part. The part has an anti-type rectangle with an arrow coming from the outside. And it says 200. This means that 200 is being accessed. What does it mean? It means that in a special time, what can be a special time? This is only up to you. The diagrams that you are making, you fix the time. You can set the time. You can set it to 1 hour, 10 minutes, 2 hours, 10 hours, a whole day, a whole week. It is totally up to you. The time that you define is that we are denoting the usage analysis based on this time unit. In that, you said in the first example that 1 hour. It shows that the part that is your anti-type, 200 curies or 200 accesses will come in 1 hour. After that, if you look further, the two types of it, one of them is written 70%, one is written 40%. Then the same thing is that when you have the curies, 70% and 40% will increase from 100%. This means that some of them will have such accesses that they are coming on both simultaneously. That is why their number is increasing from 100%. Then besides this, it is written that out of 200, your 140 accesses are coming on the purchase part. And 140 accesses are coming on the purchase part. They are coming through the part. The part that you have expressed, out of 200, it will come after the part of the 140, on the purchase part. The accesses that are coming on the purchase part will show you two things. One is that the accesses that are coming on the purchase part will show you the purchase part. You will see another arrow on the purchase part. It is written on it, 60%. This means that these are independent accesses on the purchase part. The part that is not coming directly on the purchase part. If you look further, between the purchase part and the quotation, there is a line that shows the arrowhead on both sides. This means that from your quotation, you come to the purchase part. If a person accesses the quotation table, then besides that 40 times, in that unit time, 40 times, along with the quotation, he will also access the purchase part. The accesses that are coming on the purchase part are coming from the quotation in that unit time. Similarly, if you look, on the other side of this arrow, 80. This means that from the purchase part, you go to the quotation 80 times. In that unit time, which we also said that we took an hour, an hour, from the purchase part, you go to the quotation 80 times. The purchase part has so many accesses. Some accesses are coming on it via part, 140. Some of these accesses are coming independently on it 60. And some of these accesses are coming via quotation, 40. From here, you must have an idea of which table is more hot in the sense that there are more accesses on it. Obviously, you must have an idea that you can think that about its efficiency, about its security, we need to be more careful about its accesses. Most likely, we need to be more careful about its damage and speed wise, we need to be more efficient because obviously, if it is slow, all the other accesses can affect its performance. You see the same thing between the quotation and the supplier. The application that comes via quotation. Similarly, from the supplier, you have 40 accesses on the quotation. But apart from this, 75 independent accesses are also coming on it. So, this is the whole diagram in front of you. How will you establish it? And then when you will see it yourself or someone else will see it then you will interpret it so that you can see the whole volume of your database and the accesses will be seen. And the individual tables will also see the volume and accesses. Designing fields. This is the next phase of your physical database design that if you have done data usage and volume analysis then it is done. This is the step. Field is the smallest unit of application data correspond to a simple attribute. You can also say that it is an attribute but we call the field more as a physical or practical aspect. When ultimately you are discussing or talking about it because ultimately you have to decide what is your field. For example, your composite attribute you call it composite attribute but you do not talk about the field because the field that ultimately corresponds to the simple attributes. It involves a different decision about fields in this phase. The first thing we have to decide is the data type. The first thing is what is the data type? The proper definition of the data type will be read in the course of the data structure. It may have different meanings but I will explain it again so that you can relate it with that definition. Data type is defined as a set of values along with the operations that can be performed on them. When you talk about data type, values are their operations. They are called a data type. For example, if you talk about the programming languages, for example C++ there is a data type integer. Because you said integer means a certain set of values that when you declare an integer type it can take any value out of these values. It can also be called a set of possible values. The operations are defined as an integer data type. They can be operated the same as the case with float and other data types. When a data type determines a field, basically the domain is associating this field with its legal values. Apart from this, you cannot give it any external value. So what we were talking about is that in field designing, you have to select your data type. Precisely depends on the particular DBMS. We will allocate any data type. In our mind this type will be of this type. We said salary will be numeric value or date of birth or date of joining. But precisely what term will be used and what will be its name. This depends on the DBMS. For example if you look at SQL Server it can be number or integer text or character but when you are in Oracle you will get a var chart or something like that. This depends on the DBMS. But in your mind we have to give this type. But which word depends on the DBMS. Now when you have to allocate any data type then when you take care of these things these are your objectives. The first is to minimize storage space. You have to see the field we are designing what are the possible values As I said you don't have to consider the current situation as well as the future expansion at least 4, 5, 6 years you have to take care of the future. You don't have to think about the future for 50 years. You have to start from now. It is not like that. You can estimate or discuss it with the user. For 4, 5, 6 years you have to see how much expansion can be made. You have to see the same requirement. Don't make it so much tight. For example, you said age. You said the students who are in the exam system can come at the age of 25. Was it necessary? You fixed it. One student came at the age of 30, 30, 30, 40 and left at the age of 26. Since you have restricted it at the age of 25 you said I won't let this student enter. You don't want to enter the system. As long as the student's data is not in the system then your system is not a student. So you have to keep it flexible so that there is a bit of expansion. The minimum that you can take as much as you can the type that you take at least you can take it. For example, to store the number generally in languages you have byte integer and long integer. What do these three types do? They provide you different size different domain you can give it from here to here you can give it from here to here. First, you have to store the age. You can define it there long integer. Long integer can store up to 1 crore. What did you think that it could store up to 1 crore? If it was the age of the galaxy then it could have been. But if you wanted to store if you did this basically you wastage the storage. So whatever you see in the field you can see how much it can expand the maximum range. The amount of storage that it can fit you can type it. It also has represent all possible values. You just said you can squeeze the storage so you don't have to do this. The possible value that you could take as I have given you an example of the age of 25 to 26 if you try to be 25 to 26 then there is no such thing. In fact, according to the future needs the possible value can be accommodated there. Improve data integrity In this, the data that is moved should be the same as it should be. For example, you said age. If you want to define age then you should try to get data that is actually age. What difference does it make? The difference is that if you say age I represent it as text. The numbers that are digits can also be considered as text. But when you declare it as text or you declare it as alphanumeric then the problem will be that if a person enters age by mistake x, y, z, a3, alpha, bravo then if you type it as text then your Dbms will not object. Why? The value that you declared that is being currently entered that is legal according to that data type. So, if you do something like this then you are creating a provision that the data integrity gets disturbed. Now, don't bring this to your mind that how can someone enter x, y, z in age? How is this possible? This is possible. Look, a very strange thing that should be dangerous is that a person develops and someone else uses it. The person who develops it he is very much familiar with it. He knows the behavior of the system how the system behaves and how it behaves. Another thing is that that is a computer professional who knows about the computer and knows that I should give it a number but mind it. The person who uses it may not be familiar with it from the computer and such things should not be understood of a mistake of an error. So, you have to take care of all those possibilities. So, this is also a way that the data integrity should be maintained in a database that is legal, correct, possible. For example, support all data manipulation. This is also linked to a data type on the field. The kind of operation that is possible in that environment that you are working in the data type that you allocate to the field supports all those data types or those data types can be operated. Just now, as I defined the data type I said that the data type is operations and values and operations. This can be operated on this data type. You have to see that the field that I am going to define I am going to associate a data type on that field in this environment. According to that you select a data type. You consider the values and the operations as well. For example, as I told you it is a basic salary. A basic salary is for an employee based on that you have to calculate the value of the increment. For that you have to calculate the mathematical value. If you store it in the form of character or text then the text that is your data type does not support the multiplication. Then what will you do? Then you have to do something else. It is better to allocate such a data type that is legal possible that you have to do on that particular field in your organization. Basically these were the objectives that you select while assigning a data type to a particular field. I hope you have cleared it. Coding techniques Again this is storage efficiency. That is the field where you have to store the data. If you see that through coding we can save our storage. If we can store it more efficiently then you can apply coding. Coding techniques are possible but the idea is that instead of actual value we can store the code. With that let's see. The attributes with small domains can be replaced by codes. Small domains The attributes with the possible values are less. You can use the code in that case. I told you that all these are different options. No one will tell you whether you should do it or not. This is purely your own decision as a designer, as a developer. In the case where an attribute is very small there can be many attributes. Similarly if you look at the salary the salary is not the possible value. It is a wide range. There can be any value. If you look at the name I think of coding the names unless it is a specific environment there are so many possible values that if you want to code that won't help you too much. There can be no name coding. We have an example of coding here. Codes can be stored in a lookup table or can be hard coded. You can create a code that has less attributes and you can define the code and create a separate table of its associated value. And the actual table which has a field can be stored only in the code. But if there is any value of one code you can refer to that table. We call this table a lookup table. This table is generally the stage of your physical data design where you introduce them here and create them here. So they don't come in that design. For example, we say age group. We say age group is a place where people have data. There are family members. We say age group. We say that a child a young man an old man and a young man. There are possible statuses of any person. So if we write a child then we write B, A, C, H, C, H, A. How many characters are there? Two, two, two, six and seven. Seven characters are there. There are seven characters in the nine youths. There are seven characters in the youths and in the old man and the old man. Out of these, there are about five to eight characters that you have. We have 15,000 Codes. In 15,000 Codes each of our records has one out of five statuses. So instead of solving these seven or eight characters in each record, what should we do? We should do coding for them. For example, we should do B for the child or B for the child or A for the young man or B for the old man. So you can define these five single characters by yourself. The 15,000 Codes in each record, instead of solving six or seven characters, solving six or seven alphabet, solving six or seven alphabet, you are solving just one character. So you can think that on average, let's say you are solving six bytes on each record. You are solving six spaces. So how many will you save on 15,000 Codes? So many. If you increase the number of records, you can do this and its example is here. Look at this. We have a table of students. To short it, we have three attributes, student ID, student name, and hobby. If you look at the hobby, we have reading, gardening, movies. There can be two, three, four other hobbies that are of general use. If you look at the gardening, there are two, four, six, eight, but no, there are nine characters in this. So if you think about Lahore, especially if there is kite flying, it is a very common hobby. So kite flying will be a big word more than gardening. So if you store it properly, the whole word or the whole hobby, then you will need more space. But you can do coding like this. We have a hobby, we have created a table, in which there is only code and hobby. We have said that R is for reading, G is for gardening, and M is for movies. If we store K here, for kite flying, then this is our lookup table. When we talk about student data, we see that the hobby that is our attribute, instead of storing a complete value, we have just stored the codes. And you can compare yourself with the actual value that is taking a lot of space. It is taking quite a lot of space. And the code in this case is only one character. That is why I told you that in order to maintain the code simple and number of code small, we have to use the coding only on those attributes that have small set of possible values. In this case, you should not use that code. I hope that the purpose of coding and its approach will be understood. Controlling data integrity concerns the possible values that a field can assume. See, when we talked about data type, I told you that you have already imposed the condition of the field of the field. Finding a data type for a field is also applying a constraint on that field. That is, what you told that this is of type number. Now, the thing that you are sure about is that you cannot enter anything in A, B, C, D. There is no comma, dot, plus, minus, enter. The date that we told you about cannot enter A, B, C, D. There cannot enter five digit consecutive entries. So, defining or associating a data type with a field is the first stage of enforcing the integrity control on that field. Apart from this, integrity control matters. We are discussing here. Apart from data type, the remaining integrity control you can define is default value. Default value means an attribute that we do not give any specific value to. What value should be given to it? This is critical because generally if you do not specify the value of an attribute of a field when you are entering the data, the value assumed is null. Null is not a zero or empty space. Null is a specific value that is null. Null means not defined, not given, not there, nothing. If a number type is a salary, if it is written as zero or null, they mean two different things. Because I have already said that zero is a specific value. But null means no value, not given, not defined. So, if you are doing data enter and you are doing student data enter and you miss student ID, you cannot miss because it cannot be a primary key. You gave student ID and you gave the name of student ID. At that time, you did not enter the student's address based on any reason. There is no need to hurry. You did not enter the student's address. So, if you leave it empty, if you leave it empty, null will come there. This is a general approach. But in some things, if I leave it empty, then this value should come in it. There can be many examples of that. For example, we say that we have a registration system of the educational institution. In that, we say that students who enter mostly talk about their age when they come to school. Most of the time, they are 5 years old. So, if we have a child who is not 5 years old, for example, if he had a mother who was very fond of studying, he would take her at the age of 2. If the school teacher is smart, he would refuse. So, the child is 2.5 years old. In this case, you will enter the age of 2.5 years. But the normal mother in the routine, if she is going at the age of 5, then there is no need to enter a specific value in it. You should leave it. And the DbMS automatically in that record, in that field, the default value, in this case, if it was 5 years, this is also a sort of integrity check. The advantage is that this field will not be undefined without a value. But there will always be a value. Either you enter it specifically yourself or mind it. The default value will be effective when you do not enter the value specifically. Whatever value you enter there, the default value is overwritten. At that time, it will not be because the default value is 5 years old. Now, the child was 2.5 years old. He said, no, I am 5 years old. It is not like that. The value in which you entered will be the same value that you entered. But if you do not enter a value, then the default value will be there. I am explaining it so easily that it may be possible that you are familiar with the concept of default value. What is the role of the default value? And how should you define it? Again, don't do it yourself. Always ask the users of the system who are making the system. Ask them to do a salary. You say, what is the default value of the salary? If they ask you, do that. It is not that by default there are 2,000. And you enter it at 20,000. So always ask them. Range control. If you have declared someone's number, for example, there is a salary. If you say that the number is right, you said that the employee's salary can go up to 5 digits. It can be 10,000, 11,000, 12,000. It can go up to 99,000. The question is that when you say 5 digits, it will go up to 99,000. But the company is not generous enough to give someone a salary. We have the maximum limit up to 25,000. If we give, we are 6,000. But let's keep our limit at 25,000. In that case, there is another type of integrity check. You have to control from the mistake. If the company says that it can take any B.PRO, then from the mistake it can go up to 26,000, 50,000, 70,000 or 80,000. To do this, you have a further check that first you define the data type that the number is right and it will be of this digit. Then you have said that no, even in this digit, even in all the numbers, this is the range. For example, if we talk about HV, the maximum value is 40 or 50. So that if you have given the number, then the age of an experiment will be 200 years. This also helps you to implement a control on the nature of the value that is being entered in the database. Dear students, remember that when we discussed data as a resource, I told you that the data that is going in your database it should be very much accurate. It should go into the perfect data. And you control this through these integrity controls. Why? Because your data provides you a basis for your decisions. Your decisions depend on your business. Bad decisions can bring a bad fit to your business. It can collapse your business. And the basis for decisions is data. So if your data is inaccurate, incorrect, wrong, your decisions will be wrong. When decisions are wrong, you can understand. So always keep in mind that the data that you are entering in the database should have all kinds of checks. What is the meaning of all kinds? It depends on the nature of the data by default. For example, we say that the date we know that the non-leap year is not 29 days or 30 days or when we talk about the name we know that the name doesn't have digits. These are the checks that can be applied by default. This is a constraint that will come from your organization. The organization that you are studying you should ask them what are the checks on it, what kind of values are entered on it, what kind of values will come in it and then you impose them, implement these checks. And the integrity controls we are reading is the same. Apart from this, there are many examples but these are the basics that you will define and implement during your physical database design. Null values, again a very important decision that you have to see which attribute we can give null value to and which not. What is the benefit? The benefit is that when you declare a variable it is not null. You have just declared that it is not null. The concept is beautiful that the Dbms will always take care of it whenever you enter a record the fields that you have declared are not null. They will take their value as important. If you forget because human beings are always there then if you forget even if you make a mistake then Dbms will not make an error. It will always tell you to enter value in this attribute without that it will not accept the record. If you forget it will remind you again to enter value in this attribute. How would you define not null thing? Again this will be defined as a system. The primary key is that it will automatically become not null. Not only the primary key but if it is composite then it cannot take any attribute. The primary key is always not null. Apart from this the attributes that you are compulsory you will declare them specifically as not null but if you do not enter then you will have to declare them as not null. Referential integrity is a beautiful concept. This is an important thing and see how it keeps you on track. The person who is doing the detentor has a straight line that he cannot do the wrong detentor. We have already studied that the thing you have declared as not null or the corresponding home relation will match with the primary key. This is an important thing to say. You have described it briefly but when it is being implemented there is a huge support from the DBMS. You have said that it is not null. Now it will always allow you to enter a legal value. It has a lot of significance. If it is not then all your references will not know where to go. Why? First of all let's talk about the employee and the project. We have added a project ID in the employee in which project it is working. If you get a project reference from a employee's record then you will know that there is no project of this reference. It is working on a project for 5 months according to your database. But when you want to see which project it is then it is not a project. If you look at it like the department and the employee if you look at any of our exam system or library system students and books have added the student ID. If it is not being checked by the Referential Integrity Constraint you have seen a book with the student ID. You have said that it is not a student ID. You can see that it is not a student ID. There is no sign of it. It is a minor but it is also an important thing. How to maintain the integrity of the data helps you to have references and find it. When you develop a database as the size grows there are 10s and 20s and 100s of cross references. One table goes to the other and the third one goes to the third and fourth one goes to the fourth. But when you declare it as a foreign key and you impose the Referential Integrity Constraint on the DBA then these references are valid. And not only they are valid but they are always valid. I have already given you an example that you linked the employee to the project. It was a project at that time. Yesterday we said that we will delete this project. When you put your hand on this project it will tell you that this project is being referenced somewhere else. Think about it. If this employee tells you about this employee then the long-term chain of the primary foreign keys will take care of all the chains. Beautiful, isn't it? Dear students, today we have read about physical database design. The concern of this phase is that the technical specifications that are related with the practical implementation of your database of your database design and this I will say again that the primary factor of this phase is efficiency. Efficiency in the data processing. One thing. Secondly, in this phase you have got multiple options. No one will tell you to adopt this option or this option. All this is going to be your own decision. On which phase? Now you can use your experience and the requirements of the user. And generally, technically, you can think about it yourself. There will be a lot of things that the user did not give you that specification. But as a technical person you can think about it. For example, as a storage the efficiency of storage is not a concern of the user. He does not know. If there are three hard disks the user will say that they are looking but you know that if I do this then there will be three hard disks and two hard disks will work. So this is your concern as a technical person. So in the physical database design the options that we will read we will look at their plus and minus and then you will see who we have to adopt. This discussion is also with us. We will continue in the next lecture.