We continue our discussion to see how the actual relational database management systems of today came about. You must have heard names like Oracle or SQL Server. How exactly did they evolve? What is their architecture? And how has SQL evolved as a programming paradigm? So if you look here, we'll first see what the problems with conventional programming are if you want to build and operate information systems using conventional programming. What are database systems, and what was their early evolution? We'll then look at SQL as a programming paradigm. In the second session today, we'll actually start an introduction to SQL. We will briefly look at the architecture of a DBMS. After all, we write programs in SQL which we expect somebody to execute. Who is that somebody? That somebody is the database management system. And how is that system architected as a piece of software? Because it is itself a collection of programs. So what are those various programs, and what do they do in order to execute our applications, and so on? And then we shall see what is known as application architecture. You must have heard terms like multi-tier architecture, two-tier architecture, three-tier architecture. We'll try and understand what exactly these terms mean. These are nothing but different paradigms of writing an application in such a manner that the different components of our application, including back-end system software such as the database management system and any other components which are required, could either all stay on a single machine or be spread over multiple machines. In the age of networks, as you know, different components work on different machines. We shall see how exactly things are architected in that fashion.

Going back to conventional programming, I presume that all of you are familiar with some programming language, whether it is Fortran or C or BASIC or whatever. The early application software, that is, information system applications, was typically written using a third generation language. 3GL stands for third generation language, incidentally. It's called 3GL because the first generation language was machine language, with which you write programs for the actual processor of a machine, the CPU. These machine languages would naturally be different for every processor: an Intel processor would have its own machine language, a Motorola processor would have its own machine language, and so on. In the early days there was nothing else, so people wrote programs in that. These were binary coded programs: 0 1 0 1 1 0 0 0 0 1 might mean add the contents of this memory location to that memory location, and so on. The next generation evolved when assembly languages were defined. There were programs called assemblers: you write programs in assembly language, feed them as data to the assembler program, and the assembler produces the machine language. This was the first rudimentary translator. Till this time there was actually no concept of generations; there were just machine language and assembly language. But when procedural languages were defined — in 1956 the first Fortran compiler was written, after the Fortran language was released — people started counting by generations, saying that procedural programming languages, which are English-like languages that even non-computer-science people could easily understand and write programs in, constitute the third generation of languages.
When you write programs in procedural paradigms such as Fortran, Cobol, Pascal or C, we notice that we have facilities to define in-memory data structures. For example, we can define variables and arrays — everyone is familiar with variables and arrays; linked lists and trees you may not be familiar with, but these are more complex data structures defined to permit very easy manipulation of data from the perspective of its structure. We do not need to concern ourselves with those here. You also have facilities to define sequential files; in absolutely every programming language you could define sequential files and use them. Most third generation programming languages subsequently started permitting other file structures such as indexed files. We have already seen the notion of an index: you have the data in a file, like the store clerk's ledger, and then the first few pages are index pages. You could have a million records, and searching them linearly or sequentially was useless, so you created an index file. You first located, in the index file, the entry for the particular item you were looking for; that entry contains the key and the position of the record in the data file. And if the file was on disk, you could then access that particular record directly. These facilities were also available. And then, of course, you can write algorithms to manipulate the data.

One problem with 3GL programming was that all of these definitions and the entire algorithm had to be contained in every program that you wanted to execute for any information processing. So if you have three files and all three are used in 20 different functionalities, and for each functionality you have written 7 programs, then every one of those programs must contain those file definitions, must contain the appropriate variable definitions, and so on. This was messy, as I mentioned earlier, because if for some reason you changed the definition of one file then you had a problem: you had to change all those programs. In fact, if you changed a sequential file to an indexed file, then not only did you have to redefine the file definitions in the programming language, but in order to exploit the additional features of the indexed file you had to change the algorithms themselves, because in those algorithms you had to give commands which apply to indexed files and not to sequential ones. So any change in anything had huge repercussions on everything you had written, because, as you can easily guess, you would never build an information system — whether in a 3GL or any other GL — containing exactly one program for everything. You would write hundreds of programs, and you would have tens, sometimes hundreds, of files. So this was becoming messy.

These were the problems. The standard data types were not adequate. Which are the standard data types in a conventional programming language? You are familiar with Fortran or C: you have integers and floating point. So how do you handle money transactions, which require exact arithmetic? 3 rupees 33 paise. How do you represent 3 rupees 33 paise? You would have to represent it as a floating point variable, and you are familiar with the fact that the internal floating point representation is invariably binary, and some decimal fractions have no exact equivalent in the binary number system.
So consequently, when you required exact calculation, you needed a decimal representation of numbers in the programming language. Most programming languages did not permit such a thing; COBOL was one language which did. Date representation had a huge problem. Typically, people represented a date as a 6 digit number, ddmmyy. This resulted in the year yy being represented by just 2 digits. Since computers evolved in the 50s and 60s and nobody thought about what would happen 40-50 years hence, people were comfortable. You will notice that in 1965-66, when information systems started getting built, most applications — such as, say, an employee application — could carry on effectively with this representation. You wouldn't be employing 70 or 60 year olds, so most of the transactions would happen around the current date, and even if you took a 10 year loan, the loan repayment date could still be represented, because it was well within the 2 digit representation; 1900 was assumed as the base. This gave rise to huge issues around the turn of the century. You might have heard about the Y2K problem: the year 2000 suddenly had to be represented by 00, and there was no way to tell whether 00 meant 1900 or 2000. Cleverer people had used an 8 digit representation, ddmmyyyy, which permitted the full year to be represented. Still, it did not permit representation of dates before Christ. So if you are writing a history book and you want to store some structured data from that history book inside a database managed by third generation programming languages, there is no way of saying that Samrat Ashok ruled from roughly minus 300, that is 300 BC, to some other date, et cetera. So you can see there are always some problems.

Second, date arithmetic is weird. You are often required to find out, say: if I have given a loan for 45 days on January 16th, when exactly will the 45 days get over? Can I add 45 to 16 and get a date? I will get some number, but will that number represent the correct date in March which is exactly 45 days away? Because you have to know that January has 31 days, and February sometimes has 28 days, sometimes 29 days. This arithmetic logic for dates had to be incorporated in your program explicitly, because the programming language did not support date arithmetic. Similarly, the interval between two dates cannot be found by simple subtraction, obviously.

Then, each program, as I said, is a complete unit, which means the data description, the metadata, is needed in each program: variable definitions, file definitions. Worse, the actual physical organization of the file is explicitly stated and included in your algorithm, in every program. So even if you kept the data content of the file the same but changed the organization from, say, sequential to indexed, as I mentioned, you would have to change a whole lot of programs. It was very soon clearly visible to most computer science experts that going forward this way you cannot build useful information systems.
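As an aside, this is exactly where SQL's richer data types come in. Here is a minimal hedged sketch — assuming a standard SQL engine, and noting that exact syntax and output format vary a little across products — of how the money and date problems above are handled natively:

    -- exact arithmetic: DECIMAL(10,2) stores 3.33 exactly, unlike binary floating point
    CREATE TABLE payment (amount DECIMAL(10,2));
    INSERT INTO payment VALUES (3.33);

    -- date arithmetic: the engine itself knows month lengths and leap years
    SELECT DATE '2024-01-16' + INTERVAL '45' DAY;   -- 45 days after January 16 is 2024-03-01
    SELECT DATE '2024-03-01' - DATE '2024-01-16';   -- interval between two dates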
There are further problems. Transaction logic to provide the ACID properties is not available intrinsically. What does it mean? Remember what I mentioned: you go to the bank to collect a cheque. The cheque says money is to be transferred from account A to account B. Suppose the clerk is executing a Fortran program, and it has deducted 500 rupees from account A; now it has to credit 500 rupees to account B. And suppose the machine dies. What would happen? In your files, 500 rupees are gone from Mr. A, and Mr. B does not receive anything. The bank is happy, but the customer is not. Suppose, to keep the customer happy, you first give the customer 500 rupees and then debit. What happens if the machine fails after the customer collects the 500 rupees and walks away? So you have a serious problem. That means — remember, I said this is the fundamental property that a transaction must have — if there are multiple steps, like a debit here and a credit there, then either both of them must happen or nothing should happen. And if something goes wrong anywhere, I should be able to recover back to the original consistent state of the data. Doing this was not very easy. We have early systems, including for example our railway reservation system in India, which were developed in Fortran, and their programmers were actually required to write recovery routines. So what they would do is this: a transaction happens — you book a ticket, money changes hands. They would not just update the data files; they would also write a log record onto a magnetic tape or some such file, and then issue the updates. In case something went wrong, the log records would be taken off the tape and examined to see whether a particular transaction was completed, meaning whether logs were available for all the two or three steps required. If yes, they would say the database is consistent; otherwise they would undo those transactions. This is called rollback in modern terminology. All of this was required to be programmed explicitly in Fortran, C or whatever you used. It is not an easy thing, and it is a very basic requirement.

Concurrent usage also poses a problem. Concurrency means multiple people trying to access the same files. Let's say there is a Fortran program written for withdrawing money from an account. Withdrawing money requires that you debit the account holder, hand over the cash, and record the updated balance. Now imagine there is a bank; I go to one window, and without my knowledge my wife goes to another window. I have, let's say, 1000 rupees in the bank and I take out 900 rupees. The logic will debit 900 rupees and is going to write the balance as 100 rupees remaining. But before my program writes 100 rupees, at the other window my wife has asked for 800 rupees, and the program there has read the data from the same file and says yes, there are 1000 rupees. So it gives my wife 800 rupees and gives me 900 rupees, and finally, depending on whosoever is the last person completing the transaction, either 100 rupees or 200 rupees will still be shown as the balance in the file. You can't permit that to happen. So what you'll have to do is: the moment I start this transaction, the moment my program reads this data, it must lock that data, so that till I finish the whole transaction nobody else is able to read it. Now imagine there are 200 users logged in, and each one has a transaction comprising 15 steps, some of which touch this file, some of which touch that file. If everybody starts locking everything, then at any one time only one fellow can execute, and effectively you are back to single-program mode with no multiprogramming. How do you handle such things? People did write such programs, but they said: this is messy, I can't build such programs and use them for every application that I write. So, as I said, crash recovery was not part of the framework, and concurrent usage used to pose a problem.
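In SQL, all of this bookkeeping is taken over by the DBMS. A minimal hedged sketch of the cheque transfer, assuming an illustrative account table — the logging, locking and recovery all happen behind these statements:

    -- both updates succeed together, or neither does
    BEGIN TRANSACTION;   -- spelled START TRANSACTION in some products
    UPDATE account SET balance = balance - 500 WHERE holder = 'A';
    UPDATE account SET balance = balance + 500 WHERE holder = 'B';
    COMMIT;              -- make both changes permanent together

    -- if anything fails before COMMIT, the DBMS undoes the partial work itself;
    -- a program can also abandon its changes explicitly with ROLLBACK;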
Then, there is no security for data access. The data is there on the disk, and there is a Fortran program which, let's say, permits me to withdraw money. It is entirely possible for some shenanigan in the bank to take that program, execute it in the night, and clean up all the accounts. We have no mechanism to say: hey, this is a person who is not a cashier, he is not supposed to access the database, and he is attempting to do so at 11 o'clock in the night — he should not be permitted. That is common sense security, but there is no standard feature in a programming language with which you can build it. You have to program it explicitly; it could be done, but it was difficult. And there is no interface for querying data. If you wanted to extract data — let's say you have put the student data that we described last time in a file — you had to write a Fortran or C program to extract it. There is no direct interface. I have a file on the disk and I just want to say something like: quickly get me the roll number of the person who has the highest CPI. There is no clear-cut interface for such a query.

You can see that all these problems would be very serious if you are constructing a business information system. If you are creating an algorithm for handling sparse matrices, that's a different story. Please understand that the computational algorithms most of you have traditionally been writing are invariably single-user, single-dataset kinds of programs. Security and ACID concerns are not there. If the disk crashes, you can simply re-run the program, read the data, and do everything you wanted. Of course you would like all these features, but they are not essential features; they are merely desirable in most computational problems. And that is why these problems were realized only when people started using 3GLs for application development for information systems.

The notion of database systems evolved in order to address all these problems. There was a committee called CODASYL: CODASYL stands for Conference on Data Systems Languages, which set up a committee for data systems languages. It defined a model called the network model in the late 60s. We still have record-at-a-time processing here, as in conventional programming languages. IDS, IDMS, Total and Adabas are the names of some very popular network database management systems of yesteryear. The entire modelling was around treating the organization of data as a network and permitting access in a peculiar fashion, just as you would traverse a network, but still handling one record at a time. Subsequently a hierarchical model was defined, a slight improvement over this model; IMS, developed by IBM, was its most popular representative. Incidentally, absolutely nobody in the world creates any new application using any one of these today. However, there are several information system applications which were developed using them and are still running today on these pieces of basic software built over the network and hierarchical models.

The inflection point in the development of databases occurred in 1970 when, as I mentioned last time, Professor Codd wrote a paper describing the relational model and for the first time introduced the notion of set-at-a-time processing. That means a relation is processed, resulting in another relation.
So far, all processing using procedural programming languages was one record at a time: you open a file, read one record, do whatever you want to do with it, then go and read another. There was nothing like: here are 20 records, here are 100 records, process them as a natural join and give me something. No access model like that existed earlier. From 1970 to 1974 the evolution continued using Codd's theory. There were two parallel developments happening. One was the development of System R at the IBM San Jose lab, later IBM Almaden, which subsequently resulted in a commercial product called IBM DB2 — and from there, all the modern well-known database products: Oracle, HP's Allbase, Tandem's NonStop SQL, etc., and now MySQL. We will see some of these when we study SQL. Today, in fact, when you say DBMS you implicitly mean a relational DBMS; unless you specify otherwise, you don't mean anything else. The other development: at the University of California at Berkeley, during 1970 to 74, Ingres was developed. In fact, technically, Ingres was ready for usage at Berkeley before System R was transformed into a commercial product, so Ingres is the world's first relational database management system; the time difference was minor because the work was going on simultaneously, and it too became a commercial product. Subsequent to Ingres came various products: Sybase you would have heard of, MS SQL Server you would have heard of, R:BASE, which was an early relational implementation, and so on and so forth. The two workable prototypes, Ingres and System R, proved the practical utility of the relational model for the first time. And ANSI, the American National Standards Institute, standardized this paradigm of relational databases and SQL, starting in the 1980s.

Just as C, Fortran and Cobol are procedural programming languages, just as Lisp and Prolog are AI or functional programming languages, and just as C++ and Java are object oriented programming languages — you will have heard these terms although you might not have programmed in these languages — suffice it to say that the procedural languages you are familiar with represent one way of writing programs. Functional programming is another way; usually you use it for artificial intelligence kinds of things. And object oriented programming is still another way. These are three different programming paradigms. That is why, when you know Fortran, it is easy to learn Cobol or C, because no different paradigm is required; similarly, if you know C++, learning Java is straightforward. But knowing C and having to learn Java is not that straightforward, because new concepts have to be learned. In exactly this fashion, there is now a fourth paradigm of programming, which you may call a special purpose language for managing data. This was needed, and SQL emerged as the answer. So SQL is now the lingua franca of information system programming.

This is the history. In 1970, Professor Codd proposed the model. From 1971 to 79, SEQUEL was implemented in IBM's System R. In 1980, SEQUEL became SQL — still pronounced "sequel" by many people. In 1986, SQL was recognized as an ANSI and ISO standard. In 1992, SQL2 came out with a whole lot of enhancements. In 1999, SQL3 came out, and today SQL3 is the standard. DB2, Oracle, SQL Server are all commercial products; PostgreSQL and MySQL are open source databases, and they compete very well with the commercial databases in many applications. I don't know whether you are familiar with the fact that the Google search engine, which probably handles the largest amount of data in the world,
deploys MySQL open source servers on thousands of Linux machines. These are parallel databases; we will talk about this notion later.

The SQL programming language has multiple components. The first major component is called the data definition language, DDL. Much like in your programs, when you write integer so-and-so, floating point so-and-so, character so-and-so — what are you defining? You are defining the representation of data inside the program, through the variables, arrays, etc. that you will use. In short, you are defining metadata. The metadata definition, the database schema — the relational schema that we talked of in the relational model becomes the database schema in the context of a database. The DDL specifies the database schema, and it can create, modify or destroy tables, indexes and views. An index is not a notion associated with relational algebra, but with every practical database table you have to associate the notion of an index, because after all you want to search things faster. We won't go into the meaning of a view at this stage; suffice it to say that a view is a logical table defined over several base tables through some relational operators. We will see what exactly it means later in the course. In short, the data definition language permits you to define metadata.

One great advantage of this componentization is this. Once you have designed an application — it didn't matter whether you had 100 files or 200 files or 10 files in your conventional application; you would now have 100 tables, 200 tables, 50 tables — the definitions of all those tables are given at one place to the DBMS, and those definitions are handled by the data definition language. By separating out data definition from data manipulation, you achieve one important thing: the data manipulation programs which you subsequently write, which are your algorithms, need not even mention anything about metadata. Every manipulation program knows implicitly that whatever definitions it requires are there in the metadata created through the data definition language, and every program can access them. Consequently, if you change the data definition for any purpose, then unless a change is required by an individual program, you don't need to change that program.

DML is the data manipulation language. The fundamental objective of the DML is to query the data in the database. All the SQL queries that we saw as examples last time are nothing but programs written in DML, the data manipulation language. What we have not seen yet, but which obviously is possible and must be possible, is that SQL has statements which can manipulate the data: insert, delete and update. We have not seen these statements yet, but they must exist, obviously. After all, if you are querying data, then to begin with somebody must have put that data inside; that is called insertion. If a student leaves, you should be able to delete the student's record from the data; you can't do that using a select statement, so there is a delete statement. And if a student's CPI changes, which it does every semester, you should be able to modify or update that value. Consequently you need these statements, and these statements are available. So, DDL and DML — is everybody clear about these?
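To make the insert, delete and update statements concrete, here is a minimal hedged sketch over an illustrative student table — the table, column names and values here are assumptions for illustration, not from the slides:

    -- somebody must first put the data in: insertion
    INSERT INTO student (roll_no, s_name, hostel_no, cpi)
    VALUES ('23B0001', 'Asha', 7, 8.95);

    -- a student's CPI changes every semester: update
    UPDATE student SET cpi = 9.10 WHERE roll_no = '23B0001';

    -- a student leaves: deletion
    DELETE FROM student WHERE roll_no = '23B0001';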
There is a third component of SQL called DCL, the data control language. The data control language permits additional features and functionality. For example, it can grant and revoke authorization for database access. What does that mean? It is actually building security for you. So you can define that these students can access these tables. It's much like file access: only this user has ownership to update, others can only read, only these people can write. But that's a very simple form of control or security. You also want security of the type that if a group of students wish to look at their marks, they can do so only for the courses for which they have registered; I cannot go and check somebody's marks in some other course. Do you agree with this security requirement? Although perhaps it is not so critical here, because at the end everybody knows everybody's marks in an academic environment — but consider salary. Let's say I got a raise recently, and if the raise was very small, I would not like other people to know how much raise I got — a raise of 5 rupees a month, say. Only I should be able to access that, and only my manager should be able to access it. In the United States this is religiously followed — to the extent that, famously, when somebody gets promoted as a manager, invariably the first thing the new manager does is check the salaries of all his earlier colleagues who were at the same level, whose salaries he could not know before; but now he is a manager, he has access. These kinds of access controls are obviously necessary in real life. Take grade change: you authorize only an academic office supervisor to change a grade. In the database, I am a student; I am able to see my grade. Imagine what would happen if I were also able to change my grade. So this security has to be provided for, and this security should not have to be programmed explicitly by anybody. For that, database management systems provide a component called DCL, data control.

So, going back to the previous slide: we have DDL, the data definition language, which permits us to specify the database schema and some other things — this is the definition side. We have DML, the data manipulation language, which is essentially our SQL queries along with the insert, delete and update statements. And we have DCL, which permits data control. There is one more thing which the data control language handles, and that is management of transactions. Remember, we talked about atomicity, and about multiple users attempting to update the data at the same time; that management is handled through the DCL component. Additionally, it also provides auditing of database use: somebody looked at this data, somebody updated that value — who did it, at what time, from where. An automatic audit trail is generated and stored somewhere else, so that if I do some hanky-panky, I can be caught later. Audit trails in accounting are famously critical requirements. Ordinarily, in a programming language, you would have to actually open a file, manage it, write audit records, etc.; here that is not required.

Let's look at the data definition language a bit more closely. It permits us to define tables and attributes. It permits us to define special domains if needed. What is a domain? Mathematically, what is a domain? A domain defines the type of values that can exist in a set which spans that domain. So: the domain of integers, the domain of reals. The domain of integers contains only integers. Or the domain of hostel numbers: in IIT Bombay, that would mean 1, 2, 3, 4, 5, up to 13; 15 will not be there in that domain. The domain of grades: what can that domain contain? It will contain AA, AB, BB, CC, FF, XX and so on — whatever the valid grades are — but nothing else. PQ will not be a valid string in that domain.
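A hedged sketch of how such a domain could be declared — CREATE DOMAIN is in the SQL standard, though not every product supports it, and the grade list here is purely illustrative:

    -- values drawn from this domain can only be valid grade codes
    CREATE DOMAIN grade_type AS CHAR(2)
        CHECK (VALUE IN ('AA', 'AB', 'BB', 'BC', 'CC', 'CD', 'DD', 'FF'));
    -- 'PQ' would now be rejected wherever grade_type is used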
Then, it permits constraints on attributes, enforced by the SQL engine. This is an extremely relevant and important aspect of SQL: you can define constraints on the data such that wrong data does not go in at all. For example, domains: you can define domains to be character and so on, and we can also define constraints on the domains, of the type that I mentioned — enumerations and so on. But more importantly, a constraint definition on an attribute is enforced by the SQL engine. For example, I mentioned grades. I can define a constraint, while defining a table, that this attribute is a grade, and the grade cannot have a value other than this, this, this. What does that mean? No matter what program you have written, no matter what data you try to insert — suppose you try to insert a grade for some student and you have given a grade called PQ. In an ordinary programming language, it is the responsibility of the programmer to check whether that grade is valid or not. If you have defined a constraint like this, then when that value is given to the backend database engine, it will examine it against the constraint definition, say PQ is not a valid grade, and give you back an error. Immediately, your responsibility for writing complex checking code reduces significantly. Indices to be maintained can be specified. Security and authorization information can be specified. All this is done here. Please remember that security and authorization are imposed by the data control language, but they are described or defined using statements that are part of the data definition language.

Here is an example of table creation. You are familiar with the table course? We had an example of course as an entity set; we had modelled it. Look at this table definition and check whether it makes sense to you. It says:

    CREATE TABLE course (
        c_code    CHAR(6),
        c_name    VARCHAR(40),
        c_credit  NUMERIC(4,2),
        CONSTRAINT valid_credit CHECK (c_credit BETWEEN 1.5 AND 20),
        CONSTRAINT course_pk PRIMARY KEY (c_code)
    );

What does all this mean to you? Can you fathom everything that is stated here? Perhaps not. Can you fathom a few things which appear familiar? Well, the table name course looks like our course table and appears to represent the entity set course. The line c_code CHAR(6) appears to define an attribute of course, and appears to say that this attribute shall have values which are 6 characters long. So 'CS 634', 'CS 001', 'HS' something — any string which is 6 characters — could all be values. c_name also appears to be an attribute. This VARCHAR perhaps means that while it is a character string of up to 40 characters, a value need not have exactly 40 characters in it; it is a variable-length character string, in other words. Many of you can guess that. c_credit NUMERIC(4,2): you can again guess that c_credit is an attribute of the course, and it naturally represents the credits of the course. Since credits could be 1.5, there has to be a fractional part; and since credits can be 2 digits — for the second or third stage of a project, etc. — c_credit NUMERIC(4,2) probably means that the credit is a 4 digit number where there is an implied decimal point 2 places from the rightmost digit. So 23.55 is a valid value, although mostly it would be something like 23.50. The word constraint may not mean much to you yet; suffice it to say at this stage that valid_credit is an artificial name that we have given. It does not represent any attribute.
Rather, it represents a special constraint on some attribute. For example, CONSTRAINT valid_credit CHECK (c_credit BETWEEN 1.5 AND 20) is a constraint which says that the value of the attribute c_credit must be between 1.5 and 20. If anybody gives any other value, the engine will not insert that value and will give back an error indicating that the constraint called valid_credit is violated. Exactly in the same way, CONSTRAINT course_pk is used to define the primary key. See, in our model, how did we define a primary key? We underlined it. In a programming language you can't underline. Also, what if there are 2 or 3 attributes which together form the primary key? So this constraint syntax is the one uniform way of defining a primary key: CONSTRAINT course_pk PRIMARY KEY (c_code). In this case, c_code is the primary key. We shall see more about these things; this is just introductory. So, you are now comfortable with the ability to define a table — at least with how attributes can be defined. You of course need to know: just as I have CHAR, VARCHAR and NUMERIC, what other things can I have? These are nothing but the data types of the programming language called SQL. Just as you have integer and floating point there, here you have something else. What else is there, you would like to know — we shall see those things later.

We now come back to looking inside a database management system. Please note that for the entire course, our approach is to look at the data that is stored in the database: how to manipulate it, how to construct information systems, how to write applications. But it is important to understand what is inside the DBMS. How is my SQL handled? How do I write programs using a database? So we shall now see the DBMS internals to some extent. What are the components of a database management system? First, you require a component which will manage the database users. There are these 65 or 85 students who will be accessing the database for CS634; these are the users. What are their rights? Who gets to access what? Who gets to update? A TA can upload marks, a student can see marks, a teacher can award a grade. All these users and their access to the data have to be managed somewhere, so there is a component for that. Then, abstraction levels are provided by some components: there is a notion of a logical and a physical view of data. The logical view of data is what? The schema. Logically you are saying: my student, my course, my registration — these are the tables. You don't care whether these tables are stored as indexed files or sequential files, where they are stored on the disk, whether they are stored on tape, or whether there is a human being writing and reading the data. It does not matter; the logical view is logical. But you also need a physical view, and another great thing the database does is that it separates out the logical view and the physical view, as I mentioned.

Then there is query processing. When I actually give a query, the system needs to process that query to get the results. Consider a natural join: how exactly should that natural join be executed? I will give you one example here so that you understand the notion of optimization. So let's go over to this, my relational expression: σ R1.A1 > 10 (R1 ⋈ R2) — the natural join of R1 and R2, where the attribute A1 of R1 should have a value greater than 10. The join itself can be expanded further: R1 ⋈ R2 is σ R1.A5 = R2.A5 (R1 × R2), where A5 is the attribute common to both relations, with exactly identical values.
That is the join condition. In SQL, how will I write any one of these? I will write:

    SELECT * FROM R1, R2
    WHERE R1.A5 = R2.A5 AND R1.A1 > 10;

There may be a projection instead of *, but it doesn't matter. Now let's just look at the way a computer could execute this. If I execute the relational algebra expression literally, the process will be: first construct the complete cross product — millions upon millions of rows; then do the selection R1.A5 = R2.A5 to find equality and extract only the meaningful rows; and then apply the selection A1 > 10. Suppose after all this you are only going to get six rows. Then it would be pretty stupid for an SQL query processor to compute the entire cross product, select only the meaningful things, and then apply the final selection. I can write this query also as follows: σ R1.A1 > 10 (R1), natural joined with R2. Do you get what I am trying to say? I am saying that if R1 has, let's say, 10,000 rows, then don't do a cross product or join with R2, which may have another 5,000 rows; first select from R1 only those rows which I finally want — these may be only 10. What have I done? I have reduced the work that is required to be done by the computer. Now, a relational algebra expression is actually procedural: I can say first do this operation, then do that operation. But in SQL, who decides what is to be done? How will the system know that doing the join first and then the selection is worse than first selecting only the rows corresponding to the condition and then doing the join? This is the job the optimizer does.

So let's get back to our slides. The query optimizer, which is part of query processing, is a huge, beautifully written set of algorithms which takes an SQL query and looks at: one, the data definitions of the component tables of that query; two, how the physical data is organized — is there an index somewhere; and three, whether statistical information is available about the skew of the data. Using all those complex algorithms, it prepares a plan for how exactly the final data should be obtained for this query. So the query optimizer prepares a plan of data access, and the query execution engine actually executes that plan. It is a field which is rich in research. In fact, Professor Sudarshan, whose database management systems book, written along with Silberschatz and Korth, is one of the prescribed books, is a world-renowned expert on query optimization and query processing. Because researchers like him have worked over decades, the fruits of that research have gone into the actual query optimizers that form part of Oracle, Sybase, MySQL and so on, and people like us automatically get the benefit of that optimal execution. I mean, if you are running a payroll, if you are calculating the salaries of your employees, you don't want the salary calculation to come out at the end of next month, right? You want this month's pay this month. So execution time is important; that is the job of query processing.
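To see the optimizer's rewriting in SQL terms, here is a hedged sketch using the illustrative tables R1 and R2 from the example. A modern optimizer performs this transformation internally: you write the first form, and it effectively executes something like the second:

    -- what the programmer writes
    SELECT * FROM R1, R2
    WHERE R1.A5 = R2.A5 AND R1.A1 > 10;

    -- what the optimizer effectively runs: push the selection below the join,
    -- so the join sees perhaps 10 rows of R1 instead of 10,000
    SELECT * FROM (SELECT * FROM R1 WHERE A1 > 10) AS r1_small
    JOIN R2 ON r1_small.A5 = R2.A5;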
Transaction management — we already saw what it means: there is a component of the database which does concurrency control and which provides crash recovery. And last but not the least, the DBMS does storage management, where it maintains metadata, data, indexes and several other things. So these are the major components.

Who are the database users? First we describe general users. What do general users do? They invoke one of the application programs that have been written earlier, and they execute it. So a railway reservation clerk sitting behind a terminal is a general user: that fellow just logs in; he does not know SQL, he does not know anything else for that matter. So you have to write the application program such that he can easily execute it and do reservations one after another. A manager sitting at a terminal preparing an MIS report, executing some program which somebody else has written — this is also a general user. The others are application programmers: the people who write these programs. There is a variety of different types of application programmers. There are application programmers who write simple SQL queries; that can be done by anybody. And there are application programmers who realize that simple SQL queries are not adequate. While SQL permits set processing — please note that an SQL select takes sets of data from the database tables and creates other sets — you are sometimes required to do record-by-record processing, for example sorting 20 records which form a subset of some other larger set, in conjunction with handling individual records which have come from some other set. For that you require procedural programming. So there is a thing called embedded SQL: you can write C programs or Java programs and put SQL statements inside them, and while your Java program executes, whenever it requires data from the database, it executes that SQL. Then you have to worry, first, about how — while you have a Java program or C program — this SQL will be executed; that is done through connectivity standards called Open Database Connectivity and Java Database Connectivity. Second, you have to be able to form dynamic SQL queries, etc. We shall see those things in a short while.

The layered views of data that are available to us are in this form: at the top are the application programs, then you have the logical view of the data, and then you have the physical view of it. This separation, this modularization, is extremely crucial to the success of database management systems as the fundamental tool for developing information systems. This separation provides what we call data independence, which is a fundamental concept in any database management system. The logical view of the data is the schema: tables, attributes, indices, views, constraints. There is absolutely no notion of where this data is going to be stored on the disk or how it is going to be organized — sequential, relative, indexed — nothing is indicated. The physical view is the actual storage structures on the disk. Now here is the beauty of the separation: when you write application programs using SQL, you need to know only the logical schema, because all your SQL statements are written using that logical schema only. You don't care how the physical data is stored. Physical structures can therefore be modified without affecting the applications. This simple thing is not available in conventional 3GL programming at all: change the organization of a file, and your programs have to change.

There could also be multiple logical views. The database schema provides a single logical view. There are 200 tables.
The number of attributes across all these tables is, let's say, 3000: some table has 10 attributes, some has 15, some has 5, whatever. This is the logical view. But I may want to create a subset of this view for some users. For example, I have the database of students, as we said, and we said that anybody should be able to look at the roll number, name, hostel number and room number of a student, but not the CPI. So I want to create a subset of the view by projection, saying only these columns can be read by anybody. Now, if I create that as a view and define that view — there is a procedure for defining a view — it is possible for the DBMS to treat this view as if it were a logical table. Then I can write queries on that table, and therefore I can write application programs for particular classes of users, and those users will actually be able to see only this view. We shall see this in greater detail; a small sketch also follows shortly. So you have the logical database schema, and then you have the physical data on the disk.

Now, how do application programs work? First, you can write programs which are just SQL statements. Let me give you an example: these SQL queries are issued directly to the query processor of the DBMS, say from a terminal. Of course you have some operating system interface by which you invoke the DBMS; it gives you a prompt, a DBMS prompt, which says: give me an SQL query. You type the SQL query, it sends it straight to the query processor, including the optimizer, runs that query and gives back the results. Second, you can have 3GL programs which embed SQL statements, as I mentioned some time ago. And alternatively, at a more advanced stage, programs can use application program interfaces through which SQL calls can be sent to the database: Open Database Connectivity and Java Database Connectivity. These are the APIs — API stands for application programming interface. Interfaces means these are function calls which are provided by the DBMS, just like an operating system provides function calls: in your Fortran or C program you call that function, and you don't know how the system implements it. For example, you want today's date, the system date. There is no way your program can generate today's date by itself — it will be different every time you run the program — so there is an API which says: go to the operating system and get me today's date from the computer's clock. Such APIs are provided by the DBMS for such programs.

This is the architecture of the DBMS. Users sit outside the DBMS and execute application programs. The application program itself, in turn, issues SQL queries whenever required, and these queries are interpreted by the DML processor and optimizer. For this, the application program — actually, the DML processor — can interact with the DDL interpreter. What is the DDL interpreter? You have used the data definition language to create the schema, the metadata, and everything; we shall see where all this is stored. The query optimizer would like to know: what is the schema for this table? This attribute comes from which table? For that, it interacts with the DDL interpreter, which is a piece of program that looks at your schema definitions and gives the necessary information to the optimizer. All of this is the top layer of the DBMS architecture.
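Before we go below this top layer, let me make the earlier view idea concrete. A hedged sketch, with illustrative table, column and privilege names echoing the student example; the syntax is standard SQL:

    -- a logical table exposing everything about a student except the CPI
    CREATE VIEW student_public AS
        SELECT roll_no, s_name, hostel_no, room_no
        FROM student;

    -- combined with the DCL: general users may read only the view,
    -- never the underlying base table
    GRANT SELECT ON student_public TO PUBLIC;
    REVOKE SELECT ON student FROM PUBLIC;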
Below this is a layer where you have the transaction manager, the authorization manager, and the file and buffer manager, which actually manages the low-level physical disk which contains the data. And below that is your physical data. It includes the data, and it includes the metadata, which is called the data dictionary. Data dictionary is a term we had seen in the context of early modelling: there you define these many entities, these many attributes, and several additional things besides; that was the notion of a data dictionary there. Here, the way the metadata is organized in storage is called the data dictionary of the DBMS. In fact, every database management system provides the complete schema of the metadata storage itself, so that you can write SQL queries to find out how many tables there are, what the attribute names are, what the attribute characterizations are, etc. — a small sketch of this appears at the end.

So what is the query processor? The query processor firstly understands SQL queries, using the schema and user authorization information; so the query processor actually interacts with those modules. It works out an optimal plan for executing the query — remember, I talked about query optimization: what is the best way to answer this query? It chooses indices, uses statistical data, and finally executes the query and sends the results to the application. So if an application program contains an SQL statement, the application program, when executed, sends this SQL statement to the query processor, which optimizes it, finds a plan, executes the plan, gets the results and gives them back to the application.

Transaction management we have seen: the transaction manager ensures a consistent state of data in the database when transactions are executed concurrently, for which it uses the concurrency control manager. In case of a disk, database or OS crash, the transaction manager again ensures a consistent state. We have seen this; it is a separate module called the transaction manager.

Storage management, in general, handles efficient storage, retrieval and updates to the physical data. It manages files, indices, the data dictionary, buffers — everything — and it ensures authorized access to data, which is authorization management. The transaction manager is actually part of this general storage management, because you require it just to ensure that the storage of data stays consistent: in case of a crash you can recover, and in case of multiple people handling things, you still present the facade that only a single fellow is acting at any one time. All this is handled by the storage manager.

Here, in short, is the DBMS architecture revisited. You have users executing application programs. The query processor is nothing but the DML processor and optimizer, including the execution engine. For the execution, it sends the plan down; the plan goes through transaction management and authorization management, and is executed using the storage manager, which has information about files, buffers and indices. Finally, the data is accessed through the physical storage manager. These are the logical layers. Is it roughly clear now? Of course, we are not going to re-examine these internals; this is for our understanding. Whenever I say I am executing an SQL query, what is happening inside is all of this.
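And here is the promised sketch of querying the data dictionary itself — assuming a product that exposes the standard INFORMATION_SCHEMA catalog views (MySQL, SQL Server and PostgreSQL all do, with minor variations):

    -- how many tables are there, and what are they called?
    SELECT table_name FROM information_schema.tables;

    -- what are the attributes of the course table, and their characterizations?
    SELECT column_name, data_type
    FROM information_schema.columns
    WHERE table_name = 'course';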