 I want to spend a little bit of time motivating what is it that we want to achieve as part of the course what our database is about and get a view of the outer level starting from what people make use of why a database is important from the end user perspective down to the interval. So, what is the database? It contains information about a particular enterprise or about a particular task or about something it is a collection of interrelated data. Now, this is important a single spreadsheet is not a database because it just has maybe one sheet or a few sheets which are not connected in any deep way. But in a database there is a lot of connections between information and we will explore this in more detail as part of the entity relationship model which focuses on entities and relationships between the entities those are the connections. A database also has a set of program to access the data and an environment for doing so which is both convenient and efficient. What do we mean by convenient? People should be able to express their data needs without writing a whole lot of code efficient in the sense that things should run very fast without requiring too much input from the programmer. Now what applications are built on this? Well all of you are familiar with this today 20 years back not too many people knew what all things are done with databases. But today you have airlines, trains, reservation, schedules, you have registrations in grades in university, you have shops which sells stuff which record everything that they sell in a database. They have what products they are talking, you know have orders being fulfilled and so forth. If you have any organization you have to keep track of employees or students, we have to keep track of their records, their salaries and so forth a whole human resources management aspect to it and there is many, many, many more the list is endless. Now focusing on a university which is going to be a running example throughout the score those of you who used an old edition of a book know that we used to have a bank database. In the last edition we rewrote huge parts of a book to switch to a university database for multiple reasons. The biggest reason is that this is an application which students are intimately familiar with and we can easily motivate things which are going on because that is something which students can see is important. Among the application programs that are needed in this context are adding new students, instructors, new courses. We want to register students for a course. We want to generate class rosters who all is registered. We want to assign grades or marks for students during exams at the end of a course. We want to calculate transcripts. We want to have maybe feedback during the course with clickers and so on. Now what is the role of a database management system? Even before databases were used, people were using computers for doing data processing. Back in the days with magnetic tapes, data processing was used extensively. In fact, computers really boomed for two reasons. One is military and scientific applications, but the bigger reason was commercial applications for data processing. And all of those early things were based on files on tapes and later files on disks. But working directly with files has many drawbacks. You really want to use a database system. You want to avoid many of the problems which arise when you store data as files in a file system. The first is data redundancy and inconsistency. You have, if you can store data in files in any format you want, people are ingenious. They will come up with every possible format and each file will have its own format. So for reading each file, you have to figure out what format that is in and then read it in, write a program to read that particular format. And then people will store the same information in multiple files. It's very hard to pull information from different files together. And for every single task, you have to write a program, which means it's difficult to access data. Data isolation in this case means that there are multiple files and formats. So combining data from different files becomes harder because you have to get it from different formats. And then there are integrity problems. Let's say that every student should have a department name. What if you put a student in the department 1234, which is presumably not a valid department. That's not a valid department name. So when you store the data, you should make sure that the data is clean, that it does not have such errors. And this is part of integrity constraint enforcement, which a file system does not support at all. You can store anything you want in files. Whereas a database management system understands that the data underneath is not just a bunch of bytes, but it has more semantics and it should satisfy certain constraints. So this is one of the major reasons for building a database system. Incidentally, in the world of big data, what happened is that people said that database systems are either cheap and free like MySQL or PostgreSQL, but then they are centralized. Or if you want to buy a parallel database system, they are very expensive. They cost millions of dollars. So people went and built their own homegrown, in some sense, data processing systems. We are going to be looking at one of them called the map reduce paradigm. Actually, it's not just a data processing or database paradigm. It can do many, many other things. So it was built for many reasons other than data processing. But the major use for it, again, it turned out was data processing. And people went and built file formats to store data and programs to access data. And they have run into all these problems, which you see in this slide. All of this has seen history repeating itself in the past five years. And guess what is happening today? The next step beyond map reduce and plain Hadoop is to build database systems, which are massively parallel, can handle big data, but provide all these features, which you see in this slide. So that is actually an ongoing area, and we will talk about it when we talk about big data. A second part of what database systems do is to deal with failures of different kinds, either because systems crash in the middle of processing something, or because multiple people try to access data, and then bad things can happen, when two people try to write the same thing. Or even if they write different things, but read each other's data and then write some other things based on what they have read, you can have inconsistencies. So the two major issues here are atomicity of updates to make sure that either a whole transaction happened, or things are cleaned up so that it appears nothing happened. And to control concurrent access so that bad things don't happen. Now we are going to look at both of these issues in more detail next week. And I will not go deeper into it right here today. And the third aspect which a database system supports is handling security by controlling access to data. So that's what a database management system provides. Now for a little bit of terminology, this is something which we will be using repeatedly. We have what is called a schema, which is the logical structure of a database. For our purposes, when we talk with relation, talk of relations and so on, the schema is a set of relations which are there. The set of attributes that each relation has. The set of types or the types of each attribute. The integrity constraints that are enforced on these relations. All of this is part of the schema for a relational database. There are other data models and corresponding schema for those models. And this is part of the logical schema. Now this is, it's important that a database system give a logical view of all this, because there's also an underlying physical view, which is exactly how is this data stored in file. And you want to separate out these two concerns. You do not want the person who is writing the application program to worry about details of exactly what is the format of the data in the file. What storage structures are used? What indices are available to execute queries efficiently? Because if you burden them with all this, it's a lot more work, first of all. And second, if you make any change at the physical level, you have to go and rewrite program, which causes a lot of disruption. So one of the very important things which databases provide is the upper level, this logical schema is shown through a query language SQL typically. And the lower levels are hidden from SQL. In fact, the SQL standard very carefully tries to avoid details of lower level implementation. So that the upper level is insulated from the lower level. Now what is the good thing about insulation? You can go change the lower level without changing the upper level. Now, somebody has to mediate between the upper and the lower level. Who does this? In the old days, it was a programmer who needed to understand both the levels. Today it's different. It's the query processor, the query optimizer, which bridges the gap between these two levels. And the very first query optimizer, which was built by IBM, was one of the key reasons between the explosive growth of database, a relational database system over the predecessor. It allowed the separation of concern and let the system figure out the best way of evaluating a query. And once people understood this, no, before this there was a problem. People thought that databases would be slow. Interpreting SQL and executing SQL queries would be slow. And in fact, they were right. But that is assuming you have very highly paid programmers who understand exactly how to write complex queries, understand detailed structure. And it takes them time. You have to pay a lot of money. But then money matters. Hardware is cheap, relatively. It was very expensive in those days. But IBM believed it would become cheap and did this. And in fact, that panned out. So today, it's very important to have efficient application development. And efficient execution of queries, but in between is the query processor and optimizer, which handles all these details. So this aspect of separating it out is known as physical data independence. Now moving on to the next thing, we have a number of data models. What is a data model? It's a collection of tools for describing data, relationships, semantics, constraints, and so on. In the first part of this course, we are going to, in the first three days, we are going to focus on the relational model. But it turns out a higher level model called entity relationship model is actually very nice for doing initial design before we move to the relational model. And so that's the next thing we will study on day four. There are other models to object based data models, object oriented, object relational. We don't actually have too much time in this course to cover it, but I'll mention some of these briefly. And then there's XML, which is a data representation more than a model. And there's also JSON and other things which people use these days, which are used to represent data, both for storage of certain kinds of data, such as documents, and for exchange of data between applications. For example, your web browser and the back end need to exchange data. They often do this in semi-structured formats, such as XML and JSON. Again, we don't have time in this course to go into detail, but what I want to emphasize here is that the relational model is not the only way. Even though we are going to spend the most time on it, it's not the only way to represent data. People do use other models. So what is the relational model? It's actually a very simple model. It views data in the abstract as a set of tables. A table has column and it has rows. And if you see the header column, it has names for these columns. And the header row has names for these columns. And the other rows have data. So here's an instructor table. The instructor table has four attributes, an ID which uniquely identifies the instructor, a name, a department name, and a salary. This is artificial. This is not what you would use in a real database system because salaries change and you probably may store it separately from the instructor relation. Department name you might store as a department code. You might store other information here, such as grade. Other stuff can come in here. But we keep it simple because we don't want to deal with too much complexity in an academic course and it complexity will distract from the principle. So we are using a simple design here. Now we will also use terminology such as these columns. The names here we will call as attribute name. The columns are also called attributes while rows are also called tuples. This is part of the legacy of the formal definition of the relational model, which use these formal concepts. Whereas the practical world use different terms and I'll come back to this later. Now relational database has multiple relations. We have two out of several relations shown here. The first is the instructor relation. The second is the department relation. You can see there is a connection between the two. The instructor relation has department name. Here for example the department name physics over here appears in the department table here. Now every department name should appear here otherwise we have a problem. That's an example of an integrity constraint. So that is a quick summary of the relation model. We are going to spend a lot more time on the model coming up after the break. But before that we also need to talk about languages to interact with the database. There are actually two aspects of this. There is a data manipulation language which is how do you write queries, how do you perform updates and so on. And then the data definition language which is how do you specify the schema. Now you can do the data manipulation either procedurally through a language such as Java or C or other languages. But what has been very successful in the database context is declarative or non-procedural language. In particular SQL has had tremendous success for the following reason. It abstracts away what needs to be done. And it enables the underlying system to optimize database access. If you wrote the same query in let's say Java or C, C++ or whatever. You're specifying what needs to be done for the database to understand that hey, maybe you have written it in one way. But there is a more efficient way to do it, it's very hard. In fact, that is an active research area for many years. And it's something that I have had a PhD students both past and current working on. It's a hard problem, believe me. It's a lot harder than taking an SQL query and optimizing it. So SQL has had tremendous success because it's a simple language. Not burdened with too much complexity which makes optimization a lot easier. And that led to it's success. And then there's the data definition language which is exemplified by this code sample here, create table instructor. And then we are saying ID is care five. Name is where care 20, I'll come back to these things later. Salve is numeric and so on. So what happens when you give this command to a database? It actually creates a table in the database and it stores all the things which you have specified here. The name of the table, the attribute name, the types. And later on we'll see constraints as part of a data dictionary. So all of this is stored in the database and then you can view it later on. And then when you store data, the database system makes sure that it has the right attributes, it has the right type, it satisfies the integrity constraint. Now, the next step after doing all this, we are going to see how to manipulate the database using SQL, the data manipulation language. On days four and five, we're going to look at database design, which is how do you decide on the schema? What are the relations that we want and how do we do this? We have to understand the application. We have to understand the needs of the application first before we can start off in the process of launching of a database scheme. So there's a business part of this. What attribute should we record? What are the real world constraints which we need to enforce? And then there's a computer science decision of how do we take the information that we need to store and break it into a set of relations, which are suited to storing that information. So the first part is something which you will be working on as part of your project. And the second part, the computer science decision is what you are modeling and normalization is all about. There's another level which is the physical design. We won't be able to do much of it in this course. But when we talk of query processing, we will be talking about indexing. We will look at this more at that point. Now, talking of database design, here is a small schema, which has ID, name, salary, department name, building and budget. Now, what are building and budget? These are things which are associated with the department. In this case, in our simplified world, a department has a building and it has a budget associated with the department. Now, this particular table has combined all these pieces of information into a single table and look at what has happened. So here we have Einstein who is in the physics department. But if we scroll down here, we have Gold who is also in the physics department. Now, the building is an attribute really of the department. It has nothing to do with the instructor in our model. In reality, the computer science department here lives in two buildings or three buildings and the building might be associated with instructor. But in our simplified model, it's associated only with the department. Like I said, what constraints there are depend on the real world. It's not independent of the real world. So in our artificial real world, the department should have a unique building and unique budget. If physics appears again down here, it better have the same building and the same budget. But once you have a table like this, there is a potential for mistakes. The moment you replicate information and replication is called a redundancy here. The moment you have redundancy, initially things may be okay. But tomorrow when you update the budget of the physics department, somebody may go and update this one but not that one. And now not only are you storing data unnecessarily, but you no longer know which is the correct one. The budget was set to 80,000 there and 70,000 here. You don't know which is the right one. You have two things. You have no idea which is right. So that can lead to problems. As I said, we are going to look at entity relationship models and normalizations. We will come to it later. Now the second part of the course is on database system internals. And there we are going to be looking at three different levels. The bottom level is the storage. How is data stored on file on disk and so forth? Then there is the storage manager which deals with the interface with the actual storage and bringing data from disk into memory and so forth. And then there is the query processor up here which takes queries submitted by users or applications and figures out how to interact with the storage manager in the best possible way to answer the particular query or to perform the update. So don't worry about the details in here. I put up this slide to indicate that there are a lot of issues in how to do this. And we will be looking at it later. So to wrap up this first session of this morning, I want to spend a couple of minutes on the history of database system which shows how this has been evolving and what is happening today. If you go back to the 1950s and 60s, those are the days when data processing had a huge boom and people used magnetic tapes to store data and big corporations which had thousands or tens of thousands of employees, for them this was a godsend that led to companies like IBM and many others which have vanished now becoming really large companies. But IBM led a big change in the next generation of database system. In the 1960s and 70s, IBM pioneered the use of hard disk. Before that, in the 50s and early 60s, hard disks were either not used or their capacity was very, very, very small. They were no good for data processing. But IBM pioneered larger hard disks which now became viable for storing data about employees and so forth. And they pioneered new ways of storing and accessing data on hard disks which led to data models called the network and the hierarchical data model. It turned out these required a lot of work to program with an access and so forth. And state card said that look this is not the way to deal with data. We need a higher level view of data and he proposed using the relational model. He did not just propose, you know, say here are a few tables, but he also came up with first the whole theory behind this whole thing. He developed the theory of the relational model including normalization theory and other related tools. And he also pioneered the use of a high level language. At that time it was called SQL for structured English query language and he went around evangelizing it saying that this is the way you want to interact with data. And that led to a lot of people thinking that hey maybe there is a new way of doing things and we should move beyond the old model. And this eventually with the help of many others at IBM who built the first relational database system in the first query optimizer and later people at Oracle and the University of California at Berkeley and then many more other led to the explosion of the relational model which overtook everything else. And Ted Card won the ACM Turing Award which is the Nobel Prize in Computer Science for this work. And along with this people started building high end mainframe systems for high performance transaction processing. Adline booking became a big thing at that time to allow worldwide booking of tickets from anywhere in the world through a terminal. This was revolutionary. This was not something which was possible in an earlier era. So there was a boom in this system. In the 1980s relational research prototypes evolved into commercial systems Oracle being the dominant one at that time. Even now it is dominant. SQL became an industry standard. Parallel and distributed databases took off and they have widely used these states. And then there were many people who started looking at other data models, object oriented databases, object relational databases. It turns out that some of these have found a niche but they never overtook SQL in the way people thought they would. So that was the 80s. In the 90s parallel databases started taking off and there were companies which really required many, many machines running in parallel. They built multi terabyte parallel data warehouses. Warehousing in fact became a big topic. Online analytical processing became a big topic in that time. And databases became very important for decision makers and companies. Earlier they were used to record things like airline booking. Now people are using it to decide what products the company should manufacture, what products different places should stock, try to predict what will the trends be in the next year so that the right things are made and stock and so forth. So decision support became a big business in this era. In the 2000s there was an emergence of XML and Xquery standards. People thought that that would supply relational databases but that did not happen. But something which did happen is that automated database administration took off and later in the 2000s big data really took off. What do we mean by big data? I would view it as any data that is so large that it cannot be managed on a centralized system or a system which just has a few CPU. It requires many, many machines put together to handle that volume of data. Now different people have different definitions of big data. I do not want to get into those radius definitions. But to me it is anything which is big enough that you need massive parallelism. And people built tools for handling big data. Initially using file system and map reduce Hadoop but later other systems like big table P nuts and so forth. So that is a summary of what I wanted to cover in this session. With that we will end the first session. Thank you and see you again at 11 o'clock sharp. Please don't be late. Thank you.