 This is Introduction to Database Systems. Those of you familiar with the class may know that it's usually taught by Andy Pavlo and as I'm sure some of you may have realized, I'm not Andy Pavlo. My name is Andrew Crotty, I'm a post-doctoral researcher here at CMU and my research focus is mainly on database systems and data management. You may also have realized that my co-instructor for the course, Lin-Maz, also not Andy Pavlo, he is a post-doc here as well and he's one of Andy's former PhD students. So at this point you might be wondering where Andy is and he's actually on leave this semester because he's filming a Netflix documentary that's a follow-up to last year's Tiger King. Unfortunately, I can't say anything more because of the NDA that I signed, but you should definitely check this out when it's released. So with that, let's move on to the course material. One last thing before we begin, I want to thank Google BigQuery for helping us with course development this semester. I'm sure all of you know what Google is. But BigQuery specifically is a database product. It's a fully managed cloud data warehouse, we might call it Platform as a Service. And essentially you give them your data and then you can query it, run all sorts of analytics, machine learning, that kind of stuff. And at the end of the semester, there's going to be someone from Google who comes here to give a guest lecture and tell you about kind of all the cool stuff they're doing. And you'll be able to see a lot of the topics that we covered in the course actually applied in practice. And they'll be able to talk more about kind of the problems that come up when dealing with data at this scale. So today's agenda, we're just going to go over the kind of beginning, what is this course, course logistic stuff. And then there'll be sort of like a mini half lecture at the end where we'll just kind of go through some of the core concepts that we're going to cover in the course. So depending on timing, we'll get out a few minutes early. To start off with the wait list, I do not control the wait list. It is way above my pay grade as students are going to be moved off the wait list based on your position as new spots in the class open up. So if you send us emails, there's nothing that I personally can do about it. It's just going to be based on order. And if you're not currently enrolled in the course, the likelihood that you'll get in and things will probably firm up over the next week or so. But unfortunately, you might not be able to get in. But we do plan to make all of the course materials public. We're going to put the lecture videos on YouTube. And all the material will be there so you'll be able to follow along. In terms of lecture rules, please interrupt me at any time. If any of these things happens, either I'm speaking too fast. If you don't understand something that I'm talking about, or if you have a database related question, chances are that if you have a question, then someone else does too. And if this is kind of a big room, if I can't quite hear you, or if I don't see you right away, just try to interrupt me at a good point where I can take a question. So just at a high level, this course is specifically about the design and implementation of database management systems. So that means we're going to be looking at the software that manages databases. So this isn't a course about how to use a DBMS to build applications. So imagine if you had a building a web application and you wanted to store your user data in a database, this class is not about that. And it's not a class about how to administer a database system. So there are all sorts of jobs out there where people's entire roles to essentially manage the software that manages the data. And that's also not the topic of this course. So there are other courses, unfortunately, one of them is not being offered this semester. And I don't know if it's offered anymore. But that fun course is the one that covers those topics. So just at a high level for the rest of the semester, the topics that we're going to be covering are going to be starting with a relational model, and that's a data model for databases. And that's going to be the main theme for the entire course. So we're going to talk primarily about relational database management systems. There are all sorts of other data management systems, no SQL. Some of you may have heard of graph databases, all sorts of things. We're going to be focusing on relational databases in this course. And we're going to sort of build up from the bottom level all the way up through the stack to look at the different parts of a database management system, starting with the storage, how you actually physically store the data in memory and on persistent storage, hard drive, and SSD. Then we're going to move up and talk about query execution and kind of how you can either put data into, update the data that you've stored or retrieve the data. Then we're going to move on to kind of concurrency control. So imagine you have several processes that are concurrently trying to access the database, how can you sort of keep all of those together and manage them. The next piece that we're going to cover is recovery. So if your database crashes or if the machine crashes, how can you kind of make sure that your data is safe and persistent and you can kind of recover from those types of situations. And then towards the end of this semester, we're going to cover some more advanced topics, distributed database management systems, and kind of just a general review of some advanced topics. So in terms of logistics, the course policies and the schedule are all on the web page. Academic honesty, I'm going to talk about this in later in the slides. But please don't do anything stupid if we catch you plagiarizing or copying work or those sorts of things, it's going to be a big problem. I'll have to do a lot of paperwork and it's going to be a big problem for you. So just please, it's easier if you don't do it, if you have any questions, if you're not sure about anything, please either send an email to me or Lynn or one of the TAs and just ask, like, is this thing that I'm doing okay? And you can also refer to the CMU policy page. Ironically, the link there is broken. I checked this morning, their web page is down. But there's a cached version somewhere or I can make it available to you if you email me. So all of the discussion and announcements for the course will be on Piazza. And if there's something that you need to contact either me or Lynn about just privately, a sensitive subject, please send us an email and we can coordinate something to speak. So the textbook for the course is this Database Systems Concepts book, also known as the Sailbo's book, there are. So I'm not sure about the exact difference between the sixth and the seventh edition, I think all of the material that we're going to cover in this course is available in the sixth edition. If you have that or want to buy that version. I don't think that the seventh edition is available as a hard cover book anymore. I think it's loosely pages that you get something with a free whole punch in them. But we'll also be providing lecture notes that we'll cover all of the topics as well as things that we cover in class that aren't covered in the book. So in terms of grading, this is going to be the breakdown for how your grades are calculated, 15% for homeworks, 45% for projects, 20% for the midterm exam, and 20% for the final exam. So it's pretty heavily weighted towards homeworks and the projects. For homeworks, there are going to be five assignments that are spread out over the course of the semester. The first homework is a SQL assignment. So SQL is, we're going to talk about it a little bit at the end of this class. And in the next class, it's a query language that you use to interact with the database management system. So the first homework assignment is going to be able to write some SQL queries. To answer some questions from a sample database, we'll give you and the rest are going to be paper and pencil assignments where you kind of apply formulas or solve different conceptual problems for topics we cover in class. And again, I mentioned earlier, all homework should be done individually. You shouldn't be sharing answers or doing group assignments. They're all individual assignments and should be done individually. We'll be checking for kind of copied solutions and things. So the projects are focused around this idea that you're going to build up your own database engine over the course of the semester. So like I said, we'll start at the lower levels and we'll work our way up to the stack. But what this means is that each project is going to build on the previous ones. So for example, you have to implement a storage manager earlier on. The later projects may interact with that storage manager to get data in and out. So for that reason, it's important that you kind of keep up and go along if you get too far behind on these early projects. Then it's going to be really difficult or impossible to catch up later on in the semester. So an important thing, and this was kind of already mentioned in the announcement for the first project. But we're not going to be teaching you how to write or debug C++ code in this class. It's a prerequisite for the course. We expect this is a higher level course. We expect that you know kind of C++ as a prerequisite. So if you go to TA office hours and we're asking things like what is the stream? Or what does the shared pointer, that kind of stuff? We've told them not to answer those kind of questions. So we expect you to be comfortable working with C++. That's kind of why this project zero is released. And it needs to be completed by September 13th for you to continue the course. So it's kind of meant as like a self-check kind of thing. Where if you're really struggling with project zero, then this might not be the course for you. Because later on, you're not going to be able to do the more complicated programming projects. So all of the projects for the course will be implemented using the CMU DB Group Bus Hub Academic DBMS. It is kind of an academic system, so it's not really a full-fledged database system. But it has these kind of modular pieces and we'll go through some of these things and make more sense later in the course. But it has kind of this disk-based storage architecture, where as I said, it can read and write data from disk. It has a volcano-style query processing set up where you can kind of chain together iterators basically to iterate over your data and answer queries. It has plug-able APIs that will leverage to be able to implement the different pieces of the system that you're going to implement for the projects. And it currently does not support SQL, so you won't be interacting with it through SQL, you'll kind of be interacting with it through lower-level test programs that. Okay, so the next thing is the late policy. You lose 10% of the points for a project or homework for every 24 hours late that you handed in. So if you hand it in after the deadline, we'll round up. So if it's four hours late, then it's going to be 10% off and we'll round up the next full day. You're going to have a total of four late days over the course of the semester to be used for projects only. So not for the homework, it's only for the programming projects. And you're free to allocate them as you wish. For example, you could turn in one project four days late or four projects each one day. The only thing that you cannot use them for is the project number zero that has to be done by September 13th for you to continue the course. There's a pandemic still going on, I know that there's a lot of uncertainty and things that can come up. So we're also going to grant no penalty extensions due to extreme circumstances, like a medical emergency or a family emergency. Just if something comes up, I realize at the time you might not be able to, but as soon as you can, just contact us and see what is going on and we can try to make an arrangement for you. Okay, so again, I want to reiterate the plagiarism warning. I know this is redundant, but I really want to get this out there. And so you've all seen it and there's video of me telling all of you about it. So later on if we run into any problems, there's no excuse for not knowing what this is. So the homework and the projects must be your own original work. They're not group assignments, they're not meant to be done together in groups. You cannot copy source code from other people, your peers or the web. In general, there's not a, since this is kind of an academic system and we change the projects each semester, there shouldn't be like one-for-ones solutions to the projects that we're going to give you online. But I still, if there's any source code out there, please don't copy it. We're not going to tolerate plagiarism and it's going to be a big problem. So please do not do it. So kind of the last administrative thing is office hours. We're still waiting on a clarification from the university about what we're supposed to do for in-person versus remote office hours. So you may notice the website does not have either the instructor or TA office hours yet. But as soon as we get this, hopefully in the next day or two, we will update you. So we'll put it on the website and we'll send out a message to the class. If you need to contact us sooner about anything at all, please just send us an email or you can reach out. If it's something that's a more general question, you can reach out on Piazza. Okay, so for the fun stuff, database research, there is a vaccination database tech talks seminar series that will take place, I believe on Zoom Mondays at 4.30 p.m. starting on Monday, 9.13. There's a link to this schedule there. You can access it. And we are going to have a whole bunch of excellent people from each of these database companies or projects come and talk about their system. And these aren't like marketing people that are just going to say, use our system. They're technical people that are going to come and give technical talks about the actual internals of their system. So low-level stuff, kind of the design stuff that we're talking about in this course. So if you're interested in that, please check it out and you can check out the website there. So that is kind of all of the administrative logistics part of this lecture. So if there are any specific questions so far, I'd be happy to answer them and then we can move on to the fun database stuff. Okay, and again, all of this stuff that I mentioned should be available on the website either currently or, as I said, in the case of office hours and stuff shortly. So you can check there if you have any questions, just send us an email or a response. Okay, so databases. Let's start with, what is a database? Just out of curiosity, how, like you can just raise your hands, how many of you regularly interact with a database management system databases? It doesn't have to be like command line, writing SQL queries, no one, no databases, all right. I find that a little hard to believe. So SQL Lite is a database management system. It is probably the most widely deployed database management system in the world. Pretty new to have a smartphone, it's used in iOS and Android. Web browsers, I don't know if you use web browsers, there's Chrome and Safari both use SQL Lite for storing data. And if you've ever made a call with Skype, Skype uses SQL Lite behind the scenes. So I guess the point I'm trying to make is that pretty much behind every application anywhere where you need to store data, there's probably a database management system running in the background, can't wait, thanks for you. So a database is essentially an organized collection of interrelated data where you're trying to model or capture some aspect of the real world. So if you think about like a real world entity, for example, if I have a database to model students in the class, each individual student is a data point that would be stored in my database. So just as a practical example, let's create a database that models a digital music store or something like Apple Music Store, iTunes Music Store, or something similar where you want to keep track of, for now to simplify things, we'll just keep track of artists and the albums that they release. So the things that we're going to need are, you know, to know about individual artists and kind of a record of over time the different albums they release. And the way we can do this just as a strawman sort of example is using fly files. So imagine that we want to store our database in just a comma separated value CSV file kind of where each row or line in the file represents one, you know, data entity, so an artist or an album and kind of each column is going to represent an individual attribute about that entity. So, you know, we don't know anything about any kind of, you know, database management systems that are out there like MySQL, PostgreSQL, SQLite, any of that stuff. And we don't know about that, we're just kind of building an application up ourselves. And what this is going to require is that the application needs to parse the file each time it wants to read anything and it's going to have to, you know, read a line, split the CSV and kind of figure out what's going on every single time we want to access the data. So, again, here's just a simple example using the music store where we have two essentially tables stored as CSV files. So there's the artist table where we have the name of the artist, the year they first started releasing music in the country they're from. And in this second album table we have the name of the album, the artist that released it and the year it was released. So now let's say that we want to issue essentially a query on the data that we have and we want to know the year that IceCube went solo. I don't know if you know that IceCube had some disagreements about the amount of money he was making so he decided to go solo and start releasing albums on his own. So I guess the way that we would do this if we have these CSV files is you write some kind of like a Python program that looks something like this where essentially you open up the file and then you iterate over a reliant file. Now what we have to do is take the line, remember it's just a CSV string, you have to parse it into a record and then what we want to do is check if the value at position zero in the record is IceCube because we're looking for one IceCube went solo and you want to print out the year so we're going to cast the second value of the string between hints and return that. So I guess what's the problem with this? It seems pretty simple, pretty straightforward, I mean I guess your query could change, maybe you want to know when Natura's BIG went solo or sort of releasing music, you could just swap out IceCube for a different name and get a different result or if you want to know the country that someone's from just change the position that you're printing out there. So I guess this seems fine but can anyone think of any issues that might come up if you're managing things this way? That is true so if the data gets really large then what you're going to have to do is kind of open the file and iterate over all of the lines here. I guess what you could do is modify this a little bit to as soon as you know that there's only one record you're looking for in this case, you could modify it a little bit to just return as soon as you find it but that's certainly true. If you're not careful with how you code this then you might end up iterating over every single line in the file every single time. That is another problem that can come up so that the statement was that there could be duplicate entries for IceCube and there's no way to ensure using this method that someone doesn't come in and you know put in another record for IceCube and there's no way to maintain kind of the invariant we want, only one record to represent IceCube, yes. That's another problem. Any others? I have a whole list I can go through. Okay so the first problem or the first I guess series of problems relates to data integrity so one problem might be is how do we ensure that the artist is the name, the name of the artist is the same for each album entry. So if you remember in order to figure out you know kind of which artists, several ways to which albums we store the artist name in the album CSV file so imagine that you type in a wrong name to update one of the album rows or an artist changes their name or that sort of thing so how do you ensure that there's consistency between the name that an artist lists in the artist file and the name of the artist lists in the albums file. Another problem that come up is you know what if someone overwrites the album year in the album file in invalid strength so in that code I showed you're expecting an integer that represents a year when the album is released. If someone comes in and puts in just a regular string then your code is going to break, there's going to be a problem in parsing it. And again there's no kind of I guess you could have access control on your files but you know once someone's able to get in the file they're just making a mistake including the invalid value for a particular field. What if there are multiple artists on an album so everything I showed you know each album only on an individual artist so it's just one field in the CSV string but you know imagine that there's a multiple artists on an album like a mix tape or something. So in that case I guess you know you could change the format instead of being a single string you could put you know a comma separated list inside the field of the CSV file but then you're you know complicating your parsing code. You have to go back and adjust your parsing code to now handle you know potentially one or more values in that field. And kind of the last problem and these are just a few that I you know came up with here but there are many many more but what happens if we delete an artist that has albums. So imagine that you know you deleted the record for Ice Cube and the artist file and now he has a bunch of records or a bunch of records representing albums in the album table and now there's no artists committing that they wanted to. Okay so that's kind of the date data integrity side from an implementation implementation perspective and this was one of the points made is how do you find a particular error so again example. You end up in the code that I showed iterating through every single row every single time. You want to find something you there are all sorts of ways to optimize it for example if you know the data is sorted you could do like binary search or something. Or if you have a data structure that stores it maybe like a hash table or something you could index directly to the record you're looking for but just you know all of these things you're building up additional complexity in your Python program that you have to write to access the code so all of that kind of implementation efficiency needs to be taken into consideration when when you're deciding your application. Similarly what if we want to create a new application that uses the same database so the database is stored in the CSV files. Imagine that you know someone else for some other purpose wants to come along and access the same CSV files concurrently. You want them to have to rewrite kind of all the Python code that you've had built up or maybe need to do it in a different language. But the point is that kind of you've you've had to do the work to build up this application and now it has to be replicated every time someone else wants to access the same data. And even for you you know like I said if you want to change the query or change things that you're looking for you run into the problem of having to kind of reimplement things from scratch every single time. So kind of another implementation question is what if two threads or two programs I try to write to the file at the same time. If I'm editing the file and you're editing the file and we both save our work who knows what's going to happen. You know one of us might overwrite the other or maybe both of our rights get partially written and now you have kind of this garbage record in them. So what you can end up with is kind of all of this you have to implement some kind of concurrency layer to manage concurrent accesses to the data. So the final piece that I want to talk about is durability, which means like what happens if the machine crashes when when program is updating records. So you're doing right power goes out or something or the program crashes and you know now what happens to the data is it consistent. Do you finish writing the whole record you wanted to write or is it kind of half written now you have again this garbage garbage value in your your days. What if we want to replicate the database on multiple machines so they can have high availability. How do we handle kind of current rights to different machines. Because now machines need to be synchronized across them and kind of the list goes on and on and on with all of these problems that come up. So that's kind of the reason why we want to build a database management system so we don't have to handle all of this complexity all of those different types of problems that I mentioned in application. Okay so I explain what a database is the software that manages database is a database management system. Sometimes people use the words interchangeably I will try not to because it's confusing but a database management system or DBMS is a system software that manages the database and allows applications to store retrieve and analyze information that's stored in the database. So rather than managing things in these CSV files you can access them through the database and it kind of gives you all of these nice properties that avoid the problems that I mentioned before. And a general purpose DBMS is designed to allow you to define, create, query, update and administer databases in a generic sense. So if you have again an application for managing students in a class or for managing an online music store it's general purpose enough that you can program your applications against the system. So this is really good for like business reasons for your business or startup or your organization because the purpose, the value add that you're bringing me is not that you can store data, lots of people can store data but what a database management system allows you to do is focus on the core aspect of your business rather than worrying about all of these issues surrounding actually managing the data. And DBMSs are widely tested and deployed so that they find all the bugs, the concurrency bugs, the consistency bugs, all of those problems that come up because kind of the chances that every single Python program that you write is going to be 100% bug free all the time are pretty much zero. So kind of the whole point of a database management system is that you can leverage kind of this core set of functionality for managing databases so you don't have to do it manually in application. So database management systems are not new. The first one came out in 1965 at General Electric was called IDM and essentially they were used kind of to manage data but they were really writing applications, they were really difficult to build and maintain. And that's because people were working at a very low level and there was, you know, really tight coupling between the logical layer so kind of the data model layer and the physical layer that like implementation right into this, so kind of the big problem is that you had to know the queries that you wanted your application execute before you deployed the database and they were just really cumbersome and not easy to work with. Kind of the things that we're going to talk about over the course of the semester might seem obvious in retrospect but at the time, you know, people were struggling to build these applications and they were kind of controversial, the ideas were kind of controversial. So there was this guy at IBM named Ted Cott, who kind of observed that people were re-implementing the same things over and over again, reinventing the wheel and making a lot of mistakes along the way so he proposed kind of this high level idea called a relational model. And there is this original, it came out as a technical report in 1969 but nobody reads this one, this is the one that everyone references, it came out a year later in 1970. It's a paper called a relational model of data for large shared data banks. And you know it's old because people, when they wrote the text, they spelled data and base as two separate words and how many have it combined as one. But kind of the core idea that he proposed was a relational model and he actually won the Turing Award in 1981. That is, it's like the Nobel Prize for computing, it's like the highest honor you can receive in computer science and he won it for the relational model. And as I said, it was kind of controversial at the time. You can look at it now and say, oh yeah, sure, that makes sense but at the time it wasn't so obvious. So kind of the relational model at a high level is a database abstraction to avoid a lot of low level maintenance and problems that people are running into writing database applications. So the kind of three main points that we're going to touch on are that the databases should be stored in a simple data structure called a relation, where a relation essentially represents the relationship among attributes stored in a table. So like the relationship, if you are thinking about the artist's example, the relationship between the artist's name, the year that they started producing music and the country there. So the relationship between those attributes. The next piece is that you should access data through a high level language, rather than kind of telling the database management system explicitly how you wanted to get the data. You could just specify what data you wanted and let the database management system figure out kind of the best strategy to do it. Because you know if you're building a piece of software that specializes interactive databases, it should be pretty good at figuring out the best way to get you the data you want. And finally, the last piece is that the physical storage of the data, either at the time on disk or secondary storage. Now memory also should be left up to the DBMS to implement. So you don't have to worry about whether it's laid out in a row format in CSV file. It's just completely abstracted to you. You just know what the data is that's in there and let the DBMS kind of figure out the specifics of storing it. And you know at the time, like I said, people argued that a DBMS would never be able to generate a query plan as efficient as kind of what a human could do. Kind of like how people used to argue about compilers, no one could ever produce code that's no computer program product ever produced code that's as efficient as what assembly a human could write. Now almost no one writes at that level. So kind of it's the same idea. With databases, there was a lot of human effort that went into designing programs to efficiently access it because they thought that you couldn't design a system to do it efficiently. So I mentioned that the relational model is a type of data model. So the data model is just essentially a high level collection of concepts for describing the data that's stored in a database. So it's like a high level abstraction for how we're going to represent the data. Now a schema is more specifically a description of a particular collection of data given a data model. So what does that mean? It means that the schema defines exactly what we're going to store in the database. So for example, in the music store example, the schema would be the artist table followed by the different attributes that are associated with an artist. So that's the schema that you would define that describes the data that's stored in the database. So there are, as I said, lots and lots of different data models. Most DBMSs that you may have heard of are based on this first relational data model. And for a lot of reasons, but probably the biggest one is that the relational model is probably the most flexible of all of them. The relational model can kind of model all of these other, or you can represent all of these other data models using the relational model. And the exact API or what's going on behind the scenes might not be as efficient, but it's still general enough to be able to handle all of these other models. So NoSQL is a popular term. It means a lot of things to a lot of different people, but just at kind of a high level, like for example, it's more than just a data model. A lot of people, when they hear NoSQL, think about things like transactions or consistency or not having certain APIs. But just in terms of the actual NoSQL data model, it covers kind of this key value. Stores, graph, database management systems, document databases, MongoDB, and kind of broader column family databases. And kind of these data models are more restrictive than what you get with the relational model. And they don't give you as many, as I said, kind of guarantees in terms of different properties that we're going to discuss over the course of the semester. There are some, I'll finally say, there are some application domains where these data models might make sense. So imagine, I don't know, you're storing video data or you're storing log data or something. You just want to shove it into a key value store that's perfectly valid and might be better for that particular type of application than a relational database. Array and matrix database management systems are kind of specialized more towards machine learning or scientific applications. There are a few, side DB, title DB, but they're not really widespread and kind of narrowly focused on these sorts of use cases. And in the last group, hierarchical network and multi-value, these are kind of either obsolete data models that people tried out in the past and they found didn't work well or they had other problems. But they might still be hanging around in legacy systems. So for the purposes of this course, we're going to be focusing exclusively on the relational model and how database management system works in that context. So what is the relational model exactly? The relational model defines the relations inside the database. So like I mentioned, the individual tables. Sometimes in SQL, they're called tables. In the relational model, they're called relations. I'll probably end up using them interchangeably. But you can think of them just concretely as those examples from the music store. There's the artist's relation on table and the album relation. And the structure of the relational model defines those relations and the contents inside them. So the next piece, as I mentioned, is integrity, which ensures that the database's contents satisfy some constraints. So with the concern about whether or not there could be a garbage string value inserted or overwritten on an integer year, kind of the integrity aspect seeks to mitigate that problem by forcing all of the year values to be integers and forcing that property. And finally, there's the manipulation aspect, which is the programming interface or API that you use to access and modify the contents of the database. And this is what we're going to talk about later in the lecture with relational algebra. So relational algebra is a programming interface for interacting with the relational model. So again, kind of just going through the example of relation is an unordered set, so that's important. There's no order to the values. We don't necessarily care what order they appear in the database. They could be sorted. They could not be sorted. It doesn't matter. It's just a set. And it contains the relationship of the attributes that represent entities. So as I said, a relation or table has many entries that are called tuples, tuples. Some people say either is fine with me, but I'll probably end up saying both. So tuples or records also in a table. Tuple or record isn't too changeable. There's another word that sometimes people use called rows. I will try not to use that. I might accidentally, but a row implies something specific about how the data is stored. So in the CSB example, you know, every data restored is an individual line. But the tuples and records represent individual data entries in a table. The values are normally scalar values, things like integer strings. That's not necessarily true anymore. The original specification, they all had to be kind of these scalar values. Popular systems have started relaxing to store things like I store an array, maybe in an individual column or like a JSON document individual individual. And there's also this notion of a null, which is a member of every single domain can be any attribute can be set to null. And it's not exactly like null in a null pointer or something, but it's used to signify that we don't know what a particular value is. So for example, if the country that ice cube is from null, it just means that we don't know what that value, we don't know the specific value for that record. So again, just a relation is a mathematical term that represents this unordered set and the SQL equivalent is a table. So in an array relation, if I say an array relation, it means that there's a table with n individual columns and individual attributes. So this is a free columns. Okay, so a big piece of the relational model is that every relation should have a primary key that uniquely identifies a single tuple. So some DMMSs will automatically create this for you if you specify it. They'll do it behind the scenes, kind of think about like an auto-implementing unique integer for keys. So in this table example here, there's not really anything that we can use to uniquely identify. I guess, I mean, you could potentially use the name of the artist to uniquely identify it, but there's no guarantee that all those values are going to be unique. So what you can do is add kind of this surrogate primary key that's essentially just an integer. And like I said, it can be auto-generating or random or whatever, kind of just to keep each record or tuple unique. So foreign keys are related to primary keys in that they specify that an attribute from one relation has a mapping to some tuple in another relation. So just concretely, if we have this artist key here and we have the album key, we know that the artists, so the new primary key ID numbers that we've added to the album table, the artists, one, two, three, seven, eight, nine, reference artists stored, tuple stored in the artist's table. So if we want to know who released one of these albums, we can go look up, okay, what's the number of the artist, and then go look up which artist that is in the artist's table. But there's a problem we can run into, which is an engine earlier, which is what if you want to have multiple artists on a single album. So again, you release like a mixtape or something. We have a lot of different artists to cubic ink, but we only have this one attribute to store the artist ID. And the solution that we caught with in the relational model is to create this third table, sometimes called the join table, which is going to join or link the artist's table with the album table. So what we're going to do is we're going to get rid of that artist column in the album table and instead create this new artist album table that has only two things in it. So it is an artist ID and an album ID. And now if you look at it, we can figure out, okay, here is how all of the artists link to the albums that they released. So in the case where you have, you know, one artist on an album and there's only going to be one tuple in the artist album relation, in the case where you have multiple, you'll have now multiple tuples. So data manipulation languages or DML are the specific methods that we use to store and retrieve information from a database. And they come specifically in two flavors. So the first is procedural and that's that a high level query should specify how the DBMS should compute the answer to our query. So the keyword here is how you're telling the database how you want it to retrieve the data. And this is going to be a relational algebra that we talked about in a couple of slides here. You're saying specifically how you want the data to be retrieved from the database. The alternative is a non-procedural or declarative language where you only specify what data you want retrieved from the database. You know, you don't tell the database management system how you want to get it back. You just say what data you want and then the system goes and figures out how to do that for you. So an example of this non-procedural or declarative language is relational calculus. We're not going to talk too much about relational calculus in this class. It's important, it's really important for query optimization, but it's not really in the cover in this class. Don't worry too much about it. Relational algebra and relational calculus are ideologically equivalent, but we're, like I said, only going to be focusing on this first one. Another example of a declarative language is SQL, which we're going to talk about next class. So relational algebra defines the original specification, defines these seven operators. And they're all fundamental operations that you can use to retrieve and manipulate the tuples in a relationship. These are the fundamental operators proposed by Kotlin. And they're based on mathematical ideas set algebra. So each operator listed here is going to take in one or more relations as input, and it's going to output a new relation. So in order to build a query to get the data you want out of your database, you can kind of chain these operators together to create more complex operations. So we'll just kind of go through each one starting with select. Sorry, is there a question? No. So starting with select, essentially the goal of the select operator is to choose a subset of the tuples from a relation that satisfies a particular predicate that you provide. So the predicate, you can think about it acting like a filter or kind of like an if statement that's only going to retain the tuples that qualify for the user-specified predicate. And you can, of course, combine multiple predicates using conjunctions or disjunctions to get exactly the subset that you want. I think it's called restrict in the original relational model paper, but everyone calls it a select. The symbol is the lowercase sigma in relational algebra. So just as an example, we have this really simple relation here. R, it only has two attributes or columns, AID and BID. So if we want to, for example, restrict or filter the relation to only cases where AID equals A2, then we apply that predicate and we get back just that subset that satisfies the predicate that we specify. And as I said, you can kind of change things together and say, okay, give me all of the records where AID equals A2 and BID is greater than 102. So in this case, you only need to get that one tuple. And then if you're wondering how this works with SQL, I apologize. I didn't come up with the naming convention, but the select relational algebra operator maps to the where clause in a SQL statement. It might be a little confusing, but select in relational algebra maps to where clause. And again, talking about SQL next time, so we'll go through these. But the projection statement is essentially going to generate a relation with tuples that only contain the attributes that you asked for. And what does that mean? You can do all sorts of things like rearrange the order of the attributes. You can manipulate the values of the attributes, do different types of modifications. It's specified with lowercase pi symbol and essentially a list of all the different modifications. So here is just an example query where, again, we're doing the selection first. So I mentioned you can change them together. We're doing the selection first to get only the records that have AID equals A2. And then we're performing a projection where we're going to modify B by subtracting 100, and we're going to swap the order of the ID and AID. And that's kind of the result you get. So again, I didn't come up with the names, but the projection operator maps to the select clause of the SQL statement. So the selection operator maps to the where clause, the projection operator maps to the select clause. The union operator is essentially going to generate a relation that contains all the tuples that appear in either or both relations. It's just like a set union. So for example, imagine we have another second table here, S, which has the same schema as the first one. To perform R, union S is going to give us essentially all of the tuples that appear in R, all of the tuples that appear in S. And the SQL syntax for this is a little bit different. It's union all in order to get potential duplicates because of a difference between set algebra and like a bag or multi-set algebra. So you can see here that there is the duplicate that appears in the union. So in this example, the output is ordered, but that's again not necessarily going to be true because it's a set that could appear in the output relation to appear in any order. And the answer is still considered correct, so you can't count on the ordering error. The intersection operator is going to generate a relation that only contains the tuples that appear in both of the input relations. So it's R intersect S. Again, it's the same example where R and S both have the same schema. And what we want to do is find the intersection. It's going to give us only A3 appears in both R and S. So you can do this in SQL using the intercept operator. So just as an example of when this might be useful, if you think about the music store example, maybe we have one relation representing rap artists and another one representing rock or country artists. And you want to know which artists have both rap albums and rock or country albums. So you would just do an intersection of the two relations to get that answer. The difference operator is going to generate a relation that contains only the tuples that appear in the first relation and not the second relation. So in this example here, you end up with only the unique ones that appear in R and not S. And you can do that in SQL using the accept keyword. Okay, the product is essentially going to generate a relation that contains all of the possible combinations of tuples from the both input relations. So sometimes called Cartesian product is the product of two relations. It's going to be all the pairwise combinations. So essentially what you're going to get from doing the product of R and S is kind of all of these pairwise combinations of tuples from both relations. This might seem kind of useless, but it does show up sometimes, you know, if you want to generate all the possible combinations in two tables. But more importantly, what this is going to be used for is just from a conceptual or theoretical perspective. It's going to let us model the next operator we're going to talk about, which is a joint. So you can get all these pairwise combinations. This is the syntax in SQL. You can get all these pairwise combinations, but now let's say that you just want the pairwise combinations where your primary key that we talked about in your foreign key match. So you have two keys that match. So the joint operator is going to generate a relation that contains all the tuples that are a combination of the two inputs where there's a common value for one or more attributes depending on what you specify. So what we're going to do is we're going to look for matches here. And again, these two tables have the same schema, but you could do it in an arbitrary case. In the music store example, you have the artist's relation, the album's relation, and they share the artist's primary key that they join on. So what's happening here with the joint operators, we're only finding the matches, like I said, that appear in both the R table and the S table. So like I said, this conceptually, if you think about what the product operator does, you could produce all of these pairwise optimizations. So you produce all these pairwise combinations and then filter it down to just the set that matches where the keys match. But of course, you never do this because, you know, you have to first enumerate all the different combinations and then from the filter. So there are different optimizations you might want to do to short circuit this, but from a theoretical perspective, conceptual perspective, it's useful to think about it in this way. So the SQL syntax for this, and again, we'll go over these in more detail next class, but the SQL syntax to do this joint here is the natural throat operator. So over time, like I said, there have been extensions to the original relational algebra where they built it more operators. There's now a bunch that people have added, rename assignments, duplicate elimination and negation, imagine you want to count how many albums a particular artist has released. Sorting, so you can actually impose some kind of ordering on the unordered set and set division. So there are analogues for all of these in SQL that we'll talk about. Don't worry too much about them. They're not going to be super important for the course. So just an observation about relational algebra, kind of we've gone through all the operators, but it's still defining more or less the high level steps that are necessary for how to compute a query. So essentially, we're telling the database management system exactly how it should go about computing query. And in the example here, we have two more or less equivalent queries. On one side, you're doing the joint between R&S first before applying the filter to only the tuples that have BID 102. And on the other side, you're filtering S to only the tuples that have 102, BID equals 102, and then performing the joint. So kind of these two, while they're going to produce the exact same answer, one might be a lot more efficient than the other. You know, if you have, again, a billion tuples in the relation, you might want to do the filtering first. So you get down just one rather than having to, you know, find all the billion matches and then do the filter. So in this way, the relational algebra is still procedural and not declarative. So really what we'd like is a declarative language like SQL for existing the database. And that's kind of this idea where you want to state that the high level answer or exactly what you want the database management system to compute for you, rather than telling it how it should go about doing it. And essentially, you can just leave the low level details about, you know, how to actually do it to the database management system. And if it's well implemented, it should do it a lot better than the high level procedural query you would produce. So the relational model just as a concept is independent of any kind of specific query language implementations. There are a lot that got proposed over the years. I think there's Alpha proposed by Cod, but no one talks about that anymore. There's Quell, which is pretty similar to Berkeley. And SQL kind of emerged as the de facto standards that people call that SQL. Call it SQL, both are fine. I still don't know what you're talking about if you say that. So it's kind of emerged as the de facto standard. And there are standard specifications, but pretty much every system implements its own variant. So if you write a query for MySQL, it might not exactly match something that runs Postgres or Oracle or SQLite or whatever. So that's the tricky part. There is a standard that everyone like 98% adheres to, but there's this kind of wiggle room where people implement their own special divergences. So going way back to the Python example I gave earlier, this was the code where you're going to iterate over all the lines and file to find the year that Ice Cube went solo. And if we rewrite that in a SQL query, we get this a lot more compact specification for what we want and not how we want the database management system to give it to us. So we say we want to select the year from the artist's table, where the name of the artist is Ice Cube. So that's a lot, a higher level, and we don't have to worry about kind of the low level details about reading the lines, parsing the lines, all that stuff. We just tell the database management system this is what we want, this is the answer that we want, you figure out how to get it back to me. So that kind of wraps everything up. The key takeaways are that databases are ubiquitous, they're used in all sorts of applications all over the place. There's any amount of data being managed in an application you're using. There is almost certainly a database management system behind the scenes. Relational algebra kind of defines the primitives for processing queries on a relational database. And we're going to see relational algebra again when we talk about query optimization and execution, but for now the next class is going to be specifically about SQL. We're going to hear some more advanced topics with SQL. So that's it, and we'll see you on Wednesday. Thank you.