 Welcome back to the DMU for another semester. This is 15-445-645 database systems or intro database systems. My name is Andy. I'm pretty sure everyone is in the right place. So before we begin, I want to spend a little time talking about what this course means to me. So I'm going to dedicate this course to my mentor Leon Rinkles. He passed away earlier in the year. He was listed as one of the original data scientists. He has five degrees in MIT. He was doing data science where data science was a thing. So he really meant a lot to me, and this is with me and him and my PhD student, Joy Rourage, earlier in the year when he was at hospice. So as we go throughout the course, he's always going to be in the back of my mind with everything. The other thing I'd like to say also too is that I want to thank VoltDB for sponsoring the course. They provided funding for course development. So they'll be coming later in the semester to give a sort of guest lecture at the end. It'll be a nice wrap up to talk about all the things that we'll discuss during the lecture, and you'll see how it's being applied in real-world database systems. So we're very appreciative of them. So the clicker does not work. So before we start now to get into the details of the course, I want to tell you a little bit about myself. I really only care about two things in my life. The first one is my wife. The second one is databases. I don't care about anything else. I don't talk to my family. I don't have any children. The dog is probably number three depending on the day. But I really don't care about anything else. So what that means is that when I start talking about databases, I get very excited and I start talking very quickly. So if I go too fast, I need you to raise your hand and tell me to shut up and slow down and repeat myself. Because if you're confused about something, then somebody else is going to be confused about something. So please tell me to slow down. The other thing I also say too is that I will not answer any questions about the course material or the lecture after the class is over. So in previous years what will happen is, I'll give a lecture and then immediately afterwards, like five people will run up to the rest of the stage and ask me questions about like slide 23 during the lecture. I refuse to answer any of those questions because if you have any questions about the material, stop me while I'm speaking because if you're confused about something, somebody else is going to be confused about something too. So I'm more than happy to answer questions about the lecture as we go along. I won't answer any questions about the material at the end. So today's agenda is going to be split into two parts. We'll start off with talking about the logistics of the course, what the plan is, what you expected of you as the student. Then we'll jump right into the material and start talking about the relational model and relational algebra. Again, the assessment in the back is meant to be a test of your knowledge and skill with C++. It should really take you two or three minutes, my students here took two or three minutes. They've been writing C++ all year. So we're not going to teach C++ to you. If you're not comfortable with C++, then this might not be the right course for you, or you need to brush up on your C++ skills before we get going on the course projects. Now, the most important thing you guys probably care about is the waitlist. As of this morning, we're back up to 150 people. The max capacity is currently like 100 or so. I think we're like 102, 103. So if you're on the waitlist, I'm sorry, we just can't take everyone. People drop the course and we'll pull people off the waitlist and enroll them. The way we're going to do this is that we'll sign you based on the order that you complete the first homework assignment with 100 percent score. That way, we're not dealing with what programming you're in when you're graduating. If you complete the first homework assignment, which we'll give out on Wednesday, we're ignoring the waitlist order on S3. The order you complete that is the order that we will meet you in the course, and we'll keep doing this until we get full. As far as the most fair thing we've come up with, everyone has any better idea. Bribing the TAs does not work. We got in trouble last year for that. Please don't do that. We'll just do it based on homework one. Any questions? Yes. The homework will be on AutoLab. It'll be on SQL and then we'll release that on Wednesday. So you'll get a score right back and AutoLab keep track of the submission histories. We know who submitted first. This question is, if he's on the waitlist, we have access to AutoLab. Yes, when we open it up on Wednesday, everyone on the waitlist will have access to AutoLab. Any other questions? Okay. The seats in the front if you want to sit down. All right, so this course, 1545, 645, this course is all about the design principles of a database management systems. So specifically we're going to look at disk-oriented relational database management systems. And we'll cover what that is as we go throughout the semester. But all I'll say is that this is not a course about how to actually build database applications or model database applications, right? I'll teach you advanced SQL on Wednesday this week, but I'm assuming most people already know how to write basic SQL, right? Or build database applications already, right? So we're really focusing on how do you actually build the underlying software to manage the database, right? So this is not what you're looking for, right? This is why I'm asking you to take the C++ self-exam, right? This is a project-intensive course. If you care about more about the high-level things about how to model or administrate a database, then the alternative you might want to look at is CMU 95, 7 or 3 in Heinz College. I think it might be just called databases or database information systems, something like that, okay? Again, this is a computer science course about the design of database management software. Now there was also another database course in CSD called Database Applications that was sort of a hybrid between Heinz College course and what this course is, 15, 4, 15, or 6, 15. So some of you may have signed up for that. Unfortunately, that's been canceled because Professor Christopher Lutus is taking another year off to go join his Iggy Pop cover band, right? So this is their album they put out early in the year. So unfortunately, he's gone, so that's not gonna be taught this year as well. So as far as you know, this is the only CSD course available for databases. So the way to think about how we're gonna organize the course material is that pretty much this week is the only week we'll talk about sort of the high level stuff. How do you actually, the theory behind database systems, a relational model, and how do you write SQL on it? And then going forward, starting next week, we're gonna start at the bottom of the stack and start building up the layers to actually build a database management system, right? So by the end of this semester, you should be able to know and understand, pretty much how any database management system is gonna be built. Now there'll be some in-memory stuff we'll talk about in the advanced class in the spring, but this is really like the fundamentals of pretty much every single database system that's out there, right? So we'll start off with talking about storage and then we'll talk about how to do query execution, and on top of that, we'll talk about how to do transaction management with concurrency control, and then we'll talk about how to do recovery and maintaining logs and checkpoints and things like that. And at the end of the semester, we'll spend a few weeks talking about distributed databases or cloud databases, and I'd like to finish off the last week, which I call sort of a system potpourri, where we sort of do a brain dump on, here's how a bunch of different real-world database management systems are actually implemented. Again, the Voltivee guys will come and give a talk, and then I'd like to spend the last class and let you guys pick any database system and I'll spend 10 minutes discussing how it works, right? And this is useful, because now we'll be able to apply the techniques that we talked about during the lecture or the semester and see how they actually are used in real-world systems. So the course policy and the schedule are available on the course website, which I sent out last night on Piazza. Again, this is an upper-level course, I shouldn't have to go over the academic honesty policy, but that's the link to it at CMU, right? The bottom line is don't be stupid, right? This is not a group project course, everything should be done individually, even the C++ self-assessment. And so what I prefer is that if you're unsure about something, right, please come ask me whether that's the right thing to do or not, because I'd rather have you say, hey, look, I've been working with somebody on our project, both writing separate code, is that okay? I'd rather have you come talk to me first, rather than me find out later on that you guys have been copying code from each other and then we have to send you over to Warner Hall to destroy your life, right? So the bottom line is please don't be stupid, okay? If you're not sure about something, please ask me. All discussions and announcements for the course will be on Piazza. If you're on the wait list, you should be able to sign up for this, right? I don't think there's any restrictions there. CMU wants me to use Canvas, right? We use that last year, it was a disaster, so everything will be done on Piazza, okay? There is a textbook for this course, database systems constant from Avi, Hank and Shonishan. The, this is a, this is my opinion, this is actually probably the best database systems textbook on the market now. A lot of them that are out there and I've looked at most of them are like 10, 15 years old. This one I think is maybe six or eight years old by now. The newer version is coming out this year, but the sixth version is good enough for what we want. So again, for every single lecture, I'll provide the chapters. You can read for supplemental information to go over the material in a bit more detail. But there'll be some material that's actually not covered in the textbook. And so for every single lecture, we'll also provide lecture notes that cover the, so the main points that were discussed during class. For the grades, the distribution will look like this, right? So if you're a SES undergrad, this class now counts for the systems elective, right? And the requirement there is it has to have over 40% count for projects and that's why it's 45% for projects. So there'll be a midterm exam and a final exam. And then later on in a few weeks, I'll discuss the opportunity for doing extra credit. Basically, we're writing encyclopedic about databases, so you pick one database system that you want to learn about and you just write an article for it, okay? And then again, I'll cover this later in the semester. For the homework to be five during the semester, the first one will be the SQL assignment that we're putting out on Wednesday and that one will be submitted through AutoLab and it'll be auto graded and you get feedback right away. And then all the rest of the homework assignments will be penned on paper that you'll submit PDFs or photographs on Gradescope, okay? And again, these should be all be done individually. For the projects, it's sort of like in the OS class 440 where you're sort of building out your own core piece of software and every single project will build on your previous implementation. So by the end of the semester, the goal would be to actually build your own database storage manager from scratch and we'll provide you guys with some scaffolding to sort of fill in the key parts. So again, for this, it's gonna be all in C++11. So we're not gonna teach you how to write C++. You can't come to office hours and say, I don't understand C++. We're not gonna teach you how to use GDB, right? We're not gonna teach you other tools to do debugging. Again, this is CMU, should be able to figure this out, already know it ahead of time. If you're not comfortable with this, then again, this might not be the right course for you. I'll say also too is that every project is gonna build on the previous one, right? So the first one is the buffer pool manager. The second one is the indexes. You can't build your index unless you have the buffer pool manager. So it's really important that you meet all these deadlines and you're always keeping up with the coursework. And if you have problems with that, talk to me as soon as possible and we can figure out what the right thing is. The late policy is that for both homeworks and projects, you're allowed a total of four slip days, right? So the deadline will be always at midnight for a project or homework and then any one minute over that deadline is considered late up for a 24 hour period and then you're allowed to be late at least four days throughout the entire semester, right? So it's four for both homeworks and projects, not four for each category, right? And then when you submit it, please just mark down how many days are late and how many days you think you have left, right? That way you sort of do a self audit to know where you're at throughout the semester. And again, this is Carnegie Mellon, right? I shouldn't have to say this, but I have to. All the homeworks and projects should be done by yourself. They are not group assignments, right? It's not 440 where you have a partner. Everything should be done individually, right? That means you can't copy source code from other people in the class. You can't copy the source code from the internet. Now some people took our source code from last year. They weren't seeing your students. It's on GitHub. A senior code, it's crap, you don't want it. And it actually won't work this year because we're gonna switch some things up because we're gonna switch from SQLite to Postgres. But again, bottom line is just don't copy from other people. Don't copy from things that you didn't write, right? And then if we find it, when we run it through Autolab, if it finds that you copy code, then we have to report you, okay? All right. The last thing I'll say also too is that going beyond the material that we're gonna talk about today, if you wanna get more involved with database research here at Carnegie Mellon, just learn about more databases in general, there's three opportunities to do this. Again, I will announce this on Piazza. The first is that the CMU Database Research Group meets on Mondays at 4.30, right? This is a sort of casual setting where we have researchers, my students, and sort of sometimes outside, people come talk about kind of stuff they're doing in databases. If you wanna get involved on the development side of a database system, we're actually building a brand new database system here at Carnegie Mellon from scratch called Peloton. All my students are working on this. So we have our developer meetings on Tuesdays at 12 p.m. in Gates Hall, or Gates Hillman. If you wanna take the advanced class in the spring, it's all based on the system. Again, if you just wanna get your hands dirty on a sort of what I call a sort of commercial grade database system, although we're not there yet, then you can get involved in this project here, right? Some master's programs, you have to do capstones. If you wanna eventually do a capstone with me a year from now, it'll be based on the system, so you wanna get started as soon as possible. What I also say too is that, again, not to toot my own horn, but all the students that come work with me on this project have more jobs to turn down than they know what to deal with, right? Because all the database companies, my friends at database companies, are emailing me asking for more students that come out of our group, right? Because they can't hire people fast enough. Now, the current name of the project is Peloton. We're gonna have to change the name this semester because there's assholes with the exercise bike on TV. We don't know if we're gonna call it yet, but that's the current name now. And also to say too, is we spent the summer actually rewriting a lot of the storage engine from scratch, and now we're gonna start bringing in the pieces of the old code into our new repository. So it's a good opportunity to get started early on this and sort of understand how all the different pieces work together. And again, I promise you, you will have no problem finding a job if you get involved in database systems. And the last one also to announce is that, I am running a seminar series this semester that'll be on Thursdays, but not every Thursday, in the CIC building. And so what I do every fall semester is I have a seminar series on sort of one particular theme or category of databases. So this year, we're doing hardware accelerated databases. So the thing of like a database system that instead of running everything on the CPU, they can run it on the GPU or other exotic hardware. So the first speaker will be next week, coming from Connecticut, which is like again, a GPU based database. Everything will be on YouTube, but if you wanna come and get pizza and sort of meet and talk about what these guys are doing, then I encourage you to do that, okay? Any questions about the course? What's expected of you? Yes? Will this slide deck be uploaded online? His question is, will the slide deck be uploaded online? Which slide deck? These are mine. Everything will be on YouTube, everything will be online. But it's a DIY operation, so if it doesn't work, it doesn't work. But the slides will be up. Any other questions? Okay, let's jump into it. Databases, my second most favorite thing in my life. Can everyone give me an example of a database? What's that? She says SQLite. No. He says MongoDB. No. Redshift, no. Okay. Go for it. What's that? There you go, it's students of the class, right? So she said SQLite, he said MongoDB, he said Redshift. Those are database management systems. He said the students in the class. That is a database, correct. So a database is an organized collection of interrelated data that's gonna model some aspect of the real world. So as an example, it's perfect. A database could be the list of students that are enrolled in this class, right? Now you need software to actually manage that, and that's what the database management system is, right? Or DBMS, right? So the reason why I think this course is important, obviously because I'm biased, but the databases are gonna be the sort of core component of almost every single aspect of technology or computer applications that exist in the world. I guarantee you that even if you don't go into the computer field, if you go to any field that's related to technology, you will come across databases throughout your life, right? Every single website you can think of is that's doing something meaningful is gonna be backed by a database, right? Every single major computer application is gonna be backed by a database at the end, right? And then you need a database management system to actually manage that software and provide you some nice guarantees that you don't wanna have to end up writing yourself. So let's use a simple example of a database and then we'll see actually how we can actually maybe manage this ourselves or manage this with software. So let's say that we wanna create a database that models a digital music store, something like Spotify, right? In our database, we wanna keep track of the artists and the albums that they put out, right? So what do we need to store, right? We have information about the artists and then we're gonna have information about the albums that those artists ended up releasing, right? So the easiest way, the most trivial way to actually implement something like this is to use what are called flat files. So basically you have a text file that's using a CSV format, comma-separative values, right? And you have one file for artists, one file for albums and every single line in the file is gonna be one entry or one album or one artist, right? So this is the most primitive sort of database you could actually build, right? And what's gonna happen is in order to actually to operate on it in our application code, we're gonna have to write our own code to actually manage these files themselves. So if you wanna do a lookup to try to find a particular entry, we're gonna have to open up the file, loop through it and look at every single line, parse the commas, find where the name of the artist is and extract the one that we actually want. So let's see what this will look like, right? So again, say I have two files, one for artists, one for albums. So in the text above the file, it says artist and then in parentheses I have the attributes that they're storing. So the artist can have a name a year in a country, right? And the album can have the name the artist in the year, right, pretty simple. Each attribute on a line is gonna be separated by commas. That's how we're gonna denote the different fields or different attributes. So let's say I wanna write a simple query that gets the year that Ice Cube went solo, right? Ice Cube was in NWA, right? They were screwing him over for money, so then he broke out and did a solo career, right? So the way we would do this is that we just write sort of Python code like this that again just iterates every single line, checks to see whether or each line split up the commas to get a record and then jump to the first offset in that record array and the check, which we now know is the name, to check to see whether it matches the thing that we're looking for. And if we find the thing we want, then we then just print out the year field after casting it to an integer. Pretty simple, right? What are some problems with this? Yes? We don't know if the artists are in chronologically, so you might get a year that he went solo but not the year that he went solo. So he said that we don't know whether the, sort of the lines or the columns. No, we don't know that the first thing that appears for Ice Cube would like to be the earlier. He said that we don't know whether the first thing that appears for Ice Cube is gonna be the year that actually he went solo, right? Again, for this we can assume that Ice Cube appeared once, yeah. Yes? It's what? Right, so you said it's costly, meaning like my example here has three lines. Even 10,000 lines is not gonna be a big deal. Think 10 billion, right? Some, you know, there's databases that big. Like the Walmart database, it's every single item that anyone's ever bought at a Walmart, right? It's gonna be billions, right? I think so I think we're all in agreement that this sucks, right? This is a bad idea. Let's go a little more detail what's going on. So the first problem we're gonna face is that how are we gonna ensure that we always use the same artist name for each album entry, right? So if Ice Cube puts out different albums, I have in my album file, right? I'm repeating Ice Cube over and over again, but let's say Ice Cube changes his name for whatever reason, how do we make sure that we update all the entries for Ice Cube? So they're always the same. Next problem is that what happens if someone comes along and puts in changes of an album year and puts in a funky string that's not an actual integer, right? Again, these are just text files. I can open up my favorite editor and muck around them as much as I want. And then now my program that's actually gonna parse them, try to figure out what interpret meaning from them, it's gonna come across data that it doesn't expect to actually see. So what should actually happen? And there's no way we can prevent that because again, there's just text files sitting in the file system. Anybody can open them up and change them. The next problem is that how are we gonna deal with the case where we have albums with multiple artists? If you have a mixtape that gets dropped and there's a bunch of artists that are on there together, how do I actually model that in my flat files? I was assuming that there was only one artist at a time. But now I gotta do something scurry, like maybe I'll take my artist field and I'll make an internal commerce-separated list to denote the multiple artists. But that's kind of weird now and I have to change my application code to actually deal with that. Then we actually have now the code to actually open up these files and find the data that we want, right? So the first problem is how do you find a particular record? So she pointed out that in my simple example here of three lines, that's not a big deal. Then she said maybe 10,000 lines would be a problem. That's not a big deal either. It's when you actually get into the billions, like really large data sets, then open up the file every single time you run a query and scanning through every single entry is gonna be really slow, right? Just to find one record, right? Maybe we luck out and the first one is the one that we're actually looking for. Worst case scenario is the last one, right? Now I'll say also too is that maybe for our first application we built, we were running like a website that was using these files. So now we wanna have like a sort of a separate service or a desktop application to actually use these files, right? Now we have to rewrite all the code we had in the first application to open up the files and parse them and get that with the data that we want. We have to duplicate that in our second application and so on, right? Now we gotta make sure that they're always in sync as well, right? If we change any aspect of the files, we have to make sure we have to change both applications. The last one is that what if we have, you know, two separate applications or two processes or two threads, both trying to update the same file at the same time, what's gonna happen, right? Well, we just let the operating system manage this for us, then it's gonna be the last writer, right? So the first guy could open the file, second guy could open the file, the first guy does the right, the second guy then does the right, it's gonna clobber the first guy's right. So we end up losing data, right? The last problem now we gotta deal with is how to make sure that our data is safe, right? It's not cool that your bank loses your money, right? We don't wanna lose any data. So how do we actually ensure that? Well, if we have our simple Python code to open up these files and parse them and we start updating things, what happens when the machine crashes or the application crashes while we're doing a write, right? Worst case scenario, we just lost whatever last thing that we wrote, right? Even worst thing would be actually the file gets corrupted and we lose all our data. So we need our application to make sure that we don't lose anything. But now let's say that we wanna make sure that our database is always accessible, right? If you have a website, you never wanted to go down, so you don't always need to be up. So what you wanna do is maybe replicate it across multiple machines so that if one machine goes down, the second one can keep on running and still service requests. But now if I have two machines updating the same data at the same time, how do I make sure those things are in sync? So the answer to this is a database management system, right? A database management system is software that's specifically designed to store and analyze information in a database, right? There's different kinds of database management systems or different systems that do sort of different types of application scenarios, but at a high level, this is what they're gonna do. And so the idea to think about this is instead of writing all that code to open up those files and manage them for every single time I write a new application on my database, a database management system will do this for me, right? And that way, I can worry about writing all the business logic or the complicated code that I wanna have in my application that actually serves the purpose of the business or whatever it is that I'm trying to accomplish, and I leave the database system to manage all the durability and guarantees that I would want for my data. And the reason why I think, again, think about why this is so important, think about all the courses that CMU offers, right? Or in the CS department or SCS, right? There's a course just for this database systems. Actually, there's two courses, because I have two of them, right? There's no course that you had to write a web browser, right? There's a course that teach operating systems, right? These are important core components you need to have in modern applications. So that's why I think this is worth studying. So now it's sort of taken for granted in today's time that we have all these sort of database options available to us, but didn't always used to be the case. Right again, when I asked you guys to name a database, I heard Redshift, I heard Mongo, I heard SQLite, right, these things exist now and they solve a lot of the problems that we want to have for databases. But back in the old days, it certainly wasn't like this at all. So if you go back to like the 1960s, 1970s, when the first database systems sort of came online, these things are actually really difficult to use and maintain, right? And part of the reason is because there was this tight coupling between the logical layer of the database and the physical layer. So the logical layer would be what I was showing before when I described the music store. The logical layer would be here I have an artist and an artist has these attributes and I have an album and album has these attributes. The physical layer would be how that data is actually being represented and stored in bytes on in memory and on the hard drive, right? So back then in the 1960s and 1970s, these first database systems, you would actually define exactly the physical structure of the database in your application code, right? So the first example of one of the earliest databases was this thing called IMS from IBM, Information Management System. It's actually they built this database system as part of the Apollo moon mission, right? To keep track of all the parts they needed to build the rockets to go up in the space. So back then the way IMS would work was, you would say, all right, here's my database and I want to store it as a hash table or here's my database and I want to store this as a tree structure, a B plus tree. And then based on what you actually defined as the physical data structure for your table or your database, that would then expose a certain API to you in your application. So if you got a hash table, you could do point queries, do single key lookups. If you had a tree structure, you could do range scans. So the issue was then now, if you later on decided, oh, I don't want a hash table, I actually want a tree structure, then you got to not only dump out the database and put it back in as a tree, but you need to go back and modify all your application to now use the tree structure API instead of the hash table. So essentially, you kind of had to know exactly how the data was going to be used before you actually started building your application, which is not always the case and not easy to do. So what happened was at IBM, in the late 1960s, early 1970s, there was this guy fresh out of grad school named Ted Codd, and this is IBM research in New York, in Watson. And he basically went around IBM and he saw all these people writing, these programmers rewriting their database applications over and over again because they had to deal with the case of the physical data structure change or the logical structure change and then rewrite the application from scratch. Now back then, again, humans were cheaper than machines, it's the opposite now, so you just throw more money and hire more programmers, but this is not scalable, right? So he saw this and he saw that this is a clear problem and this is because there was again a tight coupling between the physical layer and the logical layer. So he wrote this seminal paper in a tech report in late 1960s and then there was actually the full paper in CACM in the 1970s and he proposed what was called the relational model. And the relational model basically has three key points. The first is that you're gonna store your database as these simple data structures called relations. This is at the logical level, you're not just defining trees, you're not defining these weird hierarchies that people were using back then, but you just say here's my relation, here's the attributes that it has and that's the end of it. And I would have all the relations as well. Then you would have now a, you would access your relations through a high level of language, or declarative language, right? This is before SQL, but it was sort of basically what SQL is now, right? So no longer in the IMS case would I would say, all right, I have a tree data structure and here's the API for me to make calls to it, you wouldn't do that anymore. You just say, here's the query I wanna execute, you don't care how the data system actually stores it, it'll figure out how to do this for you, right? This is actually pretty controversial back then, right? Back then people were saying there's no way a query optimizer and a database system will be able to generate a query plan that's as efficient as what a human can code. This is sort of similar to what people were saying back then about compilers. There's no way a compiler, a C compiler could generate machine code that's as efficient as assembly written by a human, right? And for some of the cases, sure, that's true, but most of the times nowadays it's not, right? Compilers is actually pretty good. So the last piece is what I was saying about before is that because now we have this decoupling between the logical layer and the physical layer, it's now up to the database band of system itself to figure out what the most efficient way to actually store your data, right? For each table, each relation. So no longer did I have to say store this table as a tree, store this table as a hash table, I would just say here's my table, here's the assets that it has. And it was up for the database management system to figure out based on the queries you wanna run on it, what was the most efficient way to actually store this? Right, and we'll see this later on. There's nothing that says that it actually has to be static, right? It could decide a tree structure is the right thing now, but based on how your application is actually using it, it could switch over to a hash table or some other data structure. And that's fine because of that first point, because we're writing, or the second point, because we're writing our queries in a high level language that doesn't have any information about what the underlying data structure is, the database system is free to change this any way that it wants, and use the program and don't care. So the other key thing that Ted Codd also produced is there's no issue of a data model, right? The data model is essentially a high level concept that is a way to describe the collection of the data that you would actually store in a database. So the relational model is one example, on the next slide we'll see a bunch of other ones, right? He's sort of the first person that said, all right, here's what this actually means. You're not just dumping data into a database, you actually can represent it as data model at a more high level abstraction. And then a schema will be an actual description of a collection of data for as given data model. So again, if you're familiar with SQL, right? SQL you call create table, define a table, right? That's the schema to represent a piece of data or a collection of data within a relational model database. So as I said, there's a bunch of different data models that are out there. These are probably the main ones that are out there now. There's even more obscure ones. Theoreticians went buckwad on this in the 1980s, and there's a bunch of different things that are some systems out there that actually implement it. Most of the time you don't need to know, don't need to care. So again, most database systems that you can think about are using the relational data model. So the three we have listed at the beginning, SQLite, Redshift and MongoDB. SQLite and Redshift are relational databases. MongoDB is considered a document database route, right, or an object database. There's another class of systems that you might be familiar with called NoSQL. They sort of fall into these data models here. Key value, graph, document, column family. Again, Mongo is a document database. Graph database would be something like Neo4j. Key value would be like Redis, DynamoDB, and column family would be like HBase or Bigtable from Google. A lot of things, so people say these systems are called NoSQL. A lot of them are actually now adding back SQL support, you can run SQL on top of these different data models. You also have array and matrix data models, so instead of serving things as relations, you use sort of things as multidimensional arrays. There are small number of systems that actually do this. SIDB is probably the most famous one. And then the last two are these hierarchical and network data model. So IMS is hierarchical. The network data model is this other thing called IDM, what's a code of sale model. These are obsolete and rare. If you hate yourself and you want to get a job in the industry to do maintenance on old code from the 1960s, you will come across these style of databases. But if you're a new startup, there's no way in hell you would actually use these, okay? So for this class, we're going to focus mostly on the relational data model. And part of the reason is that with the exception for the array and matrices, you can represent all of these things, all these other data models in the relational model, right? A key value data model is it's just a relation with two columns, right? Graph databases, you can represent the relationships as, again, as another relation. Is my opinion that most of the times, what people need is a relational data model. There are obviously some cases where one of these more specialized systems might actually be better. But it's more about the API that exposed to you as the programmer and not so much how the data is actually stored, right? So in the graph databases, for example, they have specialized, you know, function calls and ways to run queries that assume you're running on a graph model and therefore they can work more efficiently than writing a SQL. But again, at the storage level, they're essentially the same. Okay, so the relational data model is going to define three things. So the first is the structure, right? And this is essentially a definition of, for each relation, here's the attributes that they have and here's the domain of the values that you can actually store in them. So my example earlier, when I showed the artist relation or the artist table, right? An artist had a name, a country and a year, right? The name has to be like a string type. The year has to be integer type, right? You define that in the structure. Then you have the integrity definitions and this is essentially the constraints you can impose on the tuples and the values for the attributes that have to be satisfied in order for you to store a tuple in the, store a tuple into that relation. So again, in my structure, I define, I have three fields, name, country and year and then an integrity constraint could be I can't store anything unless it's an integer for a year, right? Otherwise it's an invalid tuple. And the last one is defining the manipulation constructs and this is essentially how we're gonna actually access and modify the data in a relation. So this is what we're gonna talk about today when we talk about relation algebra, right? And SQL, this is essentially you define your tables and then you write queries to actually get to the data. So the relational model is based on this notion of a relation, right? So don't think of it as like, I'm in a relationship with my wife or your whatever, right? It's the actual table itself in the mathematical terms is called a relation. And so a relation is gonna be the unordered set that contain the relationship attributes that represent different entities in that table, in that relation, right? So the important thing that's gonna come up throughout the semester is this notion that the relations are unordered, right? And this is gonna actually, this matters from a systems amplification standpoint because we don't have to maintain ordering. This allows us to do a lot of things more efficiently than you would have otherwise, right? So it also means that we're gonna end up cases where a query can produce different results that are actually still all considered correct because things are unordered, right? If you care about ordering, you define an order by to sort things, right? If not, then it can be any random order that it wants. And then within a relation, we can have a tuple, right? And that's gonna be defined by the set of actually values for a given instance of a relationship within our relation. So under the original description of the relational model from Ted Codd from the 1970s, the values for the attributes in a tuple have to be scalar or atomic, right? I mean, they can't be lists, they can't be sort of nested data structures. In newer systems, that's actually relaxed, you can actually do this, but in the original definition, you couldn't. Then we're also gonna have a special value called null that works a lot like how nulls work in C or C++, right? It essentially means that the value for a particular attribute will be undefined, right? This data simply doesn't know anything about. So in our example from before, from artist name and year, right? This is defining the relationship for that, right? It essentially looks the same thing as our flat files, except now that the data system actually is maintaining the field delimiters themselves. And no one will have to parse commas. To be a bit more mathematical, we can say that a relation that has n attributes is called a n-ary relation, right? n-ary will come up later on when we talk about the storage models for these things, but just so that you know the term, it means n-ary. The other key thing about relational models is that it introduces two types of keys. So primary keys and foreign keys. So a primary key is some set of attributes for that relation that will uniquely identify every single tuple, or exactly one tuple. So in this case here, none of these are actually the primary key for us because there's nothing really stopping from someone else from naming their band Ice Cube, right? So we can't assume that the name is gonna be unique, right? So what we can do instead is we can introduce a special primary key called the ID, right? And this is just some unique integer to represent exactly the one tuple that we want. So what'll happen is that if you don't define a primary key, some data systems will actually generate one for you, but they'll maintain it internally. So my SQL, for example, if you don't declare a primary key, underneath the covers it uses the row ID, which is like basically the block ID and offset of where it's actually stored on disk. I use that to represent the primary key for you. We'll talk about primary key indexes later on, but if you create one, then there's much of other optimizations you can do to find data more quickly for doing selects and updates and things like that. And then instead of having to maintain this counter for the ID manually in your application, a lot of data systems also support for what we call auto-generated keys. So in the SQL standard it's called sequence. In my SQL, we'll see this over and over again, my SQL likes to do things differently for whatever reason. They call them auto increment. And basically what happens is now if I insert a tuple, there's a counter that always gets added one and creates a new primary key entry for me. The other type of key if you care about are called foreign keys. And this is gonna allow us to model the relationships of relations to make sure that we're always in sync with each other. So foreign key is gonna allow us to say that one attribute for one relation has to have a matching value in another relation. Remember I said before the example was how do I make sure that if I have all these albums for Ice Cube and Ice Cube changes his name, how do I make sure that they're always in sync with each other, right? Well, foreign keys will solve this problem for us. So this is the example we have before, right? Now we're using the primary key with the integer as the ID field. But in the case of using the mix tape here, say this I have multiple artists here. I said before the relational model has to have scalar values. So I can't have a list of values here. I need to have some way to keep track of those multiple artists that belong to a single album, right? So I can do this through what's called a cross reference table. And here the standard way to denote this concatenate the two names of the relations, the two names of the tables together. So now I have a new relation called artist album that's gonna have an artist ID and an album ID. I no longer need to store the artist in the album table and then I have foreign keys now and this other relation now pointing to these both of these things here. And the combination of an artist ID and an album ID is the primary key for this relation because you can have multiple attributes together as the primary key. This assures that one artist can't be on the same album multiple times or denoted that's on the same time, right? So we'll see this later on when we talk about joins but now I can write queries that will combine these relations together and match up the different attributes based on their foreign keys. And underneath the covers of the data, most data systems will actually do is that if I try to insert an entry into the artist album relation that doesn't have a matching artist in the artist table and doesn't have a matching album in the album table it'll throw an error. It'll prevent me from doing that, right? Because it says I don't have a, you know, there's no corresponding match for the foreign key. Or likewise, if I delete an entry from the album table I can have it automatically delete the entry from the artist album table, right? This is called a cascading delete, right? This ensures that there's no dangling pointer just to something that doesn't exist anymore, right? This is another good example of what a database management will provide for you that you would otherwise have to write yourself in your application code. All right, so now that we understand roughly what a relational model is and how we actually can store data in them now we need to actually get data out of it, right? Put data in, get data out of it, right? And there's essentially two sort of classes of languages that you can implement for this. The first are called procedural where this is where you're gonna define at a high level the actual steps you're gonna execute to run your query. And the second one is called non-procedural where you describe at a really high level the answer you want the data system to actually compute and then it's left to figure out how to actually do that automatically on its own. So the first one is with the category of relational algebra is which we'll cover in the next couple of slides. The last one is what relational calculus is based on. We are not gonna cover this because it's usefulness is very limited for what we care about in this course. If you're building a query optimizer from scratch this is something that you have to understand or let's say you wanna replace SQL with something even better although in my opinion SQL is pretty good then you have to understand relational calculus to actually how to do this. So we're gonna focus on relational algebra here. So relational algebra again is the final operations we can define to actually retrieve and manipulate data that actually stored in a relation. And this is again this was defined by Ted Cobb when he wrote those early papers in the 1970s, right? He sort of laid out here's what a relational model is and here's a relational algebra to actually read and write data or get data in and out of our relations. So there's gonna be seven different operators we're gonna care about. We'll go through these at a high level though mostly come up later on when we talk about query execution. They're not hard to understand. I think it's important to see them once so that when they come up again you know what they are. So each operator is gonna take in either one or more relations as the input and it's always gonna output a new relation. The idea here is that when we actually wanna write queries we're gonna chain together all these operators together using them as sort of the building blocks to then generate more complex queries that derive more complicated answers than the basic things. So the first relational operator is to do a select and that's represented by the lower case sigma symbol. In my opinion this is easy to remember because to select starts in S and the sigma starts with an S. But the basic idea is here is that we're gonna take our relation as our input and then we're gonna select or choose a subset of the tuples that we actually wanna produce as our output. And the actual original relational model papers these are called restrict because it's essentially restricting what tuples are actually put in the output. It's gonna be represented this with the lower sigma followed by a subscript predicate. I think of something like classic Boolean logic like if something is greater than something or something less than something and then we define what relation we wanna apply the selection on. So let's say I have a simple relation like this I have relation R has two attributes A, AID and BID. So I can have a select like this that says where AID equals A2 and then the output that's produced is just again any tuple where the attribute AID equals A2. I can start combining these together again using conjunctions and disjunctions to do things like select where AID equals two and BID is greater than one or two. And then that just only selects that one tuple as the output, right? Pretty straightforward. In SQL essentially looks what we're defining here is essentially the same thing as the where clause, right? So what we have in the where clause below is the same thing that we had up above for our predicate here, right? Again, it's not exactly the same as a select because a select in SQL can do much more stuff that we can't do here in relation algebra. Next thing we have is projection and projection is basically going to limit which attribute to actually get produced as the output from the input, right? So again projection starts with a P, you represent this with a lower case Pi symbol. The thing you can do with the projection is that you can, more than just saying take these attributes and produce as my output, you can actually reorder them, right? The output of these operators are actually, the order actually matters for the attributes in a single tuple. The ordering of the tuples across the entire relation actually doesn't matter. You can also actually manipulate them to actually change the output based on some additional logic. So I can do a projection like this where I say I have my inner select where I say AID equals two from the relation R but then in my projection list, I want to take the BID, subtract it by a hundred and then output the AID. So I'm flipping the order and manipulating the BID, right? And again, now you see how we start to chain these things together to produce more complicated things. And in SQL it would essentially look like this, right? The output target of the select statement is the same thing in my output target in the projection operator. Yes? Is it fair to say that SQL is an intensity procedural or I'm slightly confused about that. Right. It's question is, is it fair to say that SQL is inherently procedural? No. So SQL is actually a non-procedural language based on relational algebra. And the difference here is that in SQL, I'm defining what I want, not actually how to actually execute that, right? In relational algebra, you would start at the inside of the parentheses. So I had to do this select first and then I could do my projection, right? So it's sort of like, it's almost like the exact steps you have to execute to actually produce the answer. In SQL, for this simple example, you would essentially do it the same way the relational algebra was written but when we see more complicated things, you don't have to execute it the same way that the relational algebra actually defines it. All right, so now we can talk about the additional binary operators. Again, if you take into discrete math or any kind of set theory course, this should all be very familiar. We can do a union. You basically take two relations and you're gonna take all the tuples that are in one relation, all the tuples in the other relation and combine them together in a new relation. So for this, I now have our relation, R, AID, BID, and S with AID and BID. So when I take the union to them, it's just again concatenating the two relations together. So in this example here, it is ordered based on the two relations. So in the output, I have all the tuples from R, followed by all the tuples from S, but again, under the relational model, it's actually unordered. So this can actually be in any possible order that it wants, whatever the system wants to use. And then there's a union operator in SQL that essentially works just like this. So for this to work, though, what I'll say is that you have to have the two relations you're trying to union together. They have to have the exact same attributes for the same domains, the same types, right? If the S relation has a third attribute, then the union wouldn't be allowed to proceed. And it works the same way in SQL. There's also intersection, again, same thing, where you just basically get all the tuples that appear in both input relations. So in case of R and S, when I take the intersection of them, I only get that single tuple. And again, the same thing, I can do this, there's an intersect operator in SQL, and it has to assume that the two relations have the exact same number of attributes. The last one is difference. And again, you're basically taking all the tuples that appear in the first relation and removing any ones that appear in the second relation. And so, again, it works just like the other ones. I take the difference and I produce the output. And in SQL, the keyword is accept to do this. All right, so now we're gonna maybe do a more complicated thing is to start combining these tuples together other than those basic set operators. So the first thing we do is take the product of two sets. So this is sometimes called the Cartesian product. Basically, you get all the tuples, you're gonna get a new relation as your output that contains all possible combinations of the tuples from the two relations, right? So you take R and S, you take the product of them, the cross-join, and then you produce, again, this giant map like this. So here now, again, we're not, we're actually concatenating the tuples together and producing them as a new tuple. So the first two relations, the original relations, each had two columns. In my output now, they have four columns, right? So for every single tuple, in the first relation, I'm concatenating together with every single tuple in the second relation and so forth. So this seems kind of useless, right? Why would you actually wanna do this? And I'm gonna think of an example where you would actually wanna use something like this. What's that? Join, we'll see join next slide, but no. This is the type of join, but it's actually, it's not trying to match things up, it's just trying to get all possible combinations. So this shows up in testing a lot, right? If you wanna make sure that you test every single combination of different parameters, you take the Cartesian product, right? It's actually really simple to implement, right? Because there's two for loops, so you just, for every single entry in the first one, loop through every single entry in the second one and concatenate them together. In SQL, there is a cross-joining operator like this. You can also get the same thing if you just have two relations listed in the from clause without a where clause to define how they're actually being joined together. All right, so the thing that they asked about or the thing that somebody mentioned before was to do a join. And so a join is basically like a super set of a cross-joint. And this example here we're doing essentially what we call a straight join or equa join or a natural join where we're gonna take, we're gonna create a new relation where we get a combination of the tuples that match some predicate to do our join in both of our relations. So in this example here, we're gonna take the, all every single attributes in R and we're gonna check to see if there's a tuple in S that has the exact same values in the exact same attributes. Right, so the output would look like this. So A3 and 103, there's only one tuple in R that has a match in S and that produces our output. It's called a natural join because it's sort of natural way, natural in quotes, you would combine together these two relations. In SQL, you can invoke it directly as a natural join like this. So again, it's gonna look to see, do I have two attributes with the exact same name in these two relations and check to see whether their values match up. Yes. What is the difference between this and the intersection? This question is, what difference between this and the intersection? The intersection is removing any attributes. No, the attributes are the same. Yes, good question. It's a difference. No, sorry, one. That contain the tuples that appear in both of the two input relations. In this case here, yes, it's the same. Yes. So natural join doesn't appear actually that much. You actually don't wanna use this because you're assuming that the, you're sort of assuming that the schema can match up based on the names. We'll see examples when we talk about joins later in the semester where you can actually define how you wanna join in on clause or the where clause, right? You actually, we'll never see something like this, but this is how you would define it in the original relational algebra. In the back, yes. This question is, if the AID did not match with the BID value match, would I still produce a joined output? No, right? Because the natural join, because for every single attribute that has the same, every single attribute that has the same name, actually this is the answer to your question. Every single attribute has the same name, I'll compare to see whether they match. And if at least one of them doesn't match, then it's not produced as the output. So if you're assuming this natural join and intersection, intersection, you have to have exactly the same attributes. In the natural join, I could have a third attribute on S and it wouldn't actually match them, right? Because there's no corresponding one in R. Okay. So these seven operators that I just went through, these are the, again, the basic building blocks of what was defined by Ted Codd in the original relational model papers in the 1970s. But it's obviously there's a bunch of examples we can think of where the original relational, these original relational algebra doesn't cover, right? Sorting, how to actually do updates, renaming the renaming the columns. So after the original paper, there was a bunch of extensions for the relational model, relational algebra that cover all these things that we'd actually care about in the real world, right? So these are the basic symbols that actually do these things. Again, from our purposes, we don't really care so much because we're actually building a system that we can then apply these things on top of and go beyond what the original relational algebra can actually do. But again, these are just important to sort of understand and know that they actually exist and we'll see them later on when we talk about query processing. Okay. So the question we had early before was is SQL essentially just a relational algebra? And I said no. All right, again, the reason is because relational algebra is still going to define the exact steps you want to use to actually execute the query. So what do I mean by that? Let's say I have two, I have a single query that wants to find all the entries from the natural join of RNS where BID equals 102. So these two relational algebra statements here are actually equivalent. They will produce the same answer, right? The first one does the natural join between RNS, then it does the projection, or sorry, then it does the restriction on the select to filter out all the entries where BID doesn't equal 102. And in the second one here, I actually do the selection on S first, then I do the natural join. These will produce the exactly the same answer, but these are actually gonna have a lot different runtime performance, right? If I have, if every single tuple in S has value equals 102, then it's the same thing as this one here because it always can produce the same output. But if only one out of a billion has value of 102, then the second one's actually faster because I'm gonna filter out everything and then actually do the join. So again, this is why I was saying that in a procedural language like rational algebra, you're still defining the steps you'd actually need to do to actually get the query, and what we really want is just to say at a high level what the answer we want, and then we'll have the data system figure out how to do that for us. So in English, this sentence here, retrieve the join tuples from R and S where BID equals 102, that's, this is equivalent to what these guys are doing, but I didn't say actually how to do it. The data system can figure out, oh, I have one tuple out of a billion that has value equals 102, so let me do my predicate first, let me do my filtering first, and then I'll do my join, right? Or if it knows that all the values equals 102, then I'll just do the join anyway because it doesn't actually matter. So this is what SQL is gonna be for us, right? It's gonna be a way for us to define at a high level, declare at a high level what we want the answer to be, regardless of how the data's actually being physically stored, or how we wanna order our operators in our relational algebra. So we'll see this later in the semester when we talk about query optimization, there's a bunch of rules you can define how you actually can reorder relational algebra operators and be guaranteed they're always gonna still produce the same answer. So the main takeaway from this is that SQL is essentially the de facto standard for how you write queries on the relational model databases. And I'll actually a lot of the other data models that I talked about before, a lot of these no SQL systems and other types of systems, they're now adding SQL support on top of their systems because it's the de facto standard, it's everywhere. This is what people expect if you wanna have a sort of a database system that people can use or wanna use. So essentially what's gonna happen is this is an example that I had before where I looped through my flat file, looked at every single line and tried to find the match where the name equals ice cube, but now I can just declare it in my SQL statement at a high level. This is the answer that I want and today's system's free to figure out the best way to actually do it. Okay, any questions? Okay, so we'll finish up here. Please leave your self-assessment for the C++ stuff before you leave. Again, so I'll post the answers tonight on Piazza. Again, for me try to figure out how you actually, how well you actually know C++. The main takeaway from all this again, databases are everywhere. You're gonna come across them throughout your life so they're important. Relation to algebra is gonna be the primitives we can build on top of and do more complicated things. And then we'll see later on how to actually, how do you take relation to algebra and generate optimized queries for that, okay? All right, so next class we will cover what I call advanced SQL. This is sort of like gonna high level over your SQL. Most of you guys have already seen SQL before, so we're not gonna spend time on how to write selects. We'll start with the advanced stuff on Wednesday and you'll need this for homework one, okay? All right guys, see you on Wednesday. Ha ha ha, that's my favorite all-time job. Ha ha ha. Yes, it's the S.T. Cricut, I-D-E-S. I make a mess unless I can do it like a G.O. Ice cube with the G to the E to the T.O. Now here comes Duke, I play the game where there's no rules. Homies on the cusp of y'all my food cause I drink bro. Put the bus a cap on the I's bro. Push week on the go with a blow to the I's. Here I come, will he eat that's meat? I'm a five, six pack for the act against the real. But y'all are drinking by the 12. They say Bill makes you fat. But saying I's is straight, so it really don't matter.