 Welcome to CMU 15-445-645 Introduction to Database Systems. My name is Andy. I'm the instructor for this course. As I said on Piazza, I can't be in Pittsburgh right now. We are getting ready for Matt's cage fight on Wednesday. So we had to be out here in LA to prepare for that fight. So rather than just not having classes this entire first week, I wanted to record this while I'm on the road, post it online, and then that way, when we start having in-class lectures next week, we can get started on the good stuff. So before we get into today's lecture, I first want to talk about how Oracle is helping us out this semester with course development. So Oracle is actually one of the first relational database minimized systems that came out in the 1970s, and I'll explain what a relational database is in a few slides. So again, it was one of the original ones, but it's still widely used and sold today. It's probably the second most deployed database system in the world. It's certainly the most deployed commercial system in the world. And so even though it's from the 1970s, and there's other systems in the 1970s that are still sort of around, Oracle is actually still under active development, and they're adding a lot of interesting new modern features to it. So this course is about the, what I'll call it, maybe traditional database minimized systems or a classical design, and the interesting thing to see when they come in this semester to give a guest lecture and talk about some of the stuff that they're adding to Oracle today, how they sort of go beyond what we'll talk about in this course and sort of looking at more advanced topics. So for this lecture here, I want to first talk about the overall outline of what's expected in the course for you as the student taking it, and then we'll finish up giving sort of a half lecture on the relational model and the relational algebra, and this sort of be the background that you need to have for the various topics we'll discuss throughout the rest of the semester. So if you're a student at Carnegie Mellon, then probably the foremost thing that's on your mind is the wait list. So unfortunately this year, they gave me a smaller classroom in Margaret Morrison than I wanted. So that means we can't take a large number of people. I think the max capacity is 100. I think it's like 96 is in the room. We can flood that number a little bit. But I mean, the main takeaway is here we can't accept everyone. So the wait list now I think is 150 when I checked earlier today. So unfortunately that means that if you're not enrolled in the course at this point, the likelihood that you're going to get in is very low. I mean, people will drop the course over the next week or so, and then we'll take people off the wait list. But unfortunately, I just can't take everyone. So you're more than welcome to audit the course. Just let me know that you're auditing. But we can't officially hold that many people. So the way we enroll students as current students drop out will be just on your wait list position on S3, on the registrar's website. So again, I'm sorry, but just everything's available online. So you're more than welcome to follow along if you want. All right, so 1545645 is at its core. The course is about the design and implementation of database management systems. So that means that this is not a course on how to use a database to build like an application, like a website or something like that. Or how to administer a database, how to deploy one, set one up. We're not doing any of that. We're really focusing on how do you actually build and design the software that is a database management system. So that's not what you're looking for. If you're not looking to know how to build a database management system, then there's two other courses that you should consider at senior that are outside of SCS. And the one in particular you might want to look at is 95, seven or three at the Heinz College, which I think is information systems. And again, that's about how to like set up and administer a database management system, but you don't really understand. They don't really go to details about how you actually build the software. The other thing too that people often ask is about, there used to be another course at CMU 1545615, something that I and Professor Chris Felutas has taught in the past. That is not being offered this semester. I don't know whether it's going to be offered in the spring, probably unlikely. So right now, I think the only courses available in SCS would be this course 445 and then 826 in machine learning or cross-listed in the CS department. So again, there's only Chris, I myself, we can't teach everything. So unfortunately, there's no other database class other than this one. All right, so the course outline is that we're going to be going through and discussing how to build a disk-oriented database management system. And I'll explain what a disk-oriented system is in a few lectures, but basically we're just writing data out the disk. We assume the database is on disk. So it's sort of broken up. The outline for the topics we'll be discussing is broken up sort of into layers of the system, right? So we'll talk at a high level what relational databases are, and then we'll talk about how to store them, how to execute queries on them, how to run transactions on them, how to recover them if there's a crash, we need to restart the system. So up to that point, up to recovery, that's the core knowledge you need to have to understand how a database management system works. And then from there, we can then build on that and start talking about more advanced topics like distributed databases or various other types of databases that are out there or extensions of relational databases. So again, the way to think about this is we'll go through every single layer of how to actually build the system, and we'll finish up on the recovery. And so at that point, that's the basic need to understand of how a database system works, and then we'll talk about how to extend them to scale them up or scale them out or running them in the cloud environment. So right now, the course website is online along with the syllabus and the schedule. So the basic outline, there should be a lecture twice a week. And then with each lecture, there's readings that go along with it that are supplemental that extend the kind of things that I'll be talking about, as well also provides some course notes, which I'll mention in a second. So at all times, please refer to the course webpage. That should be up-to-date, have the most up-to-date information on what's going on. So unfortunately, we always have to talk about academic honesty. I'll go a little bit more detail of what I mean about this as we go along. But when you ever have a question and you don't know maybe what the right thing to do is, please contact me so that we can discuss whether what you're doing could be considered plagiarism or stealing somebody else's work, right? So in general, again, this is an advanced course, so everyone should be aware that you don't copy code randomly on the internet. You don't copy from each other. But just be very careful because we will check for these things. And as I said, all the discussion and announcements for projects, lectures, homeworks will be on Piazza. We'll do grading with grade scope. Your final grade will be posted on Canvas because that's what seemed to me once, but the day-to-day discussion will be on Piazza. And there's a link on the course webpage now that I'll take you to our page. There is a textbook assigned for this class, Database Systems Concepts. So this is actually a new edition that came out this year. I've looked at pretty much every single Database Systems textbook that's out there. In my opinion, this one's actually the best one. It's the most up-to-date. And as I said, I'll provide lecture notes for topics that aren't covered in the textbook. I have to admit, I haven't looked into detail of the seventh edition to understand how much it differs from the sixth edition. So if you want to get the sixth edition, I'm fine with that. I don't think there should be any major difference. I just may not know exactly how to do it. You can look at last semester's course and see for the different topics what the chapter numbers, because they have changed. And I'm actually not sure whether you can buy this book anymore. Like, you can't buy it as a bound book. They sent me a bunch of loose pages with three-ring hole punches in them. So I don't know what the bookstore has. I don't know whether it's available online in the sixth edition, but it's probably good enough. We won't be doing any homeworks or any problems out of the book directly. We'll provide everything for you. All right, so the breakdown for your grade in the class will be the following. So homeworks will be 15%, then there will be course projects, which I'll discuss in a second. That'll be 45%. So for those of you that are CS undergrads, because the final grade is comprised of 45% of projects, that's why this course counts for the system software elective, for the CS undergrad curriculum. And then there'll be a midterm and final exam, but that's 20%. And then there'll be an extra credit, which I'll announce in a few weeks, when you can get an additional 10% bonus points. And that's, again, that's entirely optional. So there'll be five homeworks throughout the semester. The first one will be a SQL assignment. It will give you a SQL-like database. You have to write some queries for us. But then everything after that will be pencil and paper, because it's a way to work through the more theoretical side of some of the things that we're talking about. But it'll be like filling out multiple choice, and then you just take a picture of it and upload it to grade scope, and we'll provide grades that way. So again, the first assignment is SQL, just because we actually won't be writing SQL for the rest of the semester, because the course projects don't require it. And I think it's just good for you guys to touch it at least once, at least when they advanced up. We'll talk about it next semester. So again, it goes without saying, all these homeworks should be done individually, as well as the projects. But I just want to emphasize this, that you're not allowed to work in groups and try to figure things out. It's not like a theory class. All these assignments can be done, and should be done individually. All right, for the projects, this is the one I'm pretty excited about. So throughout the course of the semester, you will build your own database storage manager from scratch. So you'll start adding pieces one by one and start building out a full-featured database storage manager. So the key word there is storage manager and not a database system, because you're not gonna be able to run SQL or have a query parser or planner, but you'll be able to run queries that are hand-coded that will provide you. So it's more complex than simple key value store, but it's not like a full-fledged system. So the key thing about this is that it's very important for you to keep up the date with the projects, because every project is gonna build one after another. So you sort of have to have the first project working correctly in order for the second project to work correctly and in the third and so forth. So this particular system that we're using this semester is written entirely in C++17. It doesn't mean we're using all the advanced features of C++17. It just means that it's not like C++99, sort of traditional C++ you may have learned in other classes. So because this is Carnegie Mellon, and I assume this is an advanced class, I'm not, and nor the TAs are gonna be teaching you or teaching you how to write or debug C++. I sent out the self-exam on Piazza. If you feel like you're uncomfortable with the gnarly aspects of C++, then you should try to figure out how you try to start learning this stuff now. But you can't come to us at the TAs office hours and say, hey, what does this stack-raise mean? This time it's really spent to be discussing the sort of the more high-level important database concepts that you're trying to implement in your code. So all the projects this year will be implemented on this new academic system that we've been working on called Bust Hub. So all the source code will be released on GitHub. Of course, obviously it won't have the implementation of the piece that you chose to implement, but you'll sort of fill that in. So at a high level, it's a disk-based or disk-oriented database management system that'll support volcano-style query processing. Different parts of the system have a sort of pluggable API so that we can plop in different replacement algorithms or different index data structures or different logging schemes or control schemes. So it's designed to be that way so that every year we'll switch up the projects entirely and have it be different from one year to the next. And we'll slowly build out the system further and further with new features, new functionality, so that after a couple of years we'll have a full-fledged database management system. So you guys are sort of the first one starting off with these first set of projects and then next year we'll modify them and it'll be different. For this reason, we can make it open source because I'm not worried about people next year finding your crappy projects, implementations and copying their code because all the projects will be entirely different. So this is what I was saying on the last slide that you're building basically a storage manager. The database system doesn't support SQL at this point and nor will it this semester. But you'll be able to write queries but you'll write them in sort of physical operator form rather than in SQL and then translate them because that's not something we're doing just yet. So the name of the system is Bust Hub. I will explain offline what that means but we had a nice logo made. And again, we'll announce this on GitHub, we'll post it on the link on Piazza in a week or two when we announce the first project. So we're pretty excited about having everyone work on this semester. All right, for the late policy for the homework and projects, every student is allotted four slip days. So basically for any homework or project you can say from a day late, you can decrement it from your count. So the B on each homework and the project submission, you just say how many late days you use and how many late days you have left. So sort of allow yourself to keep track of how many late days you have. So after you run out of slip days, then you'll lose 25% on the science total points every time it's for every 24 hours that it's late. So again, we'll just keep track of this as we go along in the semester. Obviously, if there's medical mutations or other issues that come up, please contact me and we can accommodate you. So again, as I said before, all these projects and the homework should be done individually. They're not group assignments. You're not allowed to work it out together and submit a single submission together. You should be doing everything individually. Now some code for some projects may be still online from previous years. Don't take that. We're gonna run it through boss, the plagiarism checker. If we catch you running their shitty code, unfortunately, we gotta report you to Warner Hall. So don't do that, because it's stupid, it'll fuck up your life and it makes everything harder, right? Just don't plagiarize, okay? And again, if you're unsure, check the seemiest academic policy or policy for academic integrity or contact me if you're unsure about what to do. And this includes also for the extra credit and just because it's extra credit and it's optional, doesn't mean that you can't also get caught for plagiarism. So don't do that as well. And I'll remind you every single time we put a new project out, every single time we tell by the extra credit, okay? All right, so if you wanna go beyond the kind of things we're talking about this course, if you really like databases, which I do, you have no idea how much I love databases. If you wanna go beyond the course material, there's two sort of ways to get involved in database research or other database topics going on at Carnegie Mellon. So the CMU database group has our weekly meetings on Monday's at 4.30 in the Gates Building on a floor. And this is other students, visitors from companies and people in Pittsburgh or from abroad coming and giving talks about the kind of research or kind of work that they're doing. If you wanna get involved in the development of a sort of advanced system, we have our team meetings on Tuesdays at 12 o'clock, also in the Gates Building. So we're building, in addition to Bust Hub, a Bust Hub sort of the academic system, we have a new sort of full-featured database management system that we've been building for several years now that, again, if you wanna get involved in this kind of stuff, you should come check that out. And I'll send a reminder on this Piazza. I will say also too, if you wanna take the advanced class, 15721 in the spring, all those projects are based on this other system we're building. So if you wanna get started on that and sort of learn how that system works and get involved in the early days of this thing, by all means come to this. Okay, so with that, that's it for the course. And again, please, if you have questions about things, post them on Piazza and I'll respond. All right, so now, let's talk about databases. The databases are super important in real life because they're used everywhere. So remember, every single complex or any computer application you can think of, at the end of the day, deep down inside of it, there's gonna be a database. If it's a mobile phone application, if it's running your desktop, if it's a website, it's some kind of complex computer simulation, at the end of the day, there's always a database. Everyone has database problems. Many things can then just be reduced down to database problems. So a database, the definition I like to use is that it's a collection of data that's related to others in some way that's trying to model some aspect of the real world, right? It's not just a bunch of loose files you have randomly sitting on your laptop, right? In some ways, it's a database, but it's not really a useful one because you can't ask questions about it. So this data is usually related together or have some common theme to them and it's trying to model some aspect of something that's going on in reality. So the example I always like to use for this class is that say we wanna have a digital music store, like something like Spotify or the iTunes store, right? And so backing this application will be a database that's gonna keep track of the various artists we have and their albums, right? So what we would put in this database would be basing information about their artists and then information about what albums that those artists have released, right? So that core right there, that's a database. So let's see how we actually could build now an application that could store this information. So let's say that we don't know anything about database management systems, right? We don't know about MySQL, we don't know about Oracle, we don't know about Postgres, we don't know about any of that. So in our own application, we're gonna write this code ourself. So the simplest database we could implement and manage in our application would be just one where we store our data in a bunch of comma-sebrative value files or CFE files. And then in our application code, we're gonna write the procedures, the methods to read this data and extract information we need to answer or questions or queries for them, right? So the way to think about this is like for every entity we have in our application or in our database, like the artists and the albums, we'll store them in a separate file, like artists.csv, albums.csv, and then we'll have some code that knows how to open that file up, parse each line to extract the different attributes about that these files are storing. But so let's say, again, we have two entities in our database, right? We have the artist and artist has a name a year in the country and then we have their albums, we have the name of the album, the artist put out the album and the year that put it out. Right, so again, if we're just storing this as CSV files, we would have quotation marks for each attribute and then columns that would separate them. So let's say now we wanna write a query that could look at the artist, you know, the artist file and try to figure out the year that Ice Cube went solo. So Ice Cube was a founding member of NWA here in LA and then he left them because of money disputes and he went solo. So if we have the CSV file that has this artist information, you know, we could write some simple Python code that would just iterate over every single line in the file, right, he would have a function that would parse it, just basically split each line up by commas, get back an array of attributes and then we just check to see whether the first attribute equals Ice Cube and if so, then we'll convert the second attribute to an integer because that's the year we just print that out, right? Really simple coded to enter this particular query. So there's some problems with this approach. So, and this will go through these problems of why you don't wanna manage data like this in your application and this will motivate for why we wanna build a sort of a general purpose or a database management system that can handle all these things. So the first question is, you know, how can we ensure that in our application that for every single album that an artist puts out, the artist field in that album file is guaranteed to be the same? Like how do we know that we don't have a spelling mistake for Ice Cube, right? And then if we end up doing that, how do we, you know, if say Ice Cube changes its name, how do we make sure we fix all those things, right? The next issue is how do we ensure that the data we're storing is valid for the different type, right? So the album year should be a four digit number but what happens if someone puts in a random string in that place, right? They can, anybody can open up a file and modify it because it's just a regular file on disk but now our application does that parsing and it sees a random string when it expects to see an integer and it's gonna throw an error because it's like someone modified this data in a way that I, that was not expected. And then the next issue is what if we have now an album that has multiple artists? Well, that's problematic because the way I set up my file, there's only one artist expecting field, right? So I mean, I could try to store that within the quotation marks, a bunch, you know, a bunch of color separated values inside of that thing but now I need to go look every single time I'm looking at the attribute and say, is this, you know, is the artist name an array itself or is it just, you know, just a string? So again, you have to write all the specialized logic to deal with these, you know, these particular problems in your application. So implementing this is not easy, right? So how do we actually find a record? So I showed my sample example as a for loop to iterate and parse every single line to find the record that I was looking for. And so, you know, my sample file had three lines so that's not big of a deal, right? So that can be done pretty fast. Well, you know, what if I had a billion albums? Do I really want to be opening the file every single time scanning and parsing every single one to enter every single query? No, right, because that would be really, really slow. Now, the next issue is that would have say, you know, our application that showed here was written in sort of, it looked like Python code, but what if now I want to use, write another application that's written in another language and then I want to use that same database, right? So let's say that the example code I shared you was running on a web server, opening up a file, parsing it, producing the answer, but now I have like a mobile phone application that wants access to the same database. Well, my mobile phone application might not be written in Python, might be written in another language, therefore now I have to duplicate all my logic to parse that file in my web app or whatever other application. And of course now how do I start sharing things, right? That becomes problematic as well, right? So again, same thing, what if I have two threads or two programs, two processes running at the same time that want to write to the file at the same time? What's going to happen, right? If I don't do anything special, then the first guy will write something, then the second guy might just overwrite it and I'll lose my changes to the first guy. So now I start losing data and my data ends up becoming invalid because it's getting garbled. So that's problematic. All right, the last issue is how do I ensure that my data is safe? So let's say that I'm updating a record, I open a file, I start writing to it, but then before I finish writing my update, the machine crashes, my program crashes. What happens, right? Should that update be there? Should it only be half updated? How do I reason about what the correct state should be? What if I, again, I want to say, well, I don't trust the machine that I'm running on so therefore I want to replicate my database, my files to a bunch of different machines so that if one machine crashes, the other one can just pick up and I can run, keep running without anybody noticing. And you say, oh, maybe I can use a distributed file system, but those things aren't general purpose usually and that can be difficult to do. So for a variety of these problems as well as others, this is why you don't want to write the kind of stuff that we talked about, of parsing a file and reading it in your application. You want to offload this or you want all that sort of complex logic of how to manage the database, you want a database management system to manage that for you. So a database management system is specialized software that allows applications to store and analyze information in the database without having to worry about the underlying details of how to do that, right? And it's software that can be reused from one application to the next so that you're not reinventing the wheel all over again. So a general purpose database management system, the kind of things that we'll talk about in the semester are designed to allow applications to define, create, write queries against, update, and miniature databases, right? And for our purposes, we're going to assume our databases are stored in disks. They don't necessarily have to be like the NMD databases or the GPU databases or other things, right? For these, we don't need to discuss those things yet, but just know that there's a variety of different databases out there that can do a bunch of different things that are specialized in different ways for a variety of applications. So again, I love databases a lot. I think about databases all the time. I love writing my databases or reading my databases. And you may think, all right, this is crazy. Why would anybody love databases with data management systems so much? So you have to understand and think about this. At CMU, we're a large school. We have courses and everything, right? You know, in this course is for operating systems, course and for networking, things like that, but database management systems are sort of a special class of software that are so important that like, there's full-time people like me that teach a course just on this, right? Like a web browser is important, but there's no class on like how to build a web browser, at least as far as I know, right? Whereas like databases are so prevalent and so widely used everywhere and they're really hard to implement correctly that there's an entire course, like this course and talk about how to actually how to build it. So again, I think they're a unique piece of software that, a class of software that is definitely a hot area right now, you know, at the end of the day, machine learning needs data, you store that in the database. And so every single application, as I said, at the end of the day, there's a database underneath it and running, almost everything. So this is why a course like this, in my opinion, is super important. All right, so databases are obviously, systems are not new. The first one came online, I think in like 1965 at General Electric and people that sort of in the early days of computing, the people quickly realized, hey, it'd be nice to have specialized software, like a database management system that can manage large data sets for us. So you have to understand back in the day, it's not like how it was now. In the early days, some things that we'll talk about in this course that we sort of take for granted now, because oh, of course, this is how you wanna do certain things. Back in the 1960s and 1970s, it wasn't obvious that this is the way to do things. So in particular, the story I like to talk about is back in like the late 1960s, there was this guy, Ted Codd, who worked at IBM Research. He was a mathematician, but he got hired to work at IBM Research in New York. And he noticed that people that were working on databases spent a lot of their time rewriting their database application over and over again because there was a tight coupling between the logical layers of what's in the database and the physical layers of how the data system was actually going to store it. So what I mean by that is like, say you have a database, and you wanna store it in this database system, and you'd have to tell the database management system, oh, I want you to store this as a hash table. I want you to store this as a tree. And then when you did that, it would then expose a different API for you based on what data structure you chose. But then now, say you change your mind, oh, I told you I was a hash table, but I really wanna run range queries on that. So now I wanna be stored as a tree. So what you'd have to do is dump all your data out, change your application code to now make calls for the tree API to the hash table API, and then you have to reload all your data back in. Right, because again, whatever, you're supposedly told the database management system how you want it to store the data. So the TedCon realized that this was kind of stupid, right, this is problematic because it was, people were basically refactoring the code all over and over again every single time there was ever a change like this. So now back then, humans were cheaper than the computers, and so, but you could easily see how this was not scalable. And nowadays, computing is cheap, with, you know, Amazon and Microsoft and Google, the cloud computing, and humans are expensive thing. So this problem is even more problematic, you know, this issue is even more problematic today than it was back in the 1970s, but TedCon, you know, quickly realized that people were wasting their time in fixing up software that they didn't need to. So what TedCon proposed was this thing called the relational model. So he had the first paper came out on this in 1969, but the one that everyone cites is this one from the communications of the ATM that came out in 1970. So most people read the one that the relational model of data for large shared data banks, but this is sort of the seminal paper that sort of started the whole relational data model revolution of saying this is how you want to store data in this kind of systems. So this is very influential and this is sort of the backbone of what we're talking about in this course. So there's sort of three key tenants of the relational model that TedCon proposed. So the first is that we were going to store the database in simple data structures as relations. So a relation doesn't mean like, you know, I'm, you know, I'm related to my parents or whatever. It, relation is essentially a synonymous for a table. All right, it's the relationship between the attributes stored in the tuple for that given table. So again, we would define at a high level that we're going to store all our tables at our database as these relations and then we would access them through a high level language, meaning we wouldn't write the exact code that we want the database to execute in order to retrieve the data that we wanted. We would just say, hey, we want you to compute this answer. Please do it for me. We wouldn't actually say, actually haven't do it. So this idea was actually pretty, was really revolutionary because back then everyone was writing explicit procedural code. You know, here's the for loop to iterate through the table and find the data that I wanted. The example that I showed in the very beginning. This is how people write database applications before the relational model. And this is actually pretty controversial at the time because everyone was saying, oh, there's no way software can ever produce a query plan that's as efficient as what a human can do. Right, this is sort of the same argument that people made in 1970s about compilers. They said, oh, no compiler could ever write, generate machine code as efficient as handwritten assembly. And of course nowadays nobody writes, or very few people write low level assembly. Everybody writes in high level languages and the compiler does a pretty good job producing the machine code to execute it more efficiently than what a human could do. Same thing was said back then, right? The database system couldn't produce a query plan that's as efficient as what a human could do. And of course nowadays we have very complex query plans. Map reminers aren't perfect, but they can probably do a better job than what most humans could do. All right, and then the last key idea was that the physical storage strategy for a given database was left up to the implementation of the database maintenance system. So again, we define our database through these high level structures as relations, but how those relations are actually stored is really up to the implementation of the database system. So the relational model doesn't say whether it should be in memory, it should be on disk, how should we lay it out on disk, or how should we organize memory. All that is transparent to the application. And so now the benefit of this is that if for some applications their database may want to be stored in one way, and then over time if the application evolves, it may be better to store it another way, and then we don't have to rewrite our application because we're still writing at a high level language like SQL and we're still accessing these relations, but underneath the color of the database system can start moving things around for us or changing its layout or recompiling certain things, changing the data structures, and we don't have to change any of our application code. So we have a clean separation between the logical and physical layers, which is absolutely what we want. All right, so the relational data model is not the only data model though. It's certainly the most widely used data model, and in my opinion, I think it's the best data model. But there's different data models for different types of workloads, and the relational data models sort of catch all that can encompass a lot of things. It's not to say that the other ones are bad or wrong, it's just a relational model nine times out of 10, that's probably what you need. So again, the data model is the high level concept of how we're going to describe the data in our database, and then the schema that we provide for a given collection of a data in database is the definition of what we're actually storing it. So I'll show you what I mean by this in a few slides, but the data model is essentially the high level concept of how we organize the data, and then the schemas to say, all right, what is the data we're actually storing with this application for this given data model? So again, as I said, the relational data model is just one of several other data models. I'm showing a small sample here. Again, most database systems that you know about today and the ones we'll cover in this course, my SQL, Postgres, Oracle, DB2, SQL Server, SQL Lite, all of these are relational database systems, use the relational data model. If you're familiar with the term no SQL, the no SQL systems are usually these key value graph, documenter, JSON, or column family data models. And again, I'm not saying necessarily that the relational data model is better than these other ones. There's certain application domains where some of these could better describe the data than the relational data model, and the relational data model has certainly adopted some of these, some concepts or ideas from these other data models in newer versions of it. It's just, you know, the relational data model can model all of these things. It just with bearing benefit, or advantages over the basic data models. All right, you can also store arrays and matrices. That's considered a data model. This is mostly what people use in machine learning. There's some databases that can store matrices and arrays. They're not that common. People usually use data frames or other things. That would be an example of a matrix data model. The hierarchical and network data model or the multi-value data model, there's a bunch of these other ones, these are more esoteric. These are what the original data models were used in the 1960s and 1970s. You know, sort of around the same time the relational data model came out. You almost never will see these things. If you're like a new startup or building an application for the first time, you definitely don't want to use any of these things. They're mostly in legacy applications. It's just the remnants of old software. I would say this is not necessary to even consider them. So again, for our purposes, we're focused on the relational data model. That's what we care about in this course and that's what we're gonna focus on. Because again, relational data model can be used to model anything. So the relational data model is comprised of three parts. So again, the first is the structure of the relations. This is like the schema, how we're gonna define what are in our relations, what their attributes are, what their types are and so forth. Then we'll have the integrity constraints that we'll define to specify what is a valid instance of a database given the structure, given the schema you provide. And then we'll have a way to manipulate and access the data in our database. How do we actually run queries that can either extract information or modify the contents? So let's go back to our music store example and look at some concrete examples of what a relational model database looks like. So again, a relation is just an on order set of elements or records that have attributes that represent entities, instance of entities in our relation. So again, an artist has a name, has a year in a country and we can see in our relation example here, we have for each one record is sort of one row in this diagram and we have all those attributes. So we would refer to a record in the relational data model as a tuple, right? It's gonna be the set of attributes for that instance of an entity within our relation. So in the original data model, sorry, the original relational model written by Ted Codd in the 1970s, all these values had to be atomic or scalar values, meaning they couldn't be arrays, they couldn't be nested objects and so forth, right? You always had to be sort of, you know, with one string, one integer, one float. So again, that's what the original data model, relational data model talked about, but in recent times, you now can store arrays, you now can store JSON objects in relational databases. That particular strain has been relaxed. Now, there's also gonna be a special value in the domain of every aspect of a store that called the null value, that means that the value is unknown. So that'll cause some problem when we start running SQL queries, but pretty much every database management system supports relational database systems, supports null values. But how you actually store them is left up in the implementation and there's a bunch of different ways to do that, which I think is pretty interesting. All right, so then just, so you have to understand the parlance will say that an array relation is one, is a table that has m columns, right? So I'll use the term relation and table interchangeably, they mean the same thing for our discussions. I'll say tuple sometimes as a record. I'll try not to say row because that's something very specific when we actually start talking about storage models. So, but all of those words can be used interchangeably as well. All right, so one thing that we'll also have in our relational model is primary keys. So primary key is gonna be a unique attribute or set of attributes that can uniquely identify a single tuple. So in this example here, we actually don't have a primary key for any of our attributes because there could be other artists called the Wu-Tang Clan, but there aren't, but there could be, right? There's nothing in the world that prevents somebody from calling them that themselves that. So we can introduce a new ID field as a unique primary key so that you know if you're looking at the tuple with the ID one, two, three, you're explicitly looking at the original Wu-Tang Clan. So in this example here, this is like a synthetic key that I've introduced into my relation, my schema here, right? In a real artist, they don't have an ID one, two, three. This is something we're using in our database minute system for this particular rate in relation to uniquely identify the tuple. So there's different ways to generate these things. So there's some systems, at least in the SQL SQL standard, when you can automatically generate these auto incrementing keys. So every single time I insert it in tuple, there's some counter that increments by one so that that new tuple gets that unique identifier. Other systems, if you don't specify a primary key internally, they'll create one. They don't really expose it to you, but they'll use that to keep track of this as the particular physical record or physical tuple that you're looking at, right? For this example here, it's a synthetic one that's exposed through the logical layer. So this could have been generated through an auto increment key. There's also foreign keys in the relational data model. So foreign key is a way to specify that an attribute from one relation has to exist in at least one tuple in another relation. So this is how you're gonna maintain integrity across different relations so that you don't insert things that map to unknown or non-existing entities in another table. So going back again to our example here, so we have the artist, we have the album, and so for the album relation, we may wanna store multiple artists that could all collaborate on an album together. So we have this artist field now here, but in this particular example, for this next tape, there's multiple artists, so I can't encode multiple values in the artist field because I said there had to be a scalar. So instead, I'll make this separate cross-reference table here, where I'll have a foreign key reference from the artist ID that is part of this album and then the album ID that they're involved in. So now I can store multiple artists on a single album. And the data management system will keep track of whether these albums and artist IDs actually exist in the relation so that prevents me from inserting an album that has artists that don't actually exist. The data system can protect me from inserting bad data. And again, I avoid having to implement all that extra code myself in my application, which again, relieves me from writing really hard code. All right, so now we wanna talk about how we actually wanna get data out of our database. Again, I showed that simple example of writing the for loop, but I said that was a bad idea. We don't wanna have to keep doing that for application. So what do we wanna do? So the DML is a way to manipulate the data and either accessing it or modifying it in a way to produce the result that we're looking for. So there's two ways we could do this. So the first is procedural, meaning we'll specify at a high level strategy of how the data systems should find our particular result that we're looking for. So I don't mean exactly like the procedural code, the Python XE about two or four, but just sort of a high level say, hey, this is what we want when you do execute. The other approach is called non-procedural or declarative, and this is where we're gonna say, this is the answer we want the data system to generate for us, but we're not gonna specify how to actually produce it or actually how to go about in getting it. So we're gonna say, hey, at a high level do this and we hope the data system can figure out how to have an efficient way to do it for us. So we're gonna talk about now is relational algebra. And this is the example of a procedural language. And relational algebra is not hard to understand. It'll come up later on when we talk about query execution. But this is what Ted Cod originally proposed in the 1970s. A non-procedural example would be something like relational calculus. And so we don't need to discuss this at all. Not because I don't think it's actually necessary, at least for our purpose in this course, but relational calculus will come up if we start going in detail about query optimization because you have to use calculus to derive efficient query plans and new pruning and things like that. So again, for this course, we all need to focus on relational algebra. I'm just bringing up relational calculus to say that it exists, that it's out there. It's an example of a non-procedural language, but for our purposes here, we don't need it. All right, so Ted Cod proposed seven fundamental operators in the relational algebra. So these, again, these are the fundamental operators we have to retrieve manipulate tuples in a relation to produce answers that we're looking for. So one important thing to mention is that this algebra is based on sets. So the set is an unordered list or unordered collection of data where you cannot have unique values. Now when we talk about SQL, SQL is not gonna be based on set algebra even though it uses relational algebra as the underlying operators for doing query execution. And I'll explain again when we talk about SQL next time what that means. So the way this is gonna work is that each of our relational operators is gonna take in one or more relations as its input and then it's always gonna output a new relation. And so the idea is that we're gonna compose complex queries by chain together these relational operators together to produce the answer that we're looking for. So these are like the primitives you would use to do some operation on the data on tuples and then to produce a more complex query, we put them together. So we select projection, union, intersection, difference, product, and join. So again, these are the fundamental ones and we'll go through each of these. So the first one is select. Select is basically taking a subset of the tuples that satisfies some selection predicate. So you can sort of think this is like a filter. You say here's a given input relation, here's the predicate that I want you to evaluate for every single tuple. And if it evaluates to true meaning the predicate is satisfied particular tuple, we'll add it to our output relation for the select operator. So the easy way to remember this, the word select starts with an S and so the relational operator is a sigma which also starts with S. So you would say, here's, I want to, I want to, for this given input relation, here's a predicate to run and produce our output. So doing a real simple example here, say we have a simple relation, has two attributes, AID and BID. And so we could do a select operator like this that says find me all the tuples where AID equals A2. And then our output would be a new relation with these two tables and the same schema or the same attributes as the input relation. I also can combine together using Boolean logic, combine together more complex predicates. So I can say we're AID equals two and BID is greater than one or two. And again, we're not specifying in the order in which you apply those predicates, because it could be more expensive to do one versus the other. That's all left up to the data system to decide on its own what to do. So now if you know SQL, the way this would write, we just write this in SQL for that second example, is to think of the select operators like the where clause. So you say where AID equals two and BID is greater than one or two, right? That's a translation of this select operator into SQL here. So the next operator is projection. And here what we're doing is we're gonna generate a new output relation that contains only a subset of the specified attributes from our input relation, right? So you would say here's my input relation and then my projection is here's the address that I only want you to omit to produce in the output. And again, projection starts at the P and uses a lowercase pi symbol. So it's easy to remember which one it is. So again, here's our simple example with relation R with AID, BID. So in this case here, I can have a, I first do a select to just produce all the tuples of R where AID equals two. And then now in my projection clause, I can say I want the BID value subtracted by 100 and then as well as the AID value. So I can reorder my attributes any way that I want and then I can manipulate it in any way that I want with any kind of arithmetic or kind of string function operator. So again, in SQL, it would look like this. So you have in the output clause or your select statement, you would say BID less than 100 and AID. And then the where clause is just, you know, what you did in the select operator. Our unit operator is where we're gonna take two relations and we're gonna produce a new output relation that contains all the tuples in either the first relation or the second relation or both of them. You just combine these two relations and match them together. So if you do the union on R and S, you get a giant, you know, you're basically concatenating S onto R as the output. And so for this particular operator, you have to have the same attributes and the same type in the two relations you're trying to union together. Right, so in this case here, if S had, you know, if had a third attribute, you wouldn't be able to be union because it wouldn't match. So in SQL, there is a union all operator. So the union one is not exactly the same as the union operator in relation algebra. You have to add this all qualifier to get it to do exactly what you wanna do. So again, we'll discuss what the difference of these two is in the next lecture. There's also intersection, same thing. So now what you're doing is taking all the, you're taking for both the two input relations, you're just gonna produce the output relation that has tuples that appear in both the two input relations. So again, taking the union R and S here, we'll just produce a single tuple that has, you know, sorry, single relation that has one tuple, A3 and 103. So again, you have to have the same number attributes to the same type and the same name in relational algebra in order for this union to work. And then there's the intersect keyword in SQL that will do the same thing. Difference is where you take all the tuples that appear in the first relation but not the second relation. So again, I can take the R, the difference of R and S, it's just all the tuples that don't appear in an S that appear in R. In this case, A3, 103 appears both in R and S. And so it's excluded in the output but A1, A2 don't appear in S, so they're in the output. And in SQL, you use the accept keyword or accept operator to perform the same kind of operation. All right, then the second last one to do is the product operator. So this is also sometimes called a Cartesian product or if you know SQL, this is I think called a cross join. So basically you just wanna generate all combinations of all tuples from the two input relations. So you're basically just taking the cross product of these two and just producing a giant output every single unit combination of all of them. So again, in SQL, you use cross join or if you don't actually specify that what the join is at all, you get this output here like that. So you may think this is kind of stupid and this is like what you might wanna do. You never wanna do this but this actually appears sometimes in testing, some testing applications where you wanna say, give me all the unique combinations of different configurations of the thing I'm actually trying to test and this is a really simple way to actually do that. All right, the last I'm gonna talk about is the joins. So this particular join is called a natural join. It's not the kind of joins we'll talk about in the future but not necessarily the type of join we're talking about in the future but this is very explicitly talking about a natural join. So the way the natural join works is that you, for every single tuple in the output out one relation, you see whether it matches all the attributes in the other relation that have the same name, the same type, so the common values of them, right? So in this case here, if I wanna do a natural join on R and S, I look at AID and see whether I find a match. I look at the AID and R, see whether I have an AID match in S and then if I do, then I check to see whether I have also a BID match from R and S and then if I do, then that's produced as the output here. So again, A3103 exists both in R and exists both in S and it can produce the output. So in SQL, there's a natural join operator like this. Again, notice I'm not specifying how I wanna do the join. It knows that I'm looking to see whether I had matches based on just the name here. So the reason why this is different than the difference because there could be additional attributes in R and S that don't have the same name in shared across the two relations and so they will actually be produced in the output. So the scheme is allowed to be different as long as there's some common attributes that you're doing the natural join on where in the difference operator, you have to have exactly the same attributes of the same type. So these again, these are just the original main seven relational operators, relational algebra operators that Ted Copp proposed. Since then, there's been a lot of research and been a lot of additional stuff that people have added, right? So like there's things like doing sorting or order pies where we know we wanna do that SQL. The original relational model proposal doesn't talk about these things. Relational algebra proposal doesn't talk about these things but people have since extended the relation algebra and include these other ones. So doing rename, assignment, duplicate elimination and aggregations of group buys, sorting division. So again, we may see some of these, we'll see certain see aggregations in sorting later on. Just to say that there's placeholder algebra operators that we may have to consider when we start doing query plan or doing query execution. All right, so to finish up, I just wanna sort of point out one particular thing. So the relational algebra is still pretty high level compared to like our for loops as before because we're not specifying anything of what they have the data stored in terms of like its data structure and so forth. We're just saying it at a high level, hey, scan this table and do a filter, right? But it's still at the end of the day, it's still kinda telling you what steps, the order which you should perform these steps. So again, let's say I wanna do a join between R and S and then I wanna filter out any tuple where only produce the tuples where the BID equals one or two. So I have two examples of two relational algebra expressions to execute this query. So the first one here, I do the natural join R and S first and then I do my filter on BID equals one or two. And then in the second example here, I do the filtering on S first where BID equals one or two and then I take the output of that relation and then I do the natural join. So again, you may think at a high level, at a high level, these are the same thing, they're producing the same result but the performance characteristics or how efficient the data system will execute these two query plans can be vastly different, right? Like say I have a billion tuples in R and S and there's only one tuple in S where BID equals one or two. So if I do the join first, I'm gonna take a join of billion tuples and then I'm gonna scan through and just find that one tuple where BID equals one or two. Whereas in the second plan, I could do the scan on S first, find that one tuple where BID equals one or two and then now I'm joining that one tuple with the billion tuples in R. So again, even though we're not getting at the low level, like you run this for loop and do that, it's still specifying relation out of the steps we actually wanna, the data set we used to execute for our query plan. So what we really wanna do is be able to just say at a high level, tell the database system, hey, this is the answer I want you to compute and not actually specify how you want it to compute it. So I wanna say, hey, go just give me all the tuples to join from R and S where BID equals one or two. Didn't say whether you should scan S first or join with R first. I say this is the answer that I want. And the reason why we wanna do this again because now in our application, we just specify this high level answer we want the system to compute. And if our database is really small today and then it grows really big a year from now, I don't have to change that by code. The database could figure out, oh, we had a small database before, I wanna execute it one way but now you have a billion tuples in S maybe I wanna do the filtering first but I didn't have to change my application code to make it do that, right? The database could do this for me. So this is what we're gonna try to do by running queries, in particular we wanna run SQL. And so the key thing to understand is that although SQL is the standard way people express queries for on the relational model, it isn't the only way you could do this. It just happened to be the one that everyone uses and standardize on. And actually when Ted Cod wrote the relational model paper first in 1970, he actually didn't even propose SQL. He didn't even propose any high level language. He just said, hey, here's a relational algebra. He then later proposed his own language called alpha that was a competitor SQL in 1970. Of course, you've never heard of it because no one ever used it. And actually in the 1970 there was two other competing, there was another competing languages SQL. There's another thing called Quell out of Berkeley for the system called Ingress. And they both look similar. Quell and Alpha and SQL sort of look similar to each other but of course the syntax is very different. It just happened again, SQL just sort of won out it's what everyone uses today because IBM invented it. So the way you think about this is that, again, this is what I showed the very before. This is like the lowest level way you could actually execute this query of finding when Ice Cube went solo. But instead I could just write a really simple SQL query like this and say, this is the answer that I want you to produce. And then the data system could then figure it out. Oh, I want to write it in this particular for loop because that's the best way to do this. And then if I add indexes, then the query plan could change, right? So the beauty of the relational model in SQL is that we can write these high level queries that we want and then over time as the database changes, as the system itself changes or our workload changes, it can adapt and improve itself without us having to go back or change our application. Again, for this reason, that's why I'm super excited about the relational model and I think it's always the right way to go. All right, so just to finish up here, and again, I realize it's awkward for me sitting here in the bathtub giving me this lecture. So I appreciate you guys sitting through this. So databases are ubiquitous, databases are super important, databases are everywhere. So no matter what you do throughout your life, even if you don't go in the field of databases, you're not only going to the field of computer science or computer programming, I guarantee you throughout the rest of your life, you're gonna come across databases and it's gonna be important for you to understand how they actually work on the inside, right? Because when things don't, things aren't working the way they think they'd be working, you need to know, oh, it's written this way, this is what it's actually doing so that you can figure out how to actually improve things. So we discussed the relational model, we discussed the relational algebra and we showed how these are the primitives we can use to essentially use them as the building blocks to generate queries that we can execute on a relational database to update it and derive answers. So we'll see relational algebra again, later in the semester, when we talk about query optimization, query execution, but just in the back of your mind as you write SQL queries for the first homework, you should be thinking about, oh, well, how would, you know, the database system is translating these into relational algebra, sort of thing about how it is actually doing it, how it actually execute these things. Okay, I almost forgot one last thing. The most important thing you need to understand about databases through the rest of your life is the following. When you look back at the 36 chambers, you can understand who were the original nine involved in it. You have the RZA, the JIZZA, inspected deck, ghost face killer, master killer, U-God, meth man, old dirty bastard, Ray Kwan. But the other important thing too is Capodana was in jail at the time so he was actually an original member of the clan but because he was in jail he couldn't be on the 36 chambers. So that's the most important thing you need to understand throughout this entire semester, okay? All right guys, see you next time. Oh, we're coming through with my shell and crew Too sent for a case, give me St. I's crew In the midst of broken bottles and crushed up cans Met the cows in the gym or I'll drop He's with St. I's in my system, crack another unblessed Let's go get the next one, then get over The object is to stay sober, lay on the sofa C'mere's the bottle, I'll be jam-stressed out Can never be sun-rich and say jelly Hit the deli for a part one, naturally blessed, yes My rap is like a laser beam, the fulge in the bushes St. I's been like in a team Cracked a bottle of the St. I's Sippin' through those who don't realize The drinkin' ain't only to be drunk You can't drive, keep my peoples alive And if the St. don't know you're full of can of pain, pain