 violating fire code here. I guess you can kind of maybe sit up there at the front, or you sit on the floor. All right, so a lot to talk about. First of all, we have a DJ again. Give a round of applause for DJ 2PL. All right. You're the fiercest DJ DJ that's not in jail right now. Do you want to do a quick preview? What would you do to everyone? All right. Another big announcement, too, is that for the first time, we now have two faculty teaching this course. Jay Ganesh Patel is in the front row here. He's just from Wisconsin. We're happy that he's here. So he'll be helping me throughout the semester. OK, so let's get going. As I posted on Piazza, there's another course lecture. We already have that. They get moved to out of jail. So we're not going to go over a bunch of those things all over again. I just want to cover some things that, if you might have missed, that are important to understand throughout this lecture. And then we'll jump right into the material. All right. So the first thing, we have a core sponsor, single store, cloud-based database system that supports sort of hybrid workloads, transactions, analytics. As I said in the first lecture, it is a in-memory column store that uses skip lists with just-in-time queries and distributives shared nothing. If none of that means anything to you right now, that's fine. These things are actually coming at the end of the semester to give a guest lecture. This is a long time. They originally called it MemSQL, and then they changed the name to single store. So it's a state-of-the-art database system that's going to incorporate a lot of the things that we'll discuss throughout the semester. The crowd is huge. Obviously, not everyone's going to be able to do that. So if you're here trying to get up the wait list, the bad news is because the wait list is so huge, the department took it away from us. And we have no control over it. Because they're trying to figure out who needs to graduate, who is to get the system elective, and all that stuff. So if you come and ask us, hey, can I get up the wait list? The bad news is that we don't control it anymore. The good news is because we have Gignesh now that we'll be teaching this class once every semester. So if you can't get in this semester, you'll be able to get it next semester. And then the admins will move people out the wait list as people drop. And that's become available. So if you find a different department, you can get them where? Departments. Things I found out since I became faculty. So again, unfortunately, the bad news is if you're not enrolled now, you're unlikely to get them in the spring. For all the important announcements, everything will be in piazza. Project Zero, which we talked about last in the first lecture, that's now available on Gradescope and on the website. The final grades we showed through Canvas and everything will be submitted through Gradescope. Again, because people watch these lectures that aren't at CMU, outside of our university, if you want to test your projects in the same way that you the way CMU students are, there's a separate Gradescope account you can go to using that code there. And you can start and test things out. And we'll release those projects as in exchange for people to watch this. In exchange for making this all available to you, I don't think you have one yet, do you? I got banned because there's shit about there that is sort of true, where they said I was born in the streets of Baltimore. I am from Baltimore. So whatever. And if someone could finish that, that'd be great. And I did write a comment, hey, I am from Baltimore. That's not from the streets. So I want to cover it for logistics. This is a big class. Obviously, there's a lot of people here that may have different backgrounds. And so we want you to interrupt us as we're going throughout the semester. But if you have questions about talking about that's a fine question. I also get very excited when I talk about databases. Again, my wife, biological daughter, number one is databases, and then nothing else. I don't talk about copies. Literally, it's just databases. To confirm, it is just databases, yes. My parents are Trump supporters. We don't talk to them. It is just databases. So it's funny to you, right? All right, so again, so it's not fast. Or if you have questions, you can interrupt us. What we're not going to do is that at the end of the class, have people line up and say, what about slide five? The material we discussed during the lecture as we go along, because I want you to interrupt as we're talking. Because if you have questions and somebody else likely have questions, too, for us to answer these things and discuss it. And from a pedagogical standpoint, it's better for us as well. Because I go back and watch and I see where people ask questions. And I see, oh, that's what you didn't get quite right. Go fix the slides. So having the questions as we go along is helpful. All right, any questions about these logistics? All right, boom, so let's go right into it. OK, so we're going to talk about sort of the ground what databases are why it's important, why we need database management systems, why this course exists. Not just because we do stupid niche, but because it's a super important topic. And then we'll talk about the dominant model for how you want to use it in a database. Then we'll talk about relational algebra, the mechanism, or the operators we're used to interact with a relational model database. And we'll finish up because we have to talking about alternative models that people think of to a relational model. And I have strong opinions of why they're wrong. OK? All right, so the second most important thing in my life. Can I take a SQL server? Yes. Yes, the question is, if you're on the wait list, can you get access to Piazza? Yes, we'll post the SQL server now. We'll make that available, yes. So here's a SQL server in the back. My SQL. Yes, yes. One more, anybody? Cassandra, OK. I think everybody's listed, SQL server, Postgres, Mongo. There's all database management systems, right? We're talking about databases. And don't feel bad that you're not the only one making this mistake. This is Jenny, I think, this year. Let's see if it plays. Tech beat for four, please. A relational one of these systems presents the information to be stored and retrieved in rows and columns. Justin. What is the matrix? No. OK, all right. It'll get worse. OK. All right, SQL database. Not again, I'm being pedantic here, but it's good to understand the distinction between the two. Because when we talk about the data models, you don't understand, OK, we don't care about the system implementations yet. We don't care about what the data actually is and how we're going to interact with it. So the unorganized collection of data that's related to each other, that's meant to model some aspect. Everybody lives in different systems. An example of a database, though, would be the university database that keeps track of what students are enrolled in what classes and your grades. Because that's trying to model the real university interactions that students are taking classes and getting grades. This course is so important because the database, a database, is going to be the fundamental component of the underlying of any application or anything you're going to do with the rest of your life. So if you're a CS student, no matter if you go off and get a job doing CS related things or you're not a programmer, at the end of the day, you're always going to be interacting with the database. Even if it's an Excel spreadsheet, database is one. And this course is important because because you're going to end up interacting with database systems, you need to understand what's actually going to be inside. Even if it's a small application, like running on your cell phone, or a massive application with gigabytes of data, it's important to understand whether it's queries or when you use the database system. What is it actually doing? And what are the trade-offs you would make for how you use one data model versus another, or one database approach versus another? What's the most widely deployed database system in the world? Am I taking a guess? Everyone has a cell phone, right? SQLite is on every single cell phone. And it's written by one dude down in North Carolina. But the helper doesn't have an address. He's like a new mad. So this is an example of a database. So we talk to the university application. We talk about something more relevant to everyone here. Same thing like Spotify or like Bandcamp or iTunes or music. So we create a really simple database that has basically two entities. There's albums that artists appear on albums. So the thing we need to store in our database is going to be like, for every single artist, you know, what is their name? What country did they start? And the album could be like, what's the name of the album? When did it come out? What artists appear on it? And so what we talk about now will do a sort of strawman implementation of a database database. And we'll see why it's stupid and we see all the problems that it's going to have. And then that again, that'll help motivate why we actually want to build a full-fledged database or rely on a full-fledged database. So the easiest way to implement a database system of the manager's data is to just use files on disk. Right? So for every single, for my two entities, I have an album, I have an artist, I have a CSV file, comma-separated value file and every single file, text, every single line, right? Every single line I want to find in my, I'm gonna open it up, you know, parse it, line my line and try to find the data that I'm looking for. Right? So say I want to do something with the CSV, parse the CSV, magic word parse basically, it's just splitting on the commas. Then if I find that the name of the ensemble then I print it out, right? It works, but this is bad. Why? Yes, it says it's gone really, really slow when the number of rows increases. Yeah, so always to think of extremes in software. It's just three records here. Yeah, I can load that up in a single page and parse that pretty quickly. But if I have three billion records, then yeah, and if Jizz is the last entry, then I gotta parse every single line by line. Yes, somebody's here. She says that the insertion deletion would be really horrible given the system restrictions. So insertion wouldn't be that bad, right? Cause if I had to append to the end of it, who cares? Right? Right, so she said, what if I keep thinking forwarded by release year? Then yeah, I have to go find where it should be. That would suck, right? You also said update or delete, right? If I want to delete entry, I get the same problem as looking for it. I gotta find the thing I'm looking for and delete it. So those are importation issues, but there's some other problems as well. So your statement is that I'm representing everything as say strings, and then if there's a what, like a, correct, yeah, so here comment is, I'm storing the years as, well in this case as text files, so it's ASCII text, but my magic parse function, which I haven't defined, is somehow knowing, oh, this should be an integer and it's gonna call A to I, it's gonna cast the var star into a, the string to an integer. And then it's somewhat of the back. It's just a bunch of files on debt. A parsing error, you're corrupt. That's a good one. What are some other problems? You say reading, writing the file, so maybe we sort of covered that, right? Like, if you're inserting at the end, no big deal. Reading, could it make it slow? Yes. We said there's a comment in the name, right, break. The, like, Python packages for C plus that. Yes. Yeah, so you said you're not allowed to have current use of the file. Now, both the writers, things get a little dicey. Yeah. All right. Yeah, perfect, yes. The schema of the file, the schema of the, or the relation or the database, is implicitly, that'll cause problems. We'll see in a second. So we covered these, right? So how to make sure that going back real quickly, right? So I have Wu, I'm gonna go fix it. I gotta make sure that every single place where I have Wu, it all gets updated the same. Again, these are all on disk. The operating system, or the file system doesn't know. It just knows that, you know, whatever you wanna read and write. So we now, in the application, go, we make sure all the entries when we make corrections. She brought up the issue of what someone over writes a year with invalid string, right? We'd throw a parsing error. In my example here, I only have, I'm assuming there's a one-on-one correspondence between an artist and an artist. We can only have one artist, but obviously we know that's not the case in a lot of albums. So how do I go change my files now to go account for that? What if I'm late an artist who has albums? What should happen? Right? Well, in my example here, nothing will happen. And now I got a bunch of albums that have an artist name that doesn't map anything to the... Right, so the reference on integrity gets broken. We brought up a lot of implementation issues already, like how do you find it? In my example here, it's O-N, right? It's a, I have to do a brute force scan, a scruncher scan across the entire file to find the thing I'm looking for. Now, if I assume that there's only one person that, you know, only one artist with a given name, then I can start as I'm done. I start at the same name all the time. So the albums that, I don't know, Gizzo puts out, just like Gizzo and Florida or something like that, I got to keep scanning the entire file. And that's going to be N, or the order of N. In my example here, again, I have a bunch of things that's operating on the files. But let's say I want to rewrite it to now run, I want to write a new application that uses the same database, but I want to write it in Rust, because that's the hot thing. So now I got to go make sure that whatever information or implicit scheme I had in my Python code now maps over to my Rust code. But then now, if my files change, I got to make sure I go change both the Python code and the Rust code. Maybe say the guy that wrote the Python code, you know, is in jail, and we can update it, right? Now, what do I do? I'm going to break. What if the app is running on the same machine where my Python code is running, right? It's running like a microservice on a separate box. How do I get access to that other, how do I get access to that file? And then, what happens if you have two threads trying to write to the same file at the same time? Well, I mean, you can rely on the operating system to do, but again, if I have a billion records and I want to update one of them, don't want to lock a billion records, just do that one up. Ideally, when I have something more fine-grained that can have multiple threads, not something that no one really bought was, how to make sure that our data is safe, if there's a crash? I'm going to go to the file and then I crash before I F-sync. Some of that data actually might've gotten at the disk. So when my computer, I'm going to see torrents, which is not what I want, right? Because now I've corrupt the data. Or now my data is really big and I have a lot of people want to access it. How do I make sure that I can duplicate it, replicate it across multiple machines so they can all service reads at the same time so my application can scale up? Files aren't going to do that for me. So this is just a quick smattering of the management system, right? And why, of course, like this in my opinion. So as our management system is going to be the software that can allow applications to store and analyze information in a database, then ideally not have to worry about all the things that we just talked about. Now it's not always been the case. There's going to be trade-offs to the very, you know, see guarantees in exchange for performance, right? And a general-purpose system typically allow you to adjust, you know, what the trade-offs actually are. And so the database system is going to define or specify inherent instant implementation, how you can go create databases, how you can query them, how you can change it, you can administer the databases, making changes and other things, and sort of correspondence of some kind of data model. Well, I'll say also too that this is, again, no matter whether you go off and actually build database systems, you know, in your day-to-day job, this is going to be important again to know where you go because at some point you're going to be build some application when you have to decide what database benefits I want to use. And that should always be sort of the first choice, that like you don't want to do my, go cheer, but you have a lot of problems. So it's like you're like at a startup and you're trying to build an application or bring your application. Startup doesn't have a lot of money. You're stressed for time to try to get things out the door, to get the first version out the door. Do you really want to be spending your time making sure that your data is safe and there's a crash? Right? Because in the end, that's not a distinguishing feature of your business or your application. Nobody cares that if you actually lose data, so if you crash, you don't lose data. That's considered like, you know, this point in modern computing systems. And so rather than spending all this time trying to rent your own bespoke database system, you should just use something called, chances are it's probably progress in the end of a hundred. That's the first choice you should use. Maybe see it like, but Postgres will get you pretty far at any point that I'm talking about here. So again, a data model is gonna be, again, how we're gonna represent a, entities or a collection of data in our database. So the data model is this high-level abstraction specifies how we represent individual entities and potentially they're related to. And again, we're trying to model some aspect of the real world. Students take classes, students get grades, they're at home department. And then the schema is gonna represent the definition or the description of what those entities in our data model of our database look like. Students have names, students have birth dates, have email addresses and so forth. You would specify all of those in our schema and then we would instantiate instances of a database based on a data model according to the schema. So there's a bunch of different data models that are out there. So this goes back to the very beginning of databases. People will sort of realize, oh, we need to represent is in a database. So most database systems today, something that looks a lot like the relational databases. Relational data model. There's things called object relational with what Postgres is. It just means you can have user-defined types to extend what your relations are or extend the data. But at a high low, most database systems are gonna be relational. I think there's that DB engines, the ranking, the most popular database. I think the four out of the five most popular databases are gonna be all relational. Mongo is the entire. Then there's a bunch of these other data models that sort of fall under this umbrella term of. A few of them are no SQL systems. It's usually the document data model or JSON. One of these. Cassandra essentially has become one of these. But there's key value stores, graph databases, and then why is this sort of best right? There's array, matrix, and vector data models. These are primarily used for machine learning or scientific applications. I think of like a satellite going around and taking a bunch of photos. You could store that as an array or tensors to train some kind of model. You would represent that as a as well. And so there's specialized database that can natively represent arrays. And then there's these ones at the end for old people, hierarchical network and multivariate. These are considered obsolete at this point. They still make a lot of money and use these. The hierarchical model is used in IMF. IBM's first database system that they built to evolve all the parts for the department in 1960. Both, right? Every ATM basically uses, I mean, not the box itself when you communicate to the bank. A lot of these systems are using IMS. Social security, social security in the US government still uses IMS. Anyway, so these things exist, but if you're a brand new part of it, you're not gonna say, I'm gonna use IMS or I'm gonna use IDMS. Like, you'd be insane to do that. But again, it's not really relevant here. Right, for this course, we will look at the relational databases and the relational data model. And this is why it was pissed off about the Japanese name because we go back to it. Justin, the judges have reviewed your response of matrix in the tech beat category and they have decided that it is correct. We've added $800 to your score and all of your wagers were made on the adjusted score. There's no such thing as a relational matrix. I sent them an email, they didn't respond. Okay. Anyway. Got a PhD. I think it's like neuroscience, but like, you know, whatever, all right. So, hopefully again, if nothing else ends this course, you can answer that question. So let's go back to the 1960s and this will again motivate why the relational model came about and it'll set us up for watching why it's still prevalent and why they use today. And opinion for any new. Yeah. Back in the 1960s, early 1970s, there wasn't a relational model, wasn't relational database. So there wasn't Postgres, there wasn't MySQL, SQLite, Mongo, right? The first database that people were building were not meant really meant to be general purpose, but then they realized, oh, instead of building these bespoke database applications or database systems for just this one application, I can make it a bit more general and now I can support any possible application. So as far as, you know, the first database system ever existed built by General Electric, which then sold to Honeywell in the 1960s. But the IDS, they originally built to keep in Seattle and then they realized, oh, instead of building this just for the, we can make some tweaks and then make it so the company could use it or a phone company could use it. And that sort of became one of the first general purpose database systems that ever ever built. I met with the IDS and realized, oh, we could use this for other applications and they generalized it. And then CODESIL is a standard that defined late 60s or 70s as a way to interact with data systems. People realize instead of having these standard, these one database systems that were being built all the time, we would have a standardized way for how CODESIL applications are less than 5%, right? I'm assuming who here sort of heard of CODESIL? One, okay. So why? You're right in the book. Got it, okay, you're right in the book, all right. Yeah, so people, CODESIL and these early systems, this is the way we'd interact with the database system and a lot of the things that were inherent in these first implications are things you would not wanna do today. And so one of the big things was in these first systems they had this tight coupling between the logical layer, like the schema, what entities do I have, what are their attributes, and then the physical layer in the system, meaning like how is actually being represented on disk or in memory and how would I actually interact with it, right? And so what happened is if you were a programmer at the time and you wanted to use one of these database systems, you had to know exactly how the database system was storing your database. And so that, because that would expose a different API to you to interact with it. On IMS you could store data as either as a hash table or a B plus tree or a tree structure. And then you've got different APIs because depending on which data structure we're using because hash tables you can't do scans, right? But in tree structures. So now the problem with this is that any single time I made it, I wanna make a change to the schema at both the logical layer and the physical layer, I had to go rewrite my application almost from scratch because all how the database was represented on disk changed. There's other issues with the data model, we don't need to really go particularly like in the higher quality data model, you had to sort of have these four loops traverse, you know, one collection versus another collection. It was really inefficient way to interact with the data system. What happened was in the late 1960s, there was this guy, Ted Codd, who was a pen and he was walking up in New York where the big lab is and he saw all these IBM developers primarily working on, saw them rewriting the database applications over and over again every single time there was a schema change, every single time it was a physical layout change, right, humans were roughly. So people, okay, yeah, we'll just put more humans up at the problem and, you know, that's better than buying a, so my example, he pointed it out, embedded in my application, Well, thinking of it now, that was like the idea of the hardest. I had to go rearrange the data, it's a trivial example, but that was the kind of shit they were doing back in the day. So Codd saw all this and said, you know, there's gotta be a better way And he changed the functional model as an abstraction of how to represent a database and the relationships between entities in a database and avoid having the tight coupling between physical layer and the logical layer, meaning if you define it in mathematical terms, I had these relations. Here's the manipulations or changes or lookups I could do on them. And I don't know, don't care, how the database system underneath the covers is actually going to store it. Because the database system knows what the data is, knows what the queries are. It can be a better position to decide how to. Someone could. So, I first proposed the relational data model in 1909. There's this tech report paper that came out like that. And again, it's a mathematician. He didn't find programming language at the time. He's not saying how you actually implement this. He's just defining what relations are and the algebra to manipulate them. So there's three key tenants of the data. But these are the three ones that we're going to care about the most. So the first is that we're going to store all our collections of data in simple data structures. For example, that's the mathematical term, same thing as the table. If we're coming from the Mongol world, and we call it a collection, it's the same idea. All these are simple data structures. I need to worry about all the memory sets that all these other data sets. Second is that you're going to keep the physical representation of the database itself, like the actual bits and bytes, on disk or in memory. You're going to leave that in your relational data model. Leave that for the data system to figure out what's the best way to do it. Because the idea now is maybe if I store the data in one way, but then as I see what kind of data I'm putting in, what kind of query I'm doing, the data is to the side, oh, I really want to store this on two machines or two cells or break it up in different ways or sort it in one way. And again, now your application doesn't change because you're just writing things queries at a high level. And you don't really care how short underneath the covers. And that's the line here. Like you're going to define at a high level. You're going to access data through a high level language, and then let the data system figure out what's the best way to retrieve the data that you're looking for, or to do whatever operation you want it to do. So Cod won the Turing Award for this in 1981. The side event in Kodasil, he won it in like 72. But it took 10 years for people to realize he was wrong. And it didn't take away Turing Awards. Cod is dead. The Codasil is dead in the 80s. And then the other guy that won the Turing Award in Database, Gene Gray, he got lost in 2006. And then the last Turing Award in Database is Mike Stoenberger. And he's Postgres and was an IPG advisor. We'll cover Stoenberger. So there's three key aspects of the relational data model. There's the structure, how we define what the data is actually is, what's in our relations, what are their attributes and their types. There's the integrity methods that specify what data is allowed to be stored in the database to make sure that every email address, I don't have any records without an email address. Or in my example, I can't have an album that has an artist that doesn't exist in the artist. And then the manipulation mechanisms allow us to find how we're going to access the data and update it. And the last one was what we'll talk about when we start talking about methods. An unordered set of data that's going to contain the relationship of attributes that represent some entity in the real world. So the relation model doesn't mean the relation between tables. It's really the relationship between attributes within people. So again, a student has an email address. They have a phone number, a birthday, a home address. It's that relation of those attributes combined together to represent some. So we'll have a set of attributes. I'll use the word triple. But it also can mean record or row. And at a high level, they all mean the same way, same thing. And so for every single value in the original definition of the relational data model, all the values have to be atomic or scalar. But again, as things have evolved over time, again, COD didn't foresee these things back in the 70s. This is why we support this today. And of course, our friends all have to be there. So in some cases, some attributes could have a value that is null, meaning it's unknown at the time. And again, you can specify in the schema what you want an attribute to support. So in every relation, there's going to be a primary key that's going to use or uniquely identify a tuple. Again, think of like your student ID or your Android ID here at the university. And that allows us to know exactly the single record that we want when we're doing something like this. An example here, a particular entry in our list. But again, your Android ID is unique. And so that would be a primary key for the university database. What you can do in cases where you don't have a good primary identity column is the primary key. You see this a lot in ORMs. If you're like, you can go SQLize if you're using Node.js. A lot of times these frameworks will make these primary keys for you like this, right? And so the primary key is a constraint that the database management system will enforce for you on your relational data to make sure that within any given relation, there does not exist a multiple records with the same primary key. Talk about how to do automatic identity columns throughout the semester. But there's basically ways database management system will can generate these columns for you. They're called foreign keys. And this is to enforce referential integrity to make sure that if I have a tuple in one relation that refers to another tuple in another relation, that the database management system does not have an orphan or a broken because you delete the record from one row to a dangly pointer in another row. Going back here to our database in this table. But some of it say we have this mixed tape here, right? We have a bunch of artists appear on it. How do we actually put it on this mixed tape? And I said before the relational data doesn't support arrays, at least the original version of it. So in the data model, if you do this, you would have a cross-reference table where I can keep track of the unique pairs of the relationships between the artists and the albums. And now, again, again, it doesn't work for you. So you just write the same delete command. And the key is not all of them. This is my string and approach with piping code. I'd have to do all that myself in my application code to follow those pointers to go find the things I'm looking for to make sure that I don't have these dependencies. So the way you do this is so it is if you're going to violate the last string you can have, or call it string strings, our protection mechanism, you can have the data to enforce to make sure that data follows the proper, has values that conform to some kind of domain, or that you don't have values or tuples that they take as components and other things. The idea is that you can specify, for a given attribute or within the global, within a given attribute or given table or across multiple tables, I can have every single time I make a modification or any of these constraints being violated. If yes, then I throw an error and I don't let you make any changes. So again, back to the example she brought up before, well, what if someone writes a string character in your column for my CFP file? The data system will actually prevent you from doing that if you had these encountering constraints. Because we would know, OK, you're trying to sort of a character into an integer column. I can't let you do that in a questionnaire. You can do other things like make sure that the year has been greater than 1,900, or greater than 0. And the data system will enforce that for you. The most common constraints that are out there are things that could happen for you automatically and doesn't let you make the change. But again, we define this in our schema and the relational data system will make sure that the data system is getting enforced. So now, no matter if Python code, he's in jail, you have to start making changes. You have to know what's going on in the database. The data system will prevent you from shooting yourself in the foot and making changes to these constraints. All right, so now we've got to talk about how we actually want to enter into the database, right? So I mean, we've set up this space. We've defined what our database and relations are going to look like. We want to talk about how we actually want to run So at a high level, there's two basic approaches to do this. There's the procedural and non-procedural languages. So particularly, the exact one that the data system does, whatever it is that you want to do, the non-procedural one is where you would say, here's the answer I want you to compute. I don't know how to compute it. You figure it out. So relational algebra and the SQL or relational calculus is going to be non-procedural, declarative. We're not going to cover relational calculus because you don't know it. I've never used relational calculus, have you? Never. Only the theory papers. Only the theory papers, which we're not doing. We don't care about those, right? But it is how it's going to work. If your baby is putting a query optimizer in a relational calculus, but in the real world, but relational algebra is not something that you don't really try to do in the program. But this will be what you want to do to run queries and build an execution engine and all the things we want to run. We want to cover the relational algebra at a high level, and then we'll go over through the correct course of SQL. It's the bottom variance of it. So the relational algebra in 1969, you also define the algebra that you would use to interact with the database system. And it's the way to extract data from it and compute answers that you want. So in the original paper, he defined seven fundamental operators. It's been additional operations you need in modern applications, like this doesn't have sorting. And obviously, in the applications you need sorting, so that came later. But all this is going to be based on unordered lists of records or unordered lists of tuples, but without any duplicates. Now, what's going to be slightly confusing is that in SQL, by default, it's going to try to use bag algebra, which is basically an unordered set that does have duplicates, or sometimes it's called a multi-set. And the reason why SQL wants to use bags instead of sets because it's actually, you can be more efficient if you don't care about removing duplicates later on. But in the algebra that we care about here, we assume we're going to get rid of duplicates. So the basic idea of all these operators is that you're going to take in some relation as your input, or one or more relations as your inputs, and then you're going to do some operation on them and then produce a new relation as the output. And you can start changing all these operators together to do more complex things. So we'll go through each of these one by one. Animations aren't working. I don't know why, but that's OK. They're just showing everything all at once. The first thing is to select, but I think the original paper calls that restrict, the idea is the same thing, right? But you're trying to restrict the tuples of some predicate. And the idea in your select operator, you're going to define first-order predicate logic to specify what tuples will match whatever the select operator is trying to do. It's a Boolean logic, like it's something equals something or something less than something, but you can have conjunctions, ands, or disjunctions, or predicate down below on whatever your target table is. Next operator is projection. I'm not good at Greek symbols. For this one, it's easier to remember, because it's a low pi symbol. And the idea here is as your input and specify what you want to be in the output of the operator. So you can rearrange the ordering of the attributes. You can remove attributes you don't want. And then you can manipulate the values of the attributes to generate new derived values. Again, but it's only operating on whatever that's given to subtract by 100. And now we talk to operators that take multiple relations as their input. So in the union operator, it's a binary operator, so you can get union operation from set theory. Generic operation that contains all the tuples that either appear on the one relations or both the relations. So here, I take union calls. So union will get, in SQL, will get rid of the duplicates. If you want to keep the duplicates and make it run faster, you would add it's union all, A-L-L. We don't have to worry about that. Just like union, we also have intersection, same thing. We can take the take two two relations. And the difference is to take all the two implementations and spit out all the ones that appear in the first one, but not the second one. And in SQL, you would use this with the set theory. This is basic set theory. This is not anything that should be surprising to anyone. So any questions before we move on to interesting things? Yeah, perfect. Yeah, so his question is, thank you for bringing this up. The question is, in all these examples so far, do the two input relations have to have the exact same schema? Yes. So you see, so the next, so if we start looking at how to combine together the relations and actually potentially look at what's actually in the values of them. So the first one we're going to do is the product operator, or in SQL, it's called the cross line. The idea here is that you're basically going to concatenate the first relation and the actual tuple in both. And it's for all unique combinations of tuples from one relation and the other. So you sort of think this is like two forlips, right? Where there's forlips spinning over R and every single in tuple in R, you're going to concatenate it with every single tuple in S. And you put everything in R. And I think I guess why this would actually be useful, why we actually want this. For example, I can think of testing, right? If you're trying to find all unique combinations of two inputs to test some piece of software or something, then you could use this. But beyond that, as far as I know, this is not widely used at all. And oftentimes, this shows up because you make a mistake. Like, you forgot a wear clause and you get this by accident. Or one of the tuples is what? Enum table. What do you mean, sorry? Oh, yeah. Yeah, so the same is like an enum table where they're trying to, like, but that's like for inputs, right? They're trying to get all possible enums as input to something. All right. The one which is super useful is to do joints. And the idea here is that when you're going to generate the combinations of the two input relations, we'll assume two for net, you know, it's a binary operator. There are multi-way joints where you can take multiple inputs, multiple relations. We can ignore that for now. You want to generate a new output that's going to contain all the tuples that match on the overlapping attributes in the first position and the overlapping attributes with the second relation, right? Different than the intersection. Because the intersection has to have the exact same. In this case here, I don't have to find the attributes where I do overlap. I'm going to check that. And so in the original relation algebra here, you would actually, you don't show the duplicate columns, like the AID from R and AID from S. So there's a bunch of different ways to write this in SQL. There's a natural join operator keyword in SQL. Or if you want to write it explicitly like this, that's the way to do it. It'll make your application brittle. The question is, does the order in R and S matter in this case? Why would it matter? Again, at this point. A SQL natural join, what if it's, like, two to three, would it play out both? Both of them, yes. So the join can be implemented with a projection across the side. Like, what's here? So the question is projection. So these are the ones that are shared with the fundamental ones that Ted defined. In the late 1970s, they got extended with a bunch of additional things. And building real applications, you need more than the original relational model, the original relational algebra. Renaming is like a projection. Assigning values, sorry, assigning attributes to important duplicate elimination, distinct clauses that we used to teach it. We don't teach it anymore because you never find it. It's basically a way, division's a way to find the tuple-run relation that matches all tuples in another relation. It doesn't appear in the real world. It's not common. We don't have to worry about it. The bottom line, again, the core of every relational database system is to be built on this algebra. And then we can define around these sort of circles that allow us to support very expressive and very complex SQL queries. Yes? So actually, that's what we need to do. Because you can find duplicate elimination on things with the primary key. So the relation is still sort of a piece of software that the system to actually execute it. Very specific between one order versus another. So say I do a query where there's the SB to 1 billion tuples together, 1 billion tuples together right away. So in relation to algebra, you're so specifying basically how you want the data system to execute things. So a better approach, and this is the motivation for SQL, is that we want to tell the database system at a high level. Here's the answer we want. We don't know how. We don't care how you're actually going to follow us. But this is what we want you to do for us. So we're specifying the exact algebra steps I want the computer to execute. Hey, retrieve the tuples from R&S and find me all the things where BID is. And then now the data system can take into account how the data is being stored, what kind of hardware, where things are actually physically located, what the CPU can support, what kind of CPU do I have? Do I have a GPU? Do I have an FGF? Do I have other things? All of that can be in its decision for how to execute the query, one way versus the other. And now, again, think of like, if I'm a developer, I can write a bunch of code in my laptop, write some SQL on my little test database. And then the same SQL statement will then work exactly the same, or it will so produce the same correct result when I deploy in production on a giant machine that has a lot of memory. And I don't have to change my application code. So that's the beauty of SQL. That's the beauty of why you're going to want a declarative language. But the core concept of how that declarative language is going to work is we predicated on relational algebra. So SQL is going to be the factor of standard, as we discussed. Click before we go. I want to talk about other data models. So I think I'm going to end this again. And then because vector databases is the hot thing on Hacker News, it's worth discussing what they actually are. So you guys understand why it's all bullshit. Let me see it in the real world. OK. So the document data model is old. So MongoDB came out in 2008 and said, oh, there's groundbreaking. We're storing things as JSON. But the ideas go back to the 1980s, early 1990s. They were object data. Object-oriented programming was the hot thing. So people said, rather than storing things as relations, we want to store things as objects. And then they had these specialized programming languages that knew how to write code in your objects, like in a database system. But those objects are the same thing as JSON. They're the same thing as XML. XML databases were the hot thing in the late 90s, 2000s. So at a high level, all these things are equivalent. So the data structure, we have these named fields. And the values of the named field can be an array, a scalar type, and so forth. And again, all the modern systems use JSON. So now the reason why these document database systems exist is this problem that comes up called the relation object impedance mismatch. That's the problem of, if I break my data up into these different relations, but I write my application code in like Python or whatever in objects, now when I want to go retrieve data from the database, a bunch of joins stitch together the form that I want to operate it on my application, but the database wants to separate or separate. And so I would argue, yeah, for some things, that makes sense. But for other things, it's actually a bad idea, because now you're going to have a bunch of duplicate data. You have all the problems you saw before, where I've got to make sure that if I change something that's duplicated in a file, or in my database, I make the changes all over the place. So we go back to our example before of the music. So the hot thing is, if I have an object that wants to get for a given artist, here's all the albums that they appear on. Now, you can do this in a single SQL statement, but it's not perfect, but there's ways to get on it. So the document data people will say, you don't need this guy here, instead of this. And if your application code is the same problem as before, like if my artist appearing on the same album, I could have duplicate entries in all their JSON documents. And if I make updates, I'm going to make sure I change all of them. And so natively, the interesting inflection point in the database marketplace, where almost all the JSON databases that said, we don't want to use SQL. We don't want to use a relational data model. A lot of them have basically converged and become more relational-like, and you can write SQL on JSON. And so over time, what we're seeing is that the intellectual difference between document data and SQL databases and relational databases has shrunk. And now, basically, they're all relational databases. Mongo was the last holdout. Mongo added support for SQL in 2021. But for years, because we know the founders are like, oh, we're never going to support SQL. They do. And then they also support JSON. So I would like to go to a document database that is contorting itself, look like a relational database, when you just go to the database. Postgres. All right. So they're in the news because, obviously, chat, if you use the hot thing. And so the way to think about this now is when pure building web allocations 10, 15 years ago, everything was always JSON. Therefore, I need a JSON database. And that's how the new SQL systems got started. Now it's like, OK, well, I have all these vectors I'm going to get back from my transformer and chat to PT or whatever I'm using. I need to store that in a database that can natively store vectors. So I can do that. And I would argue that these systems, they're limited in their functionality. And over time, they're basically going to have to morph to a relational database. So the vector data model, vector is just a ray. And in this approach here with FIGI systems, it's a one-dimensional ray of floating point numbers. And these systems are going to be a neighbor search, either exact or approximate, to allow you to semantic search over your data. So in all my predicates I showed before, it's like AID equals 2, AID equals 1 or 2, whatever the exact matches. But these vectors allow you to embed somehow magically through transformers a deeper meaning of what your data actually looks like. So you can ask higher-level questions like, hey, show me things related to this. You know, instead of doing exact matches for the keyword CMU, I can learn things like, oh, show me something to talk about in university in Pittsburgh. That was fan of my old CMU. Like, you can learn things about that are implicit in the data by then doing exact lookups. Yes? What's the difference between graph databases and vector databases? Let's take that one offline. That's different. But vectors are literally just one dimensional arrays. Graph databases are storing the relationship between objects, like edges and nodes and things like that. So if you're looking one degree from the current node, it sounds like it's similar. See, even if you're looking at one degree from the current node, it sounds like it's similar. But like the, how do I say this? In a graph database, you're explicitly storing that structure and you're traversing it to find things that you're looking for. This is like, I'm encoding it as a vector, and I don't know what the vector is actually representing. So Pinecones 5.1, a bunch of them gave a talk with us two years ago in two weeks. So if you want to learn more about this, check it out. At their core, all these vector databases are just going to be an index, a lot of you do nearest neighbor search. So it looks like this. Interpret these things, right? It's just a lot of numbers. If the need for them to figure it out, we don't know what it is as my rank-coding list. That's the core at a super high level of what a vector database is doing. Yes, this question, is this similar to relational calculus? What is? Sorry, we're a part of this. Wait, so for me, I'm just trying to give you an example. The thing I care about, so I would argue is the core of these vector is just this. No, you can do a bunch of relational databases. They all added this. It has PGVector, Simulastore, Clickhouse, beyond all of this. Now, I just want to explain this to you to see what I want. So databases are ubiquitous. Relational algebra is going to find the primitives of the process of the relational database system. And then relational algebra will be the core fundamental of these interactions. And then that will define how we want to build the component of our system to run queries. So Wednesday's class will be on C4. Project 0 is out. Please start it now. It's due the 11th. And then homework 1 will be out. Yeah, hit it. For I blow your face back. I got a block on taps. The facts can't trace that. Style is like tamper proof. You can't lace that. The Dominican. Oh, you could call me Dominican. Black Skelly black leather black suede. So I'm trying to skate. And that's your first mistake. I just had me wait and ran through every step.