 Good to be back, everybody is very supportive and appreciative, whoever gave me the mixtape thank you, I'll listen to this later. So I'll give a round of applause for a DJ, DJ Mushu. So Mushu is the endless DJ in Pittsburgh, so I want you to show us what you can do, let her in. Alright. So. You want me to stop? Yes, stop. Yeah, yeah, we're good. I need to relax. Alright. So Mushu is the hottest on-site artist in Pittsburgh, so we're really appreciative of him being here. How are things, man? I mean, I'm good. I just heard that you dropped some dang beats, but I'm surprised you're still in Pittsburgh. Yes. So, yes, it is true. I did have a rough summer. Alright. So I haven't talked in a while. My ex-wife, back a year ago, still healing up, so I have to wear a mask, but I'm back. So thank you again. Mushu, thank you for being here. Okay. So, let's do this. Alright. So before we get started, I first want to talk about, sorry. Alright. So before we get started, I want to first of all thank Snowflake for sponsoring the class. They're paying for extra TAs, of course, development throughout the entire semester. If you've never heard of Snowflake before, they are a distributed cloud data vectorized data warehouse. And if everything I said there doesn't make sense, that's okay, because you'll learn all these things throughout the semester. I do. Alright. And so they will be giving a guest lecture at the end of the semester. They will also have intro charts available for senior students, and I'll put all this on Piazza and how to apply and help them out. And then they're coming to campus for the career fair, I think in two weeks. So we can make arrangements for students of this class to go talk to them if you would like. The other thing I want to bring up too about the course throughout the entire semester is that you need to understand where I'm coming from as I teach databases. And the most important thing you have to understand is I really only care about two things in my entire life. Alright. The first one is my wife and my biological daughter. And the second one is databases. So I really don't have any of the hobbies. It is just pathologically focused on databases. And so if you ask me any question about databases, I might be able to answer it. Some questions will be appropriate for the class and some we do after class, but I'm happy anytime to talk about databases. Okay. That's the most important thing about the entire semester. Alright. So for this lecture today, I'm going to focus quickly on course logistics. Everything is available on the syllabus website. I'm not going to spend too much time going over all the details of things, but we can cover these later. I want to focus more about the jumping right into the course materials. But the main thing I want to bring up first is about the wait list. So if you've probably noticed the wait list is massive this semester. I think we got to like 450. We're like in third or fourth place in terms of like the longest wait list in the university. It's some ML class in English and then I think us. So I don't control the wait list. We can only take 120 students. The wait list got so large that the SCS advocates took over it and I can't control who's actually admitted anymore. So the advocates will put people off the wait list and roll them as new spots come available. Just to be honest though, at this point, if you're not enrolled in the class, you're very unlikely to get in. Again, I apologize if nothing I can do. The good news is that Cuddy C. Charlie in the back will be teaching the data's class in the spring semester. So if you can't get in this semester, you have an opportunity to take it in the spring. Okay. Alright, cool. The other thing I want to talk about is the lecture rules. So I get very excited when I talk about databases and I end up start talking really fast. So if I'm speaking too fast, please interrupt me and tell me to slow down and repeat myself or be able to question as we're going along about the material. Please stop me and say this doesn't make sense. Can you repeat it if we can clarify certain things? Okay. I don't want you guys to interrupt me. If you have the questions like can you go to the bathroom? Yes, we're all adults. You can do that. And also we're not going to talk about blockchain this entire semester. What? We're not talking about blockchain. Why would I talk about blockchain? Because it's pretty relevant I think today. We can cover that later. Now we're no blockchain questions. Okay. And the other thing I do is I don't want you to ask, what I'm not going to allow you to do is at the end of the lecture, everybody run down to the podium and ask me the same questions about like slide forth, slide five. So I want you to interrupt me as we go along because if you have questions while we're talking, other people probably have similar questions. So please like stop me and we can go over this. Okay. So I'm not going to answer any questions about collection material immediately after the class. Okay. All right. So this class is talking about the design implementation of database management systems. In my opinion that's the most important class of software that exists in the world today. And that's why there's an entire class specifically about it. So this course would not be how to use a database or design an application to use a database. That is taught in Heinz College. You'll get exposure to SQL in the first assignment for the first homework. But beyond that, we're not going to spend a lot of time talking about how we can do certain things at the application level. Right. This is really about how to build a system on the inside to be able to execute queries and store data. Okay. Right. All the projects will be on the system we've been working on in fact, part of email called bust up. So this is an academic system that we've written in C plus plus. And all the projects will be entirely one repository. So this is why we encourage you to get started project zero as soon as possible to set up your Dev environment. Make sure you understand C plus enough in order to get through the class because the projects are going to be cumulative, meaning you have the, you know, the second project we built on what you built in the first one or use what you got in the first one. So if you screw up the first project and have trouble throughout the semester, so I would encourage you to start the projects as soon as possible. Okay. And then for late days were like total for late days about the entire semester. Obviously, if there's extending waiting circumstances or you're sick, you know, send me an email, don't email the TAs and we can talk through it. But everybody allowed for late days for the projects. Okay. All right. So all this material is on the course website. All the discussion and analysis about the course will be on Piazza. All your homeworks and projects you'll submit through Gradescope and then the final grades we posted on Canvas. Canvas has a bunch of other features like discussion boards and assignments and schedules. Ignore all that. Use the course website and then use Piazza. Okay. Right. For people that are watching this video now later on that aren't enrolled in this class at CMU. If you want to take all the assignments, everything will be available in Gradescope. So use that Gradescope code. We have given up discussions. There's a Discord channel for people outside of CMU you can use. If you're a CMU student, don't go there. Don't use these things. Go to Piazza because the TAs will be there to help you. Right? The TAs are going to be monitoring these things. So in exchange for people that are watching this for us giving all our material away, putting it public on the internet, I'm asking for someone here to finish my Wikipedia article. I didn't write it. Somebody wrote it a year ago and they put this line in here that I was born in the streets of Baltimore. So they got this from some interview I did with a few years ago. So then somebody put that in there and then they flagged it saying it wasn't professional. So if someone wants to come and fix this, my birthday is May 20, 1981. They also flagged that one. But anyway, so not you guys here outside of CMU. Please fix this. All right. So the first point for you guys here. I don't have to say this, but I have to. I shouldn't have to say this, but I'm going to say it is that do not plagiarize on your homework and do not plagiarize on your projects. These are not group assignments. You can discuss the code to get with each other, discuss the high level ideas, but you should not be sharing code with each other. Gradescope has now built in plagiarism detector. We will use it. Furthermore, because all of the projects will be public, there's people, random people on the internet that also implement these things on GitHub. So all we do is do a keyword search to find 445 or bust on. We download all that, make a fake student, put that in Gradescope and then see whether you're copying stuff from out on the internet. Please don't copy anything. Please don't copy any source code. Please don't copy from each other. If you have any questions like, hey, there's this clock replacement algorithm I saw in BoosterSTL, can I use that as inspiration for my work? If you're not sure, please email me because I'd rather talk about it and figure things out now rather than you do something stupid and then I got to go to Warner Hall and let you up. Okay? So there's the senior policy of academic integrity. So by having this now more moralized in the video, if you do screw up and you do plagiarize, I just show Warner Hall this video where you solve this, whether or not you pay attention or not, but I settled totally in the class and then this is proof that you did something you shouldn't have done. Okay? All right. Beyond this class, if you're interested in databases as much as I am, we're also having a seminar series starting on the September 12th where we invite people outside of CMU to come talk about their database systems. It's a combination of researchers with a lot of people from industry. So everything will be on Zoom and then we'll publish on YouTube afterwards. And so this is the lineup we're having this semester. And again, this is optional. This is beyond what we're discussing in the class. But if you want to learn more about how real systems are actually implemented based on the techniques and methods and fundamentals that we're talking about this semester, this will be an opportunity to learn more there. So snuffling, of course, is coming to find interesting. Like Gaia is a database built for Thomas robots. So I find that kind of cool. With SplinterDB, that's actually being written by somebody here in Tepper who's already has a PhD. He's at VMware and he's building the database system. So again, this stuff is super interesting, at least to me. And I encourage you to come check it out. Okay? So any questions about course logistics before we get into the material? All right. Let's fire away. Our databases, the second most important thing of my life. So let's play a game here. Can anybody name a database for me? He says MySQL. Anybody else? Yes, the back. MabuDB. Postgres. Radis. Radis. Neo4j. Neo4j. Yeah. All right. Anybody else? SQLite. SQLite. Okay. So this is a bit pedantic here. But everything they all just named, those are database systems. Right? That's the software that's going to manage what a database is. So a database as, so I'll define it is a, going to be a collection of data that's somehow interrelated together that's meant to model some aspect of the real world. Like if I have a university, they have a registered database, keeps track of all the students, what class they've rolled in. Right? The database is the data that represents the students in the courses that they take. The database system is the software that's going to manage that, that database. I think CMU uses Oracle. We're actually switching to Snowflake as well, but it's, when you register for classes, I'm pretty sure it goes through Oracle. And so database is, it could be the core component that you're going to counter throughout your entire career, whether or not you're staying in computer science, whether or not you're actually a developer, no matter where you go in the world, the end of the day there's going to be some kind of database. It could just be an Excel spreadsheet, it could be something that's multiple petabytes and really, really big. But this can be the most important thing, again, I think you're not going to counter in your career. And what this course is going to teach you is to understand what the database system software is doing when you store data, when you write queries, write queries on it. Right? So that's what we're really trying to understand here. So, to give a quick example and walk through what a database looks like and sort of build a straw van system, we could use the store, store it and do stuff with it. And then that segue into why sort of this approach is a bad idea and why we want to talk about all the things we're talking about this semester. So let's say we want to create a simple database that's going to model a digital music store. So iTunes, Spotify, Bandcamp, whatever you want to use. And we have a really simple model where we just have a table or a file that represents artists, and then a collection of data represents the albums that they produce. So the only thing we really need to store in this really simple database is just going to be some information about the artists and information about the albums. So a really simple system we can build for this would just be to store the artists and albums information in two separate files as comma separated values, so CSVs. So artists.csv and albums.csv. And every single line in that file is going to represent some entity in this collection of data. So we'd have one line for every artist and one line for every album. So now in our application code, every single time we want to do a query and look something up, we just have to open the file, parse it line by line, and try to find the data that we're looking for. So we have a really simple example like this. Again, the artist has named your country, and now it's his name, artist, and year. And so in our application code, if we now wanted to find and answer questions like what was the year that Jizo went solo, or his solo album came out, we would just write some simple Python code like this. We'd just go line by line, parse the CSV, look to see whether the first field equals Jizo. If yes, then print out the year after casting it into an entity. Right? This is technically, this is the database. This is not really a database system, but this is a way to interact with the database. Is this a good idea or a bad idea? Bad. Bad. Everybody said it was bad. Look at that. What's that? This clearing is linear time. We're not even there yet in terms of performance, but yes, he's absolutely right. It's linear time, meaning like, what if Jizo is the last line and I got to scan through the whole thing to find it, right? So my simple example, I have three rows, but again, always think in larger scales, what if I have a billion rows or a hundred billion rows? But I want to scan through every single line, try to find what I'm looking for. Yes. She said, what if there's multiple Jizo albums? We're looking at the artist's tables. We're looking at looking artists. You're getting into them. What if there's multiple Jizos? How do we know we have the right Jizo? What if Jizo is lowercase? What if Jizo is lowercase? Yes, that's another issue. Yes. Look at that. Yeah, to go out to Jiz, every time you want to do a query like this. He says, you have to go out to Jiz every single time you want to do a query like this. Not necessarily, but at large scale, yes. Why is it too large just for one example? Right. If the file's too large to store in memory, then at some point you already have to go to Jiz. That's a better way of saying what he said. Yes, that's another issue. But that's getting to performance things which are important, which we'll cover. I'm just more concerned about basic, like, is it safe or the right thing to do? Yes. Well, I was going to say that the columns aren't working. We won't be up to the same extent, which would be a bad situation. Right. So, she says that the columns aren't labeled. Yes. So, if I, my application code here, I'm literally far between zero, but I want the zero off that to be right. Right. So, you know, it's the, the metadata about what the data actually looks like is embedded in the application code. One more. Yes. Which is a really efficient presentation of the data. Right. So, he has to, if the parts have a single line every single time they're on this query, which is very inefficient. Yes. Right. So, in addition to being inefficient, parsing in your code might be extremely unsafe in the format of the data. It's often might be better that the management system already has some kind of formats and no parts are required. So, you couldn't inject evil data. His name is that part of the CSB could be dangerous because someone puts in a mouth one character and then somehow you can do you can inject quite frankly, like having to do something malicious. Yes. I haven't thought that's a new one. I haven't thought about that. I don't know whether the CSB libraries have problems like that. Problems. If you're just saying parse, who knows how to parse? Sure. Yes. Yes. Okay. So, I think we all, we all hit the, that's sort of the main points, but I'm going to go through a couple others here. Right. So, sort of related to what she said before about what if there's multiple jizzes. So, how do we ensure that the, the jizz has multiple albums and in our albums table, we're just storing the name jizz. How do we know it's the right jizz? Right. Or what if it's one's lower case and one's uppercase? How do we make sure all that's consistent? What happens if someone goes and overwrites the year in our file with a random string, like an email address? Like if these are just files on just, I can open up BI and write whatever I want to it. And now my application codes have expected integer and it's going to see an email address and it's going to freak out. And then more importantly, what happens in the case where I have multiple albums or sorry, multiple artists in my album. Right. The way I sort of design my database right here, I can't easily do that. Right. Because there's sort of one artist, one album has one artist. And then what happens now if I delete the artist, how do I make sure I also delete all the albums? Right. These are two separate files. The files that doesn't know anything that they're connected, how they're connected. So, if I delete jizz, how do I make sure I delete all his albums? I have to go right application code to go do that. Right. So now, in terms of implementation, how do you find a particular file or a particular record? We talked about doing linear scan. Right. This is the example here again. It's just looking line by line, trying to find what you're looking for. And when you're done, you bounce out. But now what happens if I have a application that wants to use the same database? Right. Now say this is written in Python code but I have a new application. I want to use the same files, but I'm going to write in Rust. Now in my Rust code, I got to write the basically the same logic to how to parse the file and jump to what offset to find the data that I'm looking for. That sucks. Then what if I have to thread at the same time but also want to write to the file? Concurrently. Now I can do file system locking to take care of that. But again, if I have a billion records in my file, do I want to take a lock on that tire file just to update one thing? Right. Bust me more efficiently, I can have more fine-grained locking. And the last one that nobody actually brought up is durability. So how to make sure that the data is safe if there's a problem, if there's a crash or something. So if I'm updating a record in this file and then my program crashes, what should happen? Should I come back and see the original version? Should I see the completed version or should I see a partial version? What do I expect to see and what would be correct in terms of the database semantics? And the one that nobody brought up comes up sometimes in different years. This database is too big to store on a single machine and now I want to either replicate it for high availability or I must split across multiple databases and multiple machines so I can run things in more parallel. How do I make sure all these things are consistent? So we've covered a lot of things pretty quickly but why my toy example here is problematic. And this is essentially the motivation of why you want to use a database management system. So a state management system is going to be a piece of software that's going to expose an API to your application code that allows you to store, analyze, and manipulate data in a database. The reason why we don't want to have to manage this sort of big reason why we don't want to manage this database in our application code is that it's a huge waste of time. If you're going off to like a startup and trying to build a new web app or a new software as a service do you really want to be spending your time making sure your files are all the data in your files are stored safely and correctly, right? No one's going to care that oh if you crash if your program crashes you come back all your data is safe. Like that's not a that's not a distinguishing feature to sell your product versus another product. That's sort of assumed it's table stakes assumed that you're not going to lose data. So why are you going to spend time in your application writing you're reinventing the wheel to make sure your files are safe we just rely on a database management system to do it for you, right? Or widely tested and deployed you know they're going to do much better job doing all the things we talked about here than you you know some random JavaScript program are banging this out for the first time, right? So you always want to be using almost always want to be using a database management system but you talk about the caveats where I don't think you want to and so a general purpose database system is going to allow you to define what the database looks like create the database put data in it query it update it do all administration to programmatically according to some data data model and so a data model is going to be the the data model is going to be a way we define that there are a high low abstraction for the concepts that would be storing in our database right so relational model is the key one we'll talk about the next slide but it's going to be way to say how you define what your data model looks like not necessarily maybe what the attributes are like is it integers is it floating port numbers but really what's the what's the what is the core entity in this database look like and the schema is going to be the way we we tell the computer or tell the database system what we want this database looks like according to this this data model so these are listed data models right we already listed the beginning except for Neo4j and MongoDB these be relational databases right this is what this course will be about because again in my opinion this is the most important one there's another category of systems to use these two different data models you might have heard the term new sequel typically when people say new sequel they typically use document or object model but you can also have a data model that defines the database and arrays matrices and vectors this is typically used machine learning and typically used also in like like satellite imagery or medical imagery and then there's a bunch of these guys here at the bottom that are what I'll call an old man old person data models these go back to like some of the things I advise you not to work there but I say that like a lot of the banks still use IMS which is IBM built in the 1960s so I mean these things still exist it's just no startup is saying I want to use a hierarchical data model or I want to use IMS right most most of the time you'll be using a relational data model right so for this class we're going to focus entirely on the relational data model but I'll briefly talk about the Docker data model and the key values will come out throughout the semester so with blockchain in web3 which I think is pretty relevant why is that up here your question is I'm just walking around to the spot you don't even have it on the data list your question is why is a blockchain listed here yes blockchain is so you would have a data model top of it but it's it's not a data model and I didn't even know I think it should be the other couple hold on I'm telling you like you know people are using it for web3 it's literally the future blockchain is arcade blockchain is the system you would have it's based on the distributed letters you store things in it the mechanism and the distributed story and the blockchain right blockchain is better than this and I'm I'm just I don't think you know what you're talking about to be honest I thought that literally it would be irrelevant it's not arcade we're not the blockchain it's not it's not it's not it's not it's not it's not it's not it's not it's not it's not it's not I don't know who that is hold on I don't know who that is no alright alright blockchain right so blockchain is basically a distributed there are there are blockchain databases sometimes you can sort of document them sometimes you can sort of sometimes they're key value stores but it's sort of an implementation of the system itself you still need something on top of it as I said to the guy they can interpret the bytes so you can have a relational blockchain database you can have a key value sort of blockchain database in my opinion my opinion the only good use for the blockchain in Bitcoin everything else is all a scam and a waste of time we can talk about it later on but like there is no reason why you need a visiting fault tolerant distributed ledger for most applications okay so we'll leave it at that we can cover this later alright so let's go back to the 1960s and talk about something early databases that were out there so the the back in the 1960s like computers were brand new obviously very big very expensive very slow but then people realized that they didn't want to write the applications like it's hard coded in the application how to interpret data and files they started building these general purpose systems so the early ones are the earliest one is IDS it's actually built I think by GE right Honeywood an IMF was built by IBM to manage the Apollo moon mission like to manage all the parts to build the rockets and these things are based on something called CODESIL but you've never heard of it that's fine back in the day the way these systems work there was a tight coupling between what the data must look like logically and the physical implementation of it or the physical manifestation of it on disk so what I mean by this is in the case of like IMS in the early versions you had to say ahead of time did you want to store a collection of data they weren't tables they think of it like a table do you want to store this as a hash table you want to store this as a tree structure and based on what choice you made then that exposed to you to do tree stuff this is hash stuff right so now the problem is if you change your mind later on I said well I actually want to do range so I don't want to use a hash table I want to use a tree not only did you have to dump the data back out and load it back in as a different data structure did you had to go change all your application code to now reflect the change of the API right so this was a huge problem but of course back then there was a problem at the problem so there's this guy who was a mathematician who just finished in Ph.D. University of Pennsylvania and he started working IBM research in New York and he saw all these like IBM database developers you're repeating a reinvented wheel over and over again by having to to re-implement their database code every single time there was a change like if they added a new column in the table it didn't work tables but I added a new column a new attribute I had to go so he saw all these people doing the same thing making the same changes over and over again and not to be cliche but he realized that there was a better way of doing this so he proposed the relational model in 1969 as a sort of mathematical abstraction to how you would represent a database and interact with the database to the first paper came out in the tech report in 1969 but then the follow up were pretty good the one that everybody senses this one from this article here in communications with the ECM in 1970 and so this paper was the things that he talked about in this seem obvious to us now but you got to see them back then this was very, very radical the same way the C compiler from the Unix guys that was the radical idea that you can take high level language and compile it down into machine code and that would be better than a humans ready assembly these are very controversial ideas at the time of course now the relational model persevered you've never heard of IMS or definitely never heard of IDS so all the ideas that all this the people were like hey this is stupid our way is better they all lost in 10 cod 1 so the relational model defines an abstraction layer for how we want to represent relations to avoid this maintenance overhead of the 10 cod we're seeing back in the day and there's sort of three key tenets of this the first is that we're going to store the database IE relations I'll explain what a relation is in a second and then the the next key idea is that the physical storage of the the database is left up to the implementation so no longer do you have to define I want to store my data as a tree I want to store my as a hash table or a column store or a row store you just say here's my relation here's my attributes and the database system can try to make the best decision of how it actually wants to store and again this was radical and again sort of procedural code you know COBOL FORTRAN or CCODE to make direct calls to the the database API instead you're going to use a high level language to tell the database system what you wanted to do for you and then the database system will figure out the best way to do that now you know this is a relational model purist you would say these are the key things we'll see this as soon as you find that nobody actually follows this exactly and some of the NoSQL systems borrow ideas from relational databases relational databases borrow ideas from NoSQL guys so this is not like set in stone but at a high level these are the key things so Ted Codd wanted the Touring Ward for this in 1991 and he died I think in early 2000s the other guy that invented the Codasil he wanted the Touring Ward in 2002 okay so relational models have three parts we're going to have a structure so this is a definition of what the database and their attributes look like we'll have the integrity constraints we can find what data is allowed to be stored in the database and then we'll have a manipulation API that allows us to have the definition of relations it's an unordered set that's going to retain relationship attributes that represent entities in the real world so term relations sometimes you think it means the relation between tables it really means the relation of the attributes and throughout the semester I'm going to use the word table and relation interchangeably and then within the relation you can have a tuple and that's going to be a manifestation of a bunch of attribute values that are collected together as representing an entity again I'll use the term tuple row record interchangeably they essentially mean the same thing for our purposes of this class so in the original relational model the values within a tuple had to be scalar or single float single date string whatever in modern systems this gets violated you can have arrays of integers as an attribute you can have json as a value but again the original relational model said you could do this and then our good friend Null is going to show up again and every attribute could potentially also be null for its value you can define constraints if you have any relation you just need a table so now the way we're going to identify unique tuples is through what is called the primary key so the primary key is going to be some set of attributes one or more that allow you to say here's how to identify an exact record or exact tuple that I want so in the case of the artist table I could use the name for the primary key so I know the problem with this this example here is that there could be other knock-offs groups other Wu-Tang clans and so in this example here using the name probably is not a good example of what to use for a primary key the name here is only going to Wu-Tang clan but there's other artists that have the same names so what we do is we can introduce an artificial we can introduce an artificial column like an ID field that will be some unique number some unique UUID that we assign every single record right so this case here we add this thing and this number here just represents the ID and that we make that the primary key and then you can see here the the underline and the name of the list of the columns the underline represents the primary key so a lot of data systems have ways to automatically generate these IDs they're called sequences and the SQL standard if you're using this it'll be a reoccurring theme throughout the entire semester there's the SQL standard but nobody actually follows that exactly Postgres and Oracle actually do a pretty good job MySQL is the worst offender so they like to do things differently for whatever reason they're getting better at it SQLite also does this as well but anyway so I'll try to make it clear like if I show SQL is this SQL compatible with Postgres and Oracle or is it MySQL primary key other systems if you don't find a primary key they'll make one for you and it's going to be hidden underneath the covers like a row ID for example so the next thing we also have are foreign keys and this will allow us to define how data from one relation is related to another relation like a map so let's say I have my artist and albums you see I introduce the artificial ID column I need to say you know what artist applies to what album right but in this example here for this mix tape here this sort of record 22 right there's multiple artists so I can't just put the artist ID in there because you know it won't work now you can't put an array we'll ignore that for now but I need a way to have this mapping back from a you know from one artist sort of one album to multiple artists I can create like a cross-reference table where now I have foreign keys from the like that so I have foreign keys from the artist ID to the artist table of the album ID to the album key so the foreign keys are going to allow us to define different cardinality relationships between different relations like one to one one to many many to many and so forth and we're just using these his question is is it better than an array so some databases would not let you will not support array integers there's that issue other systems I don't know whether you can say when you define this table so we go back here so we go back here define like this artist attribute some sets will not let you you have to declare that this is a foreign key to this table here and some of them won't let you do that right the other thing yeah so there's that and then the the correct way in the relational database form is to do a separate table like this I don't know the answer what I don't know whether it would be I don't know in what cases it would be better so one issue would definitely be if you want to ensure that one artist can only appear on an album exactly once once in a while then the array would you may not be able to force that in the array whereas this case here if I make the primary key for this crosslet stable the artist ID out of the pair at the same time twice so that's modest yeah do you have more in that sense yes to the back so what is it for like variable lengths or does it usually have to be so the question is do Davis doesn't support variable length attributes like a string the answer is yes and we'll cover that in two or three more lectures how we actually want to store this yes for simplicity we can ignore yes follow up to the question so how are you with this database if you're not using a race like with multiple artists under a single album it's so you have this cross-level stable right so you would have you define the separate table here and it has two artists ID and out might and you say the artist ID is a foreign key to the artist ID field and then the the the other is a foreign the database then we have to go look say oh trying to assert out might 33 does that actually exist if no then it won't let you store it so this ensures you can't have an album can't have an art ID out might pair that points to nothing other questions yes can you manually insert a row right yes now we can talk about next class like there's a class of software called where you write the object-oriented code and they'll generate the for you so in that case they will generate insert statements that populate it's not it's not a special right cool awesome right so we know how to define tables we know about the primary piece let's talk about actually how do you put data into the database and actually run queries right so the dml data manipulation language is the way we're going to interact with the database after we define our schema so there's two approaches we have a procedural method a high level code that defines exactly the strategy we want the data system to take in order to perform different operations so maybe not as low level as the python code I shared before but still we're defining exactly the steps we want the data system to take in a specific order for this we're going to base our the language to be based on sets of bags which I'll explain what that is what exactly algorithm we want to use or steps we want to take to produce some answer we just tell what the answer we want and it's up for the database to figure out how to actually go ahead and do that for us right so we're going to focus on the first method here through relational algebra and then for next class we'll talk about the declared approach again from the mathematical model which we'll cover now this declared approach is done through relational calculus I'm not going to teach that because 99% of you don't need it if you're actually working on the internal to a database system even then you probably still don't need relational calculus you all really need to work on the optimizer which is super hard we'll cover that later but for the second one you just think of this as sql sql is the the code of language that you use to interact with all right so let's go through relational algebra quickly and then this will end up being the building blocks for how we write our query execution engine to run queries to run sql statements right so the 10-card proposed seven fundamental operators listed here these are the fundamental methods or ways we interact with the database to manipulate tuples in a relation and so when you could have repeating values the truth is actually based on bags actually sets do not have repeats bags do so it is actually based on in real life we already use bags but in relational algebra we use sets and we'll cover this when we talk about sql and x-plus so every operator is going to take input one or two relations and the output is always going to be one relation the idea is together to produce some higher level the answer to the higher level question you have about the query okay so go through these one by one quickly so the first is the select operator so the idea here is that we want to take a to generate a subset of tuples from a given input relation and we're going to find this using some first-order predicate logic to specify what condition the tuples have to satisfy in order to be used in our output relation so in the original paper they've talked to this as the restrict operator but in the textbook it's going to refer to it as select and the idea here is we can have multiple predicates defined in our filtering operation to do more complex things and get the data we actually want so let's say we have a simple table here R has two attributes AID and BID so we could have a select operator that finds all the tuples and the output of this would be again the filter results I could combine multiple predicates together with conjunctions and disjunctions this is the boolean logic that you've taken before so I can say if I give all the tuples we're AID equals 2 and BID greater than 1 and 2 and it produces the output relation so the the analog in SQL is just the where clause in fact I find you can even use the term where and describe how it works in the SQL state you want your where clause exactly the same where AID equals 2 and BID so we're going to do that the next is projection so this one it's going to be to remove certain sorry question yes complex so you said yeah can you say where AID is let's say AID and BID with numbers AID is bigger than all possible the question is how complex is going to be the predicates your example is where AID is greater what are you trying to say can you do where AID is greater than all other AIDs in my relation all perfect let's say there are two number problems where AID is greater than BID so that was easy so yeah you can say so this is any kind of fully illogical here so you could say where AID is greater than BID doesn't matter is there a limit to how complex it is the question is there a limit to how complex it is it's a map you're in what now in a real system in a real system you can have these there's examples I've seen where people have in the real world single statements where the string itself the actual text the string not the results they produce the text itself is math like 10 megabytes they think about 10 megabyte text file in the where they have these giant instates so think of like it's something in an arrest so think of like a giant dashboard you fill the big based on a check box would be something else you could make these arbitrary complex as you want in a real system we'll talk about pre-optimized in a second and it's John who's figuring how to reduce it down to the most the most basic performance mission yes is the when the predicate's limited by it has to be through and through or false by a single tubal or can they be intratubal relationships or is is the predicate the predicate's here it has to be in the context of scope of a single tubal in my example here yes in next class we'll see no you can have you can have that's the queries you can have things like if you are an example find me the tubal word the AID is the greatest AID of all you can do things like calving or snowflake if you're qualified also counts this not just where so you don't have the causes and the causes are qualified causes for simplicity no but the answer is yes the way you would have to represent a model is you would do joints you would do like the the having calls basically the having calls is the same thing right the having is the filter to do a select and follow by another that's how the the sort of the comparing something in another relation the way you would do that relational model is you would do the first query as producer relation and then do a joint yes whatever the base we try to try to answer this is the query just in relation to this particular table or can you question this the if not a clearly this is a formula an equation the question is this equation limited to a single relation answers no we'll see joints in a second question over yes yeah on the overview slide we said there's procedural and declarative yes SQL is like a declarative yes query language but now we discuss the procedural languages and the other slide so David is I said that there's procedural and then my examples here I'm also show SQL I'm trying to teach you guys the relation algebra which is the the procedural approach I'm throwing in SQL because I assume everyone knows SQL just for you guys to map the relation algebra to the SQL states that's all I'm trying to do here yes thank you for clarifying okay let's keep going alright so projection this allows us to remove certain attributes that we don't want from our relation as in our output and we actually can reorder them and we can actually manipulate them again this will be in the context of the the import relation looking at a single tube at a time for instance I have the same same example here to the attributes I can have a projection operator the low phase pi where I say BID is minus 100 and then in an AM so I flip the order of the attributes and I'm subtracting the value by 100 right and again the analog to SQL would be just what comes after the select operator right you do it you manipulate it any way you want yes so all column compute functions also follow this string you're padding space or whatever yes follow this category so the question is do all string functions all under this the projections everything but when there are aggregations everything but when there are aggregations yes we'll cover that first yes yes go back is that projection you need to select the the question is the projection sorted or unique in any way no again this we're trying to get that set to so it's on order there could be duplicate actually it's on order right cool right so now we start bringing in multiple relations together so the first is going to be a union operator this is just going to be the basic union operator from set theory right you're going to take the two relations and you're going to combine all the results and put it together now in the original relational model you have to have the two relations trying to union together to have the exact same attributes right so in this case here they both are going to be so if I take the union all down I just put the first relation at the top the second relation of the bottom again it's on order it's not guaranteed to be like that but for simplicity we just say it is right so in SQL there's a union all operator and this again is the secret in SQL they allow duplicates in relational algorithm do not so in SQL if you want to keep all the union you get it removes it sorry it removes two all right we also do intersection again this is just set theory so we can take the we can get a new relation taking two inputs where the output relation only has the tuples that appear the first one but not the second right and so in this case here is the just AIE with A3 and one or two and then there's an intersect in our keyword or operator we can do difference again so this is taking all the tuples that appear the first relation but not in the second one yes sorry this is my name was the relation one the all or the not so the question is oh back here the question is in a relational model the relational model is union all if you remove the all it'll remove two so that's the sorry if you remove the all if you get the duplicates if you get the union all then you get duplicates what are the other ones keeper or duplicates SQL or relation algorithm both relation algorithm duplicates okay SQL I think they'll remove them for something we'll take that we'll talk about that next class yes yes for the duplicates of the sets and the duplicates also lie in the intersection well they appear the question is if there's duplicates in the in the inputs will they also appear in the outputs I think in relation algorithm yes it intersects I don't know we can talk about the progress just fast see what happens okay we have difference again oh sorry yes your question is are duplicates allowed not allowed at the tuba level or the adjuvant level at the tuba level right so I can't have like you know A, A, 3, 1, 3 multiple times if you want if you don't want to allow duplicates at the adjuvant level you can either define them as a primary key or we'll see next class you can define them as you have a unique constraint say within one column one attribute there could be only one value of you know the relation model defined unique I don't think it defines unique constraints you can't do it in relation algorithm everything in the example with R everything with R is unique and everything with S is unique so there's no duplicates in either of them if you do the intersection will everything in the intersection appear exactly twice on the relation model not the same so if everything's unique but R everything's unique in S there's no matches at all no no like there's no duplicates in R or S but there are elements in R that appear in S do all of those elements in the intersection appear exactly once or exactly twice I think relation model will appear once okay but if one of them has duplicates who knows duplicates within in itself yes if R has duplicates but S has only unique things if multiple duplicates in R match with something in S do the multiple things from R appear or just the one thing from S I think the one thing appears okay both of them have duplicates in this correct we we can look at the text afterwards I think I think it's very aggressive to remove duplicates so we probably oh yes sorry yes so our question is for union intersection and and difference do they have to have the same attributes yes that is the again in SQL I think it's the same thing as well yes so we have the accept keyword does the same thing for this right so now we can start talking about the sent stuff you can do it it's not that common in other applications but now we start talking about joins and that those are very common so before we get the join we have to talk about the product operator so the product operator also called the partition product is basically going to generate a new output relation that has a all possible combinations of tuples on the first first relation and all possible tuples on the the second relation it's just sort of matching them together and producing a giant answer like this right and so in this case here we're adding this dot identifier in the attribute names to specify where the where the attribute came from is it from R or is it from this again just think of like two you know two nest or four loops where every single tuple in the first relation I'm going to mash it together with every single tuple in the second produce that as the output right so this seems kind of useless right why would you ever actually want to do this it does show up in testing and for other experimental analysis techniques where you don't you have a bunch of inputs you can try all possible combinations throw it at your application and see what breaks right so this this does fit well certainly it is the building block of how we're going to actually do joints which are probably the most useful operator here so in SQL there's a cross joint operator or man cross joint R and S alternatively if you just put R and S in there without a where clause without a one clause again the data is we'll just mash these guys again yes is that also the same as any other kind of relation it's always true yeah I think you put where true I think it's the same thing or like left joint on some let's call it left yeah that's a another issue okay other questions alright so the joint operator is allows now I'm going to match between two post two relations I say too you can't have any way joints or multiple way joints more than two relations we can ignore that for now let's just say we have two so now in our joint operator we're going to match together the tuples where the values are values are exactly the same so again the original relational model the two relations have the exact same schema as you're just looking to see you're matching the attributes based on the name there's you assume that those are the same again this is not defining with foreign keys or primary keys it's sort of a simplistic version of this right so you just look for for every tuple looking at AID, BID and match it up with AID, BID value in the second operation and you do a snap up like this right so now this always comes up like hey isn't just the same thing as intersection that's because this is removing the it's different because this is removing the the duplicate attributes so you can sort of think of this as an example I showed before we had all the columns from the the first relation matched together with the columns from the second relation but the original rational model doesn't specify this in SQL you actually can specify exactly what columns you want to get out so we can ignore that so in SQL there's a command using this because this is going to do what I said where they try to look to see 5e all the columns and have the exact same name and then do the joint based on that it's a bad idea because like now if you have the schemas don't match anymore it doesn't work you can also use the using clause where you just say try to match the two tables that might to do a join based on these individual columns typically wouldn't use this right yes sorry but in the intersection then like to do the same result it would be interesting see so this comes up again is the difference is that sorry let's put in intersection you have to have the exact same schema and join you don't oh because it matches the the attributes that have the same names right so if I had if I had an s I had a cid when I did the join it would ignore it because the first one didn't have cid so it's like a more flexible yeah it's like a more flexible intersection yes say it again the cid the question is would cid be an result actually I don't know it doesn't matter doesn't matter because if you don't need to do a projection afterwards and then remove it basically seven operators and for the 1970s you can get pretty far with these in modern systems that you need a lot more a bunch of things that we're not going to talk about obviously re-aiming columns is always useful we talk about aggregations we talk about sorting we obviously need sorting and a lot of applications division is less rare so there's a bunch of other stuff you would need but since now we have joins the names of the columns matter how do we specify the name or whatever the question is since the name of the columns matter in a relational model again this is met we're not actually implementing you wouldn't actually implement a relational algorithm engine exactly as we're expecting it right so the by one example I had BID-100 I could then rename that BID-100 back to BID and then I could do the joins whatever I want yes what does division do I was thinking about it it's it used to be in the old class one of the homeworks was like to do this I don't know offhand right now in the old class one of the homework assignments was like oh show that you can do division out of operators and like it was always a time you don't need it yeah other questions all right so this is now going to segue into what we'll talk about in the next class but the again the relational algorithm is still the high level steps of how we want the data systems to compute our query right the order of these operators actually matter and if you're actually building a system that would implement this you would sort of fall up from the inside of the parentheses and work your way out right but the performance can make a huge difference depending on how you do some of these operators you want to be able to reorder them so in my example here say I'm doing a join of R and S and I can either do join first followed by the projection or the filter do all the math the tuples of BID equals one or two or I could do the the filter of B or S or BID equals one or two then do the join right this could have pretty big performance differences right like always think in extremes if I have a billion tuples of R and a billion tuples of S but only one tuple where BID equals one or two do I want to do the one billion to one billion join or scan the table or do a look I find the one tuple that does that we're one or two equals or BID equals one or two then do the join right relational algorithm really doesn't give you that flexibility and you know if you implement exactly as it's defined so in a data system today they're going to be allowed to reorder these different operators to produce the most efficient or most competitionally efficient execution strategy for your different queries right and these these operators are proposed but more these are these are equivalent but the order can make a big difference so yes question first and then the join or would you leave up to the database so this question is if you were to write this in SQL would you still want to specify to do the join first followed by the filter you I was thinking I was asking you can't do that you shouldn't do that in SQL the there is a way with nested queries you can do exactly as you're proposing a data system has a good optimizer could recognize oh I know what you're trying to do and reorder rewrite it to put it back to the way it should be so you prefer to have the database like optimizer always yes in 99% of the cases you want the data system to reorder things before you do it now there are cases where things get wrong and I can talk about those it depends on the sophistication of the query optimizer depends on the last time you clutch a statistic depends on what your data looks like here are the different caveats but in general you want to just write the SQL the way that you and the human thing is the right way to write it see what the query optimizer does with it and then you can go back and tweak it if necessary and in a way tweak it could be rewriting the SQL itself some systems allow you provide hints and say hey you're doing this these operators in the wrong order do it in this order or like you're doing a switch or scan pick this index and set there's a bunch of ways to force the data just to do the right thing for you that's not in the SQL standard standard. Every data set does something different. Prospects famously doesn't support hints, although there's exceptions to do this. In SQL Server, you can dump out the query plan as an XML, change it in hand, and load it back in. There's all sorts of things you can do. Yes? I thought since SQL is like a non-declared language, it's really up to the database management system to figure out how to see the end of the process. You said SQL is a non-declared? It is declared. Yes. Declared means it's not procedural. Declared means this is the answer I want, figure out how to do it. His question was, could you write SQL in a certain way that you could specify the different borders? My answer is yes, but the data system has a query optimizer. The query optimizer can look at that and say, oh, I know what you're really trying to do and you're doing the wrong way. Here's a better way to run the query you want to run. In theory, you should always be able to do that. In practice, it's not always as hard. Yes? Which comes out first, relational algebra or SQL? This question is, what comes out first, relational algebra or SQL? Relational algebra was defined in 1969 by Codd, but he's a mathematician. He did not propose a programming language for his original paper. In 1972, Codd then came out with his own programming language called Alpha. No one ever used it. I had built a query language, a relational model called Square. No one could use that because you couldn't type it vertically. No one types like that. And then they came up with SQL. And then the guy that also, Professor Berkley, my street breaker, was also my PC advisor, the guy that did Postgres and a bunch of things, he came up with a zone language called Quell. It's basically Codd defined here's the programming language. Everyone came up their own. And then the IBM guys, after they threw away Square, they came up with SQL, which is supposed to be a play on words of Quell. So it's the SQL to Square and SQL to Quell. But then IBM got sued because somebody else had to use their name SQL, and that's why I got renamed SQL. We'll come into the next class. Sorry, I get very excited. Somebody slow it down. There's a lot of things I want to discuss. Okay. So again, we still want to be able to say in a high level way how we would be able to compute this. This is what a declarative query language would look like, right? Retributing the relation words if I joined it with R and S, where D, I, D equals one or two. That's written in English and natural language, but we can define this in SQL. And the data that could interpret this figure out was the best way to execute this. So this will be the segue for what comes in the next class. The relational model is independent of any actual query language implementation. Of all tests and purposes, SQL is the factor of standard. Every five years, every so often, somebody shows up in Hacker News and says, I have a better version of SQL, and it never works. It never fails, right? SQL was here before you're born. SQL will be here when you die, okay? Yeah. Even though IBM invented it in the 1970s, it's not a dead language, right? There's new versions. It gets extended all the time. As new ideas come along, JSON could come along, and now you can do JSON and SQL. That's in the SQL standard. So it's a depth and a bulk over time, and it looks a lot different than it did back in the 1970s. Now, the challenge is going to be, you say you know SQL, but there's the SQL standard, but nobody follows it, right? I said, my SQL is a big offender. But every single database system has their own proprietary extensions that deviate from the SQL standard. Then they sit on the standard's body, and they go, say, hey, I have to do functions that my system has. Let's get to the SQL standard. And then Oracle says, yeah, they've got something different. It looks slightly different. So then they bicker over what should the actual language say, and then they end up with something in the standard that nobody actually ends up using, right? So basic select, insert and update, delete, selects, all that to the standard. But it's when you start doing the more sophisticated things or more complex things, that's what nobody follows it. Again, it's my opinion. Postgres and Oracle probably are the best two, the ones that follow this standard are the best. Right. So again, just using our example at the very beginning, instead of writing this procedural language like this, we just say, hey, select start from artists where name equals jizz. And the days that we'll figure out how to actually to figure out the best way to run this square. Right. Because you know, somebody brought up before, if it's in a giant file and you do a linear scan, right, that's the suck of a billion records. So I can go to index. We'll talk about what indexes are in two of the classes. But you know, nowhere am I, in my query, did I say go use this index to find my thing that I want in log n time. Right. I don't have to know my indexes in my query language. The database system will figure that out for me. Question, yes. So basically the second, all the database will actually do that. So the question is, well, a modern day, excuse me, if you have a D plus 3 index, right, but like just see what does it know anything about indexes. Right. I said, this is the answer that I want. And the days of the side, do I want to use an index or not? And you can have a hash table index, right, that could be a one, right. I don't, but if I I write this query, my application, I have two records at the very beginning. So I don't need an index, but if I go to a billion of records, and then I go now, create an index to state up the query, I don't have to go change my applications, because they're independent feature. That's the genius idea. It seems obvious now, but back then that was a big deal. Okay. So if we have like 10 minutes quickly, and I just want to show you, class ends when? Four minutes. Four minutes. All right. Let's bend quickly through just talking about the Docker data model. Just so you guys see this once, we're not really going to cover too much the rest of the semester. So the Docker data model, sometimes called the object model, XML databases, JSON databases, all of these are the same thing, right. And these are the bunch of sort of the leading ones. Mongo is probably the most famous one of all these, but it's not the first object databases go back to like the 1980s. All right. So basically the idea here is that instead of having separate relations, the way we have a simulation model, they're going to try to match things together and embed objects or embed data within each other, right. So before, if I had this, the three tables or three relations, artist, artist, album, album, if I want to say, find me all, find me all the albums that this, this artist is on, I got to do a three way joint between these tables, right. So the Docker data model guy say, Oh, that sucks. That's slow. That's not how you write your application code. What you really want to do is just to find your application code based on objects, like an object or the programming. And then I'm going to store the data for a single record in a giant JSON file or YAML or XML. Okay. They're all, they're all quit, right. And so what you're basically doing is you're embedding the, the, the, the, the two relations you're sort of embedding the, the, the pentapulation or the album is the child of artists. You're then that side, side of this. So now if I say, go give me all Jin's albums, I go do a lookup to find the record where me was just up and it was part of the, the, the paper to get back my JSON document. I'm going to have all my albums, right. So now the new joins and says, go one fetch, go buy what I need. Is this a good idea or a bad idea? Give me what we talked about today. Yes. He says, what if you want to go out to artists? Absolutely. Yes. How would you do that? Jump it out, level back in. Yes. Do a database stem that's used to slow down, just abstract the thing and then do the query back in an efficient way or? This question is, do, do document databases abstract this away so you could just do the right thing? Not really. Like, like MongoDB is, you ever use MongoDB, the first five minutes are beautiful. Like you open up the terminal, you start sticking JSON in, the store is fantastic, right? It doesn't enforce schema, it doesn't enforce types, you can put whatever you want in it, right? So the application could abstract it, but yes, it doesn't. So, okay, from our relational models perspective, this is a bad idea, right? Because as he sort of said, like, what if I want to actually want to store, forget out where all the artists have to store it, right? Or we talked about how it happened to have multiple artists. That means now for JSON, for anybody else that's on his albums, I'm going to have repeated records with, with duplicated data. But then now let's say there's a typo when I put it in the album and I got to go change the album name. Now I'm going to make sure for every single artist that's on the album, I have to go update the, update the record, right? So I'm going to put this entirely, in my perspective, the relational model is the right way, almost always the right way to model your database. There are some cases where you do want to sort JSON, because it's just easier when you get up and running pretty quickly. But again, you can do that in the context of a relational data. You could have an attribute type be a JSON deal. Postgres, MySQL, and most of us will do this for you, okay? All right, so, wherever time, database that you've invigorated, this class is important. Relational model is the building blocks for promo time the rest of the semester. And the next class will be about SQL, okay? Bye, guys. Hit it.