Welcome to the second lecture for 15-445/15-645, Database Systems, at Carnegie Mellon. As I said in my last lecture, I'm not on campus right now. Joe and I had to take Inf to surrender himself, so right now we're in a safe house out in Flatbush in Brooklyn. Joe is not here right now; he went with them to the police station and he's dealing with that. It's not looking good, and it's probably going to be quite a while. He might not get out until '23. So it is what it is. As I've said multiple times, if you get involved in databases, some crazy stuff can happen, and sometimes you end up in jail. But that's the life you live when you do databases. It's around two in the morning right now, but we'll be packing up in a day or two and heading back to CMU to proceed with the rest of the lectures in class. I want to give a quick shout-out to the people who've been helping us out: Steezy, EPC Jones in Brooklyn, JL in Seattle, and SFO Mohan, for helping us in our time of need and taking care of us. We really appreciate it. For today's lecture, we're going to be talking about relational algebra. I'll admit the original topic for this lecture was going to be relational algebra and relational calculus, but as I sat and thought about it, I realized that for the purposes we care about in this class, we don't really need to know relational calculus in depth. I'll cover it a little just so you know it exists, but our focus is going to be on relational algebra. Some quick administrative things first: homework one is out. We released it last class, and it'll be due on Wednesday, September 13th; there are instructions online on how to submit it. Then in the same week, on Monday, September 11th, we will release the first programming assignment, the first programming project for the class. We'll provide you with the instructions and the download for how to get started on that day.
I'll also cover a little in class about how to get started and what's expected. There's not really anything you can do to prepare for this other than being comfortable writing C++11 code and debugging it with GDB or Valgrind or whatever you want to use. That'll be due later in September. As I said, in today's class I want to spend a little more time talking about the relational model; we were a bit rushed last time, and I'm going to cover some more background information. Then we'll spend most of our time talking about relational algebra, look a little bit at relational calculus, and finish off by looking at some alternative data models, alternatives to the relational model, so you get an idea of what these look like and how they relate to the relational model. In order to really understand the significance of the relational model, we need to go back in time and see what life was like before it existed. When I first learned the relational model, I remember thinking: oh, this is obvious, right? This is clearly the way you would want to represent data in a database; why would anybody do anything else? And this is a testament to the importance of the relational model, or in some ways its simplicity. It's one of those things where, after it's been discovered, it seems so obvious to everyone: of course this is the way you'd want to do it. How would you do it any other way? But you have to understand that before the relational model, in the 1960s, people were using other data models: the hierarchical data model or the network data model. These are relics of history; they're not necessarily something for us to go study in detail to see exactly how they work.
But the major thing to understand about these early database systems is that they were really difficult to build applications on top of and maintain over time. A big part of this is that there was no clean separation between what I'll call the logical layer of the database and the physical layer. You can think of the logical layer as describing what the database looks like: what tables do I have? What attributes do they have? The physical layer is how it's actually stored in memory or on disk. To give an example of what I mean: one of the first database systems was developed by IBM, a system called IMS, the Information Management System. They developed it for the original Apollo moon mission, to keep track of all the parts they were purchasing for the rocket. IMS is actually still around today; it's still widely used in a lot of older institutions. The way IMS worked is that you would write your application and declare what tables or collections of data you had, and then you would declare how you wanted the database system to actually store them. You could say, for example, this table should be stored using some kind of tree data structure, and this other table should be stored using a hash table. Then, depending on what data structure you ended up using, the system would expose different API methods for you to interact with the data. For example, if you chose a hash table, you could do point queries on it, but you couldn't do range queries; if you chose a tree data structure, you could do both. So in order to make this decision about what physical data structure to use for your logical tables, you basically had to know every query your application could possibly execute before you deployed the database.
The problem was that if you said, at some point in time, I only need to do point queries, single-key lookups, therefore I can use the hash table because that's faster, but then your application changed and you needed to support some kind of range query, you weren't going to be able to do that efficiently on the hash table. So what people ended up having to do, if they changed which data structure they were using, was not only dump the data out and reload it back in, going from a hash table to a tree, but also go through their application code and modify it so it would invoke the tree calls instead of the hash table calls. Around this time, in the late 1960s, there was a mathematician named Ted Codd who was hired at IBM in New York, and he was going around seeing all of these IBM programmers spending time, over and over again, modifying their database application code any time the database's physical data structure changed or the schema changed, because all of this information about what the database looked like at the physical level was baked into the application code. If you added a column, you had to go change all of your application code to deal with that extra column. So what he developed was a mathematical concept called the relational model to solve this problem. He published two key papers on the topic: the first was a tech report at IBM in 1969, and this was followed up by the more full-fledged paper in the Communications of the ACM in 1970. This is essentially where he proposed the relational model.
The significance of this is that it provided an abstraction to avoid all the maintenance overhead people had to deal with in the older database systems, by hiding the physical storage of the database from the logical API through which an application interacts with it. There are three key principles that came out of these first papers that were revolutionary and groundbreaking. The first is that Ted Codd argued we should store databases in simple data structures: tables, for example. This is because the existing database systems had these really complex hierarchical data structures that were prone to corruption; if some piece of that data structure got messed up on disk, the whole database could be lost. So he argued it would be better to store the database tables, or relations, as simple data structures and build more complex things on top of that. The second key principle is that he argued the application should only access the data through a high-level language. What I mean by that is you would write queries on your database without worrying about, or even knowing, exactly how the data is being stored, whether it's being indexed or not; you just deal with it at an abstract level. This was actually a controversial idea at the time, because people argued there was no way a computer could ever generate a query plan that would be more efficient or better than one written by a human. It's the same argument people had at the time about how there's no way a C compiler could ever generate machine code as efficient as handwritten assembly.
The last principle of the relational model, which ties all of this together, is that the physical storage of data is left up to the implementation of the database management system. What I mean is that the application doesn't know, and doesn't have to care, whether the data is stored on disk or in memory, whether it's stored in one file or 20 files, whether it's stored on one machine or multiple computers. It just writes queries using this high-level language. What's nice about this is that the database system can then make decisions after the data is already stored, to modify the layout in memory or how it lays out the files on disk, and it can adapt over time based on what queries show up. So when you first loaded your database, with all your early assumptions about what your application was going to do, if those assumptions ended up changing, you could either let the system adapt or, as an administrator, change the storage of the database without changing any of your application. I can't underscore enough just how groundbreaking the relational model was at the time. It actually took about ten years for people to acknowledge that this was the right way to build databases; there was a lot of pushback from the existing database systems that were out there. Ted Codd was at IBM, and IBM did end up building a relational database system prototype, one of the first, in the 1970s, called System R. They started this around '72, '73, but IBM never commercialized System R, and they didn't release their own relational database system until 1983. That was DB2, IBM's database system that's still available today.
Again, this seems like an obvious thing to do now, 30 or 40 years later, but at the time it was considered pretty radical. In addition to proposing the relational model, Ted Codd also codified the notion of a data model itself. We can define a data model as a collection of concepts for describing the data we want to store in a database. The relational model is one example of a data model, but there are others that we'll talk about later in the lecture. It's a high-level concept to describe what the data looks like. From there, we can define what's called a schema, which is a description of a particular collection of data for a given data model. So we can say our data model is the relational model, and then, for a particular database instance, we can define the various relations stored inside of it: what attributes they have, what domains of values those attributes can have, and so forth. Again, the relational model is just one type of data model; we'll focus on it for this class. I'll admit I come from the school of relational database systems, so I'm a strong adherent to its abilities, and I think it's the right way to go for a lot of cases. We'll look at the other data models later in the lecture. The nice thing about the relational model is that it's really flexible: all of these different data models can actually be stored in a relational database system. It may not be the most efficient implementation; you probably don't want to store a matrix or an array in a relational database system, as that's usually a bad idea, but it's flexible enough that you could. So again, we'll focus on the relational data model for this class, because it's the de facto standard and the most widely used data model in database applications today.
Every major database system out there that's widely used, by large companies, startups, and small organizations across the board, is more often than not a relational database system. The relational model is comprised of three components. The first is that it describes the structure of the database: the definition of the relations and their contents, what attributes and domains they have. Second, it describes the integrity of the database, meaning what constraints we can impose on the tuples and their attributes that have to be satisfied for a tuple to be validly stored. And lastly, it defines how we manipulate the database, meaning how we access and update its contents. This last one is the focus of what we're talking about today, relational algebra, but we'll first spend a little time on the first two. As we defined last class, a relation is an unordered set of tuples whose attributes represent entities in the real world. To use the example of the music store we looked at last time, a relation here would be the artists, the people putting out music, and each entry or entity within the artist relation has a name, a year, and a country. We can define, for each of these attributes, the domain of values they're allowed to have: a year would be some year in a calendar, and a country would be some country in the world. Under the original definition of Ted Codd's relational model, the values in a relation have to be atomic, or scalar. That means you can't have lists, you can't have arrays, you can't have nested values. We'll see how we can relax that later on; most databases nowadays can support things like arrays or nested objects.
But under the original definition, you couldn't have that. There's also a special value called null, which works a lot like nulls in programming languages: it essentially means that the value is unknown. So we can have a value where we know something exists, and we can have one where we don't know, and that would be null. We'll see later on how we store null and how we deal with it during query processing. Just to drive home the mathematical underpinnings of the relational model: in the literature, my example here would be referred to as an n-ary relation, where n is the number of columns. For our purposes in this course, an n-ary relation just means a table with n columns. The n-ary stuff will come up later when we talk about storing tables on disk, either as a row store or as a column store, and there will be different schemes we can use based on this, which we'll cover later. Last class, we talked a little about candidate keys, but I want to walk through an example to show exactly how these work. Every relation has to have at least one candidate key that uniquely identifies one and only one tuple. In our example of the artist relation, there are actually several candidate keys for this relation instance. We could use the name, because we only have three tuples and the name is always unique; or a combination of name and year, or name and country, or all three attributes, name, year, and country, because all of these are unique. But these candidate keys are tied to this specific instance of the relation: if we modify the contents, they may no longer be candidate keys, because they won't uniquely identify tuples. To give an example, let's say that Joy goes back to India.
And for whatever reason, he starts a new hip hop group, and he names it Ice Cube. So now the name by itself can't be a candidate key, because there are two artists in our relation with the same name. I don't know why Joy would do this; the real Ice Cube probably would not like it, and he'd probably have some words with Joy. But for whatever reason, Joy has done this. Now let's say somebody else, in the United States, decides: I'm also going to start a group, and I'm going to call it Ice Cube. Now name and year can't be a candidate key, because that won't uniquely identify a tuple. And let's say somebody else, even crazier, starts a third group called Ice Cube, also in the United States, also in 2017. Now none of these are candidate keys for this relation, because no combination of them uniquely identifies a single tuple. In the last lecture, I showed how we can rectify this by adding a synthetic ID column that we guarantee to be unique across every tuple. Now any combination of attributes that includes the ID field is a candidate key, because it's always guaranteed to be unique: ID by itself is always unique, and therefore adding any additional attribute to it is also unique. These are all still valid candidate keys, even though they seem somewhat redundant. The primary key is a selected candidate key that is deemed more important than the other candidate keys. In this case, the ID attribute by itself is enough to uniquely identify any tuple, and it's the minimum number of attributes you need to do this, so we can designate it as the primary key. Primary keys will show up later on when we talk about query processing, integrity constraints, and things like that.
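One way to picture the candidate-key reasoning above: a set of attributes is a candidate key for a particular relation instance only if projecting every tuple onto those attributes produces no duplicates, and adding the synthetic ID always restores uniqueness. Here's a minimal sketch, not from the lecture, with invented names and values:

```python
# Hypothetical sketch: checking whether a set of attributes is a
# candidate key for one specific relation instance.
def is_candidate_key(relation, attrs):
    """True if no two tuples share the same values for attrs."""
    seen = set()
    for row in relation:
        key = tuple(row[a] for a in attrs)
        if key in seen:
            return False  # duplicate: attrs do not uniquely identify tuples
        seen.add(key)
    return True

# Invented instance: three groups all named Ice Cube, US, 2017.
artist = [
    {"id": 101, "name": "Ice Cube", "year": 2017, "country": "US"},
    {"id": 102, "name": "Ice Cube", "year": 2017, "country": "US"},
    {"id": 103, "name": "Ice Cube", "year": 2017, "country": "US"},
]

print(is_candidate_key(artist, ["name", "year", "country"]))  # False
print(is_candidate_key(artist, ["id"]))                       # True
```

Note that any attribute list containing "id" passes the check, which is exactly why the redundant-looking combinations are still valid candidate keys.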
Typically, this is something that you as an application developer have to declare; the database system is not going to do it for you. You can create tables in some database systems without primary keys, but usually what happens then is that they install one of these synthetic sequence IDs as an internal primary key. MySQL, for example, if you don't declare a primary key, uses a row ID as the primary key, and internally, for all its bookkeeping, like tracking which indexes point to which tuples, it uses this internal ID. Again, we'll see this later when we talk about row storage or tuple storage. We also talked about foreign keys. A foreign key is a way to record that there is a relationship between two different relations. In this example, for this database, we said an artist puts out albums, and we said there was a one-to-many relationship between an artist and an album: an artist can appear on multiple albums. Actually, sorry, it's many-to-many: an artist can appear on multiple albums, and an album can have multiple artists. So the issue we have to deal with is how we're actually going to store this. In the case of the first album here, we only have one artist, so we can store that as a scalar; that's not a problem. But for the second one, this guy would have multiple artists appearing on his mixtape. Under the original definition of the relational model, we can't store a list here, so there's no way to natively store this in the database with multiple entries. The way we model this is to create a new cross-reference table that maps artist IDs to albums. In this case, we have two columns in this relation: the artist ID maps to the artist table, and the album ID maps to the album table.
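The cross-reference scheme just described can be sketched in SQL. Here's a hypothetical version using Python's sqlite3, where the composite primary key on (artist_id, album_id) is what prevents the same artist from appearing on the same album twice; all table and column names are illustrative, not the course's exact schema:

```python
# Hypothetical schema sketch using sqlite3 (names are illustrative).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE artist (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE album  (id INTEGER PRIMARY KEY, name TEXT);
    -- Cross-reference relation: one row per (artist, album) pairing.
    CREATE TABLE artist_album (
        artist_id INTEGER REFERENCES artist(id),
        album_id  INTEGER REFERENCES album(id),
        PRIMARY KEY (artist_id, album_id)
    );
""")
conn.execute("INSERT INTO artist VALUES (1, 'Some Artist')")
conn.execute("INSERT INTO album  VALUES (11, 'Some Mixtape')")
conn.execute("INSERT INTO artist_album VALUES (1, 11)")
try:
    # Same artist on the same album again: rejected by the composite key.
    conn.execute("INSERT INTO artist_album VALUES (1, 11)")
except sqlite3.IntegrityError:
    print("duplicate (artist, album) pair rejected")
```

The design choice here is that the pairing lives entirely in its own relation, so neither the artist table nor the album table needs a multi-valued attribute.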
Now, in the album relation, we don't need to store the artist ID anymore; that's stored separately in the cross-reference relation. And in this artist-album relation, we mark the combination of the artist ID and the album ID as the primary key, which ensures that the same artist cannot appear on the same album in our database multiple times. We designate the primary key of a relation with an underline. All right, so that's the same material we covered last class in terms of what the relational model looks like for the things we care about. Now we want to talk about how we're actually going to store and retrieve information from our relational database. There are essentially two classes of what are called data manipulation languages, DMLs, for storing and retrieving information from the database. The first is a class of languages called procedural languages, or navigational languages, where the query specifies a high-level strategy the database system can use to compute the answer the user wants. The alternative is a non-procedural language, where the query only specifies what data it wants and doesn't tell the database system anything about how to find it. The first class, the procedural languages, is what relational algebra is. When I say the high-level strategy for how to execute the query, I don't mean low-level operations on data structures: I'm not saying that if we want to sort data we should use quicksort instead of bubble sort, or that if we want to do a join we should use a hash join instead of a sort-merge join. I just mean the steps you need to take in order to compute the answer. The second class is what relational calculus is, and that is what SQL is derived from.
In SQL, you specify: here's the answer I want, but you don't tell the database system exactly how to compute it. It runs through the planner and the optimizer, and they figure out the best strategy to do this for you. For our purposes in this course, we care about relational algebra, because when we build a database system that executes SQL queries sent to us from the application, what we're going to be operating on is a plan tree comprised of relational algebra operators. The non-procedural side, relational calculus, is what you would use if you wanted to create a new programming language to replace SQL; you would want to base it on relational calculus. Since this is not a programming language theory course, we don't care about this. SQL is good enough for us: it's not a perfect language, but it's pretty good and it covers most things. So we're going to focus on relational algebra, because that's what we need to know, and it will come up multiple times when we start building a database system and have to execute queries. The relational algebra that Ted Codd defined in 1970 was originally comprised of seven basic operations, and these are the fundamental primitives we can use to write queries that retrieve and manipulate tuples in a relation. It's originally based on set algebra, so if you've taken a discrete math or set theory course, a lot of these concepts should be pretty straightforward. Ted Codd also added three additional operators that are specific to operating on relations in a database. Some of these operations are unary, meaning they take in one relation and output another, and some are binary, meaning they take two relations as input and output another.
But all of these operators always produce a new relation as their output. What this means is we can chain together multiple operators in succession, having the output of one operator be the input for the next. This allows us to implement more complex queries than these basic primitives can do alone, and for what we'll talk about here, this covers most of the kinds of things you'd want to do in SQL. Now, there were extensions to relational algebra, both from Ted Codd and others, later in the 1970s and early 1980s. For this lecture, I'm going to mostly focus on the original definition; any time we deviate and talk about the extensions, I'll mention it. I'd say most of the extensions are just things you need for the kinds of things you'd want to do with SQL. All right. The first operator is the select. I think in the original works they may have called this restrict instead of select, because the way to think about it is not necessarily select, like writing a SELECT statement in SQL; it's more that we're going to choose a subset of tuples from our relation and produce that as our output, and we choose that subset by restricting which ones we emit. This is based on a selection predicate. The predicate acts as a filter, emitting only the tuples that fulfill it, meaning those that evaluate to true. We can generate more complex predicates by combining them using conjunctions and disjunctions, following the same Boolean logic you'd see in any discrete math course. The syntax uses a lowercase sigma, which is easy to remember because select starts with S and sigma is the Greek S. In the subscript, you define the predicate, and within the parentheses, you define the relation you want to do the select on.
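To make the select operator concrete, here's a minimal sketch, not from the lecture, that models a relation as a Python set of tuples and the predicate as a plain function; the relation's contents and the predicate's values are made up:

```python
# Hypothetical sketch: the select operator (sigma) as a filter.
# A relation is modeled as a set of tuples; the predicate is a
# Python function that returns True for tuples to keep.
def select(relation, predicate):
    """sigma_predicate(relation): emit only tuples satisfying predicate."""
    return {t for t in relation if predicate(t)}

# Made-up relation R(a_id, b_id).
R = {("a1", 101), ("a2", 102), ("a2", 103), ("a3", 104)}

# sigma_{a_id = 'a2'}(R)
print(select(R, lambda t: t[0] == "a2"))

# Conjunction: sigma_{a_id = 'a2' AND b_id > 102}(R)
print(select(R, lambda t: t[0] == "a2" and t[1] > 102))
```

Because the output of select is itself a relation (here, another set of tuples), it can be fed directly into another operator, which is the chaining idea described above.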
Say we have a simple relation like this, with attributes AID and BID, and just these four tuples. We can write a query using the select operator by defining our predicate as AID equals A2, and that produces these tuples here. We can do more complex things, as I said, using conjunctions and disjunctions: now we can select the tuples from R where AID equals A2 and BID is greater than 10. If you understand SQL, this should look very familiar; this is essentially how we would represent a SQL query like this, what it would get converted into as the corresponding relational algebra query. This is a SELECT * FROM R with that same predicate we had in the second example. The next operator is projection. With projection, we output a new relation whose tuples contain only the specified attributes in our projection list. With this, we can rearrange the ordering of the attributes in our output. And then, although this was not in the original definition, we can... hold on, something's going on. OK, now we're cool. All right. You got it? All right, sorry. So although this was not in the original definition of the relational model from Ted Codd, we can also manipulate the values of attributes when we output them in our new relation. We define this using the lowercase pi symbol, which again is easy to remember because pi starts with P and projection starts with P. Then we have the list of attributes we want to emit from our relation. Using the same table from the last slide, I can do a projection like this, where I first do a select on the R relation, filtering for tuples where AID equals A2.
Then in my projection, I take BID, subtract 100 from it, which manipulates its value, and also emit AID. In the original relational model, there's no real ordering to attributes, but in practice, when we look at output, we usually see an ordering, and again, we can manipulate the order of our output here. This is equivalent to a SQL query like this, where you have your projection list right after the SELECT keyword, and it produces the same result. This is also another example in relational algebra of daisy-chaining the different operators together to do more complex things: I did the select first, followed by the projection. All right, now we can get into the binary operators, and these are exactly what you'd expect from basic set theory. The first is union, where we generate a new relation that contains all the tuples that appear in either the first relation, the second relation, or both. In the example, we have relation R and relation S; if I take the union of the two, I get all the tuples of R and all the tuples of S in my output. A couple of things to point out: this only works if there's the same number of attributes in both relations. If there were a third attribute in S, I wouldn't be able to take the union, because that attribute would be undefined for relation R. Another thing to note is that although in my example I have the tuples ordered with R first followed by S, the relational model is unordered, so different database systems could produce these outputs in any order they want, and the results would still be correct and equivalent. In SQL, there's a UNION keyword we can use to take the output of two SELECT statements and union them together, and you get the same result.
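In the same hypothetical set-of-tuples style as before, projection can be modeled as mapping a function over every tuple, which is also how the value manipulation (BID minus 100) and attribute reordering show up; union is then just Python's set union, assuming both relations share a schema. All names and values here are invented:

```python
# Hypothetical sketch: projection (pi) over a set-of-tuples relation.
# The mapper function picks, reorders, and/or transforms attributes.
def project(relation, mapper):
    """pi_attrs(relation): apply mapper to every tuple."""
    return {mapper(t) for t in relation}

# Made-up relations R(a_id, b_id) and S(a_id, b_id).
R = {("a1", 101), ("a2", 102), ("a2", 103)}
S = {("a3", 104), ("a2", 102)}

# pi_{b_id - 100, a_id}( sigma_{a_id = 'a2'}(R) ): select first,
# then project, manipulating b_id and reordering the attributes.
selected = {t for t in R if t[0] == "a2"}
print(project(selected, lambda t: (t[1] - 100, t[0])))

# R union S: every tuple appearing in either input; duplicates
# collapse because the model here is set-based.
print(R | S)
```

Note that the tuple ("a2", 102) appears in both R and S but only once in the union, matching the set semantics described above.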
Most database systems out there that support SQL should support the UNION operator. In addition to union, you also have intersection, where the output only contains tuples that appear in both relations. In this case, only the tuple (A3, 103) appears in both relations, so when we do the intersection, we get only that one tuple in our output. Just like with union, to take the intersection in SQL there's the INTERSECT keyword, and again, we have to have the same number of columns in both relations for it to work. Lastly, we can also do set difference, where the output only contains the tuples that appear in the first relation and not the second. In this case, (A3, 103) appears in the second relation, so it would not be included in the output when we take the difference; only the first two tuples would appear. For this, there's no DIFFERENCE keyword in SQL; you would use the EXCEPT keyword, which is the equivalent operator. All right. The last operator borrowed from set theory is the product, sometimes also called the Cartesian product. What it does is generate a relation that contains all possible combinations of the two input relations: for every single tuple in R, I combine it with every tuple in S, then go to the next tuple in R and do the same thing. So if I take the Cartesian product of these two relations, I end up with a giant relation like this, with nine different entries, one for every combination of a tuple in R and a tuple in S. In SQL, there are actually two ways to do this.
There's a CROSS JOIN keyword you can use to take the cross product between two relations. But also, if you just select from two relations and don't have a WHERE clause that defines how you want to join them together, it ends up taking the cross product, and you get the same result. And we'll see this multiple times: there are often many different ways to write the same query and produce the same result. It's not always the case that the two queries are run exactly the same way; different database systems have different optimizers with different capabilities, and one may end up being faster than another. So you look at this Cartesian product here, and it seems like kind of a useless thing to do. Why would anybody actually want to do something like this? I would say it actually shows up in a couple of different scenarios. In testing, for example, if you want to take all possible combinations of two datasets to do testing on them, you could use a cross join. If you want to create a matrix of all possible combinations for doing some kind of evaluation or analysis, the Cartesian product or cross join would do that kind of thing. Now, the next operator is sort of a superset of the cross join, and that is just a straight join in relational algebra. What this is doing is it's going to combine together the tuples of the two tables where there is a match between a tuple in the first relation and the corresponding tuple in the second relation, based on common values in their attributes. What I mean by that: in this example here, I have relation R and relation S. They both have the same schema, right? They both have two attributes, a_id and b_id.
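Here is a small sketch of the two SQL spellings of a Cartesian product, again with hypothetical single-column relations, alongside the same "all combinations" set computed in plain Python for comparison.

```python
import itertools
import sqlite3

# Hypothetical single-column relations r and s.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE r (a_id TEXT);
    CREATE TABLE s (b_id INTEGER);
    INSERT INTO r VALUES ('a1'), ('a2'), ('a3');
    INSERT INTO s VALUES (101), (102), (103);
""")

# Explicit CROSS JOIN keyword.
explicit = conn.execute("SELECT * FROM r CROSS JOIN s").fetchall()

# Implicit: two relations in the FROM clause with no join predicate at all.
implicit = conn.execute("SELECT * FROM r, s").fetchall()

# itertools.product builds the same set of all pairings in plain Python.
expected = list(itertools.product(["a1", "a2", "a3"], [101, 102, 103]))
```

Both spellings pair every tuple of r with every tuple of s, so three tuples times three tuples yields nine output rows.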
So if I take a join operator on them, then for every single tuple in R, I would check to see whether there exists a tuple in S with the same a_id and b_id. The way this works in relational algebra is that it looks for matches on attributes with the same name. What I mean by that is, if I ended up changing the second attribute name in S and called it xyz, when I did my join here it wouldn't do the comparison with b_id; it would only look at a_id in these two relations, because that's the only attribute name that matches. So this is sort of like a Cartesian product, except you're actually checking to see whether there are matches, whereas the Cartesian product just takes everything. Joins are a very important part of database systems; it's where queries usually spend most of their time, and so this is something we're going to focus a lot on in this course. In terms of writing it in SQL, this particular example is what's called a natural join, because we're not specifying exactly what columns or attributes we want to use to compute the join; we're looking for matches based on the common names. In SQL, there is a NATURAL JOIN keyword you can use to accomplish this. It's usually not recommended, because if you end up changing the schema, your queries might turn into a cross join and produce a result you weren't looking for. There's nothing in the SQL statement here that clearly defines exactly how you want to compute the join; it's implicit based on the schema, and if the schema changes, the query result may change, which may not be what you want. We're not going to cover exactly how you implement joins right now, but I do want to cover the different join types you could possibly have. The first one was the cross join; we already covered that.
That's the same thing as a Cartesian product. The next class of joins are called inner joins, and this is what the natural join was part of. The way an inner join works is that a tuple in the first relation must have a corresponding match in the second relation, and how you define that corresponding match depends on what kind of inner join you're doing, which I'll show in the next slides. I would say the inner join is probably the most common type of join you're going to have; it's what people typically think of when they think of a join query. The alternative to the inner join is the outer join. With this type of join, a tuple in one relation does not necessarily have to have a corresponding match in the other relation, whereas with an inner join you do have to have that. Again, how you define the corresponding match depends on what kind of join you're doing, which I'll show in a few seconds. So for inner joins, there are basically four types. We've already covered the natural join, where we match tuples in R and S whose shared attributes are equivalent, and we assume based on the names which attributes are shared. I'll also say that the natural join is what was defined in the original relational model. It's called natural because the idea is that this is the natural way you would join two relations that have the same attributes. Another class of inner joins are called theta joins or equi-joins. Theta join is the broad name for this type of join: you're going to match tuples between two relations using some arbitrary join predicate defined by theta. And if theta is just equality predicates, like does a_id in R equal a_id in S, then it's considered to be what's called an equi-join.
So a natural join is essentially the same thing as an equi-join, but the predicate is implicit, right? If we take a natural join of R and S using the example from before, we'll produce this answer. The theta join would be written like this, where now the theta variable defines exactly how we want to do our join. Here it's R.a_id equals S.a_id and R.b_id equals S.b_id. And for this, we see in our output that we're actually including all of the attributes from both R and S, whereas in the natural join we only included the shared columns once. I don't remember exactly how theta joins are defined in the original relational model, but I will say that if you run this in SQL, a natural join query will give you the shared attributes only once, while a theta join, where you're explicitly defining exactly how you want to do the join, will give you all four attributes. So you could write the query like this: SELECT * FROM R INNER JOIN S, with an ON clause where we define the theta part of our join, how we join these together. Here we're explicitly saying that we're doing an inner join. We can also just remove INNER, and by default you get an inner join. So you can write this as SELECT * FROM R JOIN S, again with the same ON clause, and you produce the same result. And a third way to write this is just SELECT * FROM R, S, and then in the WHERE clause we define what the join predicate actually is. Any really good database system with a decent optimizer will be able to identify that all three queries here are equivalent and execute them using the best plan it can find. I will say, though, that I think some earlier systems preferred the ON clause, because there you're explicitly defining how to do the join and they could optimize for that.
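The three equivalent spellings of the same inner join can be checked directly in SQLite. As before, the tables and data are hypothetical, chosen to match the running R and S example.

```python
import sqlite3

# Hypothetical relations r and s.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE r (a_id TEXT, b_id INTEGER);
    CREATE TABLE s (a_id TEXT, b_id INTEGER);
    INSERT INTO r VALUES ('a1', 101), ('a2', 102), ('a3', 103);
    INSERT INTO s VALUES ('a3', 103), ('a4', 104), ('a5', 105);
""")

queries = [
    # Explicit INNER JOIN with an ON clause.
    "SELECT * FROM r INNER JOIN s ON r.a_id = s.a_id AND r.b_id = s.b_id",
    # JOIN defaults to an inner join when INNER is omitted.
    "SELECT * FROM r JOIN s ON r.a_id = s.a_id AND r.b_id = s.b_id",
    # Comma-separated FROM list with the join predicate in the WHERE clause.
    "SELECT * FROM r, s WHERE r.a_id = s.a_id AND r.b_id = s.b_id",
]
results = [conn.execute(q).fetchall() for q in queries]
```

Note the theta-join behavior described above: SELECT * returns all four columns, with the matching a_id and b_id values appearing once from each side.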
But if you have a decent optimizer, it can extract from the WHERE clause how you're doing the join and generate the same kind of plan. So all three of these SQL queries are equivalent. The other two types of inner joins you can have are semi-joins and anti-joins. In a semi-join, you're going to generate a relation that contains the tuples of R that match with a tuple in S, but unlike a natural join, the output relation only contains tuples from the first relation. We designate that we're doing a semi-join by not having the vertical bar on the right-hand side of the join operator. So if I do a semi-join of R on S, I would produce just the a_id and b_id values from R, not from S. I would write that in SQL using the EXISTS keyword, where any time there's a matching tuple in S, we emit our tuple from R. We'll talk more about semi-joins later on when we discuss join algorithms. I'll also say that semi-joins are really important for distributed databases, because they allow us to only send data from the first relation and we don't have to send the second relation. This matters in a distributed database because going over the network is really slow, and using a semi-join reduces the amount of data we have to send around. Related to the semi-join is the anti-join, where we only output the tuples of R that don't have a match with any tuple in S. We write this using this triangle operator here: R anti-join S. The equivalent SQL looks like this: we use NOT EXISTS, and otherwise it looks exactly like the semi-join. Our output only contains the attributes and tuples from R that didn't match anything in S.
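The semi-join and anti-join can be sketched with EXISTS and NOT EXISTS, as described above. Tables and data are hypothetical, and notice that the output columns come only from r.

```python
import sqlite3

# Hypothetical relations r and s.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE r (a_id TEXT, b_id INTEGER);
    CREATE TABLE s (a_id TEXT, b_id INTEGER);
    INSERT INTO r VALUES ('a1', 101), ('a2', 102), ('a3', 103);
    INSERT INTO s VALUES ('a3', 103), ('a4', 104), ('a5', 105);
""")

# Semi-join: tuples of r that have a matching tuple in s.
semi = conn.execute("""
    SELECT * FROM r
     WHERE EXISTS (SELECT 1 FROM s
                    WHERE s.a_id = r.a_id AND s.b_id = r.b_id)
""").fetchall()

# Anti-join: tuples of r with no match in s.
anti = conn.execute("""
    SELECT * FROM r
     WHERE NOT EXISTS (SELECT 1 FROM s
                        WHERE s.a_id = r.a_id AND s.b_id = r.b_id)
""").fetchall()
```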
Outer joins are different from inner joins. With inner joins, you always have to have a corresponding match in both relations, and if there is no match, no tuple gets emitted. An outer join allows you to specify: I want to try to find a match for every tuple in the first relation with the second relation, but if I can't find a match, I still want to output the tuple from the first relation. In the case of a left outer join, you write that using the join symbol with the little horns pointing to the side that is the outer part. In this case here, what happens is, say I do a left outer join of R to S. If a tuple in R matches a tuple in S, then you include both of them. So here, the last tuple, (a3, 103), has a match in S, so you include the whole thing. But the first two tuples, a1 and a2, don't have a match in S. Rather than excluding them entirely, we put NULL where the values of the attributes from S would normally go. The reason you would want to do something like this: say I have a database with a bunch of customers and a bunch of their orders, and I want to generate a list of all my customers, but I also want to include the date of the last order each one made. Now, say I have some customers that haven't placed orders yet, and I still want to include them in this list. This is where a left outer join helps you, because otherwise there won't be a match from the customer table to the order table, and those tuples would end up going missing. An outer join solves that problem.
The right outer join is the same thing as the left outer join, just with the input relations reversed. And the full outer join is where you can take anything from the right side or the left side and include it in your output, even if it doesn't have a match. In SQL, you write it like this: in your FROM clause, you explicitly define that you're doing a LEFT OUTER JOIN, and then you have an ON clause specifying how you want to compute the join. All right, so some additional operators that I'm not going to go through, but that you should know exist: rename, which allows you to rename an attribute to a new name; assignment, which allows you to take the output of one operator and store it in your database as a new relation; duplicate elimination; aggregation, computing things like min, max, average, count, and sum; sorting, which is necessary because the relational model is unordered sets and oftentimes applications need their results sorted, so there's an additional sorting operator; and division. I'm not going to cover division at all. In my 15 years of doing databases, I've never had to do any division outside of an academic setting. Oftentimes in courses, professors show how you derive the division operator from the other primitive operators, but again, I think it's really unnecessary. It only matters from an academic standpoint; it doesn't actually appear in the real world. All right. So as I said, that covers the high-level relational operators that we care about, and we'll see these things appear later on when we talk about query processing.
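The customers-and-orders scenario above can be sketched as a LEFT OUTER JOIN in SQLite. The table names, columns, and rows are all hypothetical: the point is that a customer with no orders still appears, with NULL (Python's None) padding the missing order columns.

```python
import sqlite3

# Hypothetical customer and orders tables; Bob has placed no orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cid INTEGER, name TEXT);
    CREATE TABLE orders (cid INTEGER, order_date TEXT);
    INSERT INTO customer VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (1, '2023-09-01');
""")

rows = conn.execute("""
    SELECT c.name, o.order_date
      FROM customer AS c
      LEFT OUTER JOIN orders AS o ON c.cid = o.cid
     ORDER BY c.cid
""").fetchall()
# Bob survives the join even though he has no match in orders.
```

An inner join on the same data would silently drop Bob from the result, which is exactly the missing-tuples problem the outer join solves.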
But I'll make an observation now that I sort of made earlier in the lecture: although relational algebra isn't writing low-level assembly, or making calls to the database to manipulate low-level data structures, it is still specifying, at a high level, exactly what steps the database system should do in order to process your query. Take this example here, where I want to do a join on the R and S relations, and then I want to filter so that I only keep tuples where b_id equals 102. There are two alternatives that I'm showing here. The first is where we do the join on R and S first, followed by the selection to only include tuples where b_id equals 102. In the second example, I'm doing the selection on S first to filter out the tuples that I need; then I take the output of that select operator and do a join with R to compute the answer that I want. A lot of these operators are commutative, so the ordering doesn't change the result: both of these queries will produce the same answer. But they can have significantly different performance when you actually execute them. Let's say R and S each contain a billion tuples. In my first example, if I join R and S first, I'm joining a table with a billion tuples against another table with a billion tuples. And let's say that for my filter on b_id, there are only 10 tuples in S that actually match. So I'm first going to do a join between two billion-tuple tables, and then, after I produce potentially another relation with a billion tuples, I'm going to end up throwing away everything but 10 of them.
And so if I do it the second way, doing the filter on S first with the selection, then I find just the 10 tuples that actually match. Now when I do my join, I'm joining R, with its billion tuples, against just 10 tuples. That's going to be much, much faster to compute and use much, much less memory; and if I'm in a distributed database, I have to transfer less data over the network. So in this particular example, the second relational algebra query is the better one to use. But if we're given the first one, we have to execute that, because that's what the user asked us to do. A better approach is for the application developer to just state to the database system, at a high level, what query they actually want, and then let the database system figure out the best way to implement that query. What I mean by that: let's say we tell the database system that the query we want to execute is to retrieve all the joined tuples from R and S where b_id equals 102. I'm saying this in English; I'm not specifying exactly what steps to do, I'm just telling it at a high level exactly what I want. This is essentially what SQL is, right? SQL is telling the database system: here's the answer I want you to compute; I'm not telling you exactly how to compute it. In our current age, this seems like the obvious thing to do, because a database management system can almost always produce a better, more efficient query plan to execute your query than what the average person can write themselves. There are obviously some corner cases or really complex scenarios where maybe a human DBA can do a better job, but for most queries, the database system is in the best position to figure out the best plan to execute your query.
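The two plans can be sketched in plain Python on a toy scale. The relations and the filter value are hypothetical; the point is that both plans produce the same answer, but the filter-first plan feeds a far smaller input into the expensive join.

```python
# Hypothetical relations of (a_id, b_id) tuples.
R = [(i, 100 + i) for i in range(1000)]
S = [(i, 100 + i) for i in range(1000)]

def nested_loop_join(r, s):
    # Naive join on b_id: compare every tuple of r with every tuple of s.
    return [(ra, rb, sa, sb) for ra, rb in r for sa, sb in s if rb == sb]

# Plan 1: join first (1,000 x 1,000 comparisons), then filter on b_id = 102.
plan1 = [t for t in nested_loop_join(R, S) if t[1] == 102]

# Plan 2: filter S down first, then join R against the tiny result.
s_small = [t for t in S if t[1] == 102]
plan2 = nested_loop_join(R, s_small)
```

Plan 1 materializes a 1,000-tuple join result and throws almost all of it away; Plan 2 joins against a single surviving tuple. This is the "predicate pushdown" idea a query optimizer applies automatically.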
And this will be a recurring theme throughout this semester: I'll keep saying that the database management system almost always knows better than the human, and therefore it should be the one that makes the decisions about certain things, like how to execute queries, how to store data, and so forth. And this is what SQL essentially does for us: we say exactly what we want, and the database system figures out the best way to execute it. Now, the underpinning of SQL is relational calculus, because relational calculus is a non-procedural data manipulation language that allows you to operate on relational databases without specifying exactly what steps to take. Relational algebra is still at a high level, in that we're not specifying exactly what algorithm to use, like quicksort versus bubble sort, for our different operators, but we are still specifying the order in which we want them executed. In relational calculus, we're just declaring at a high level what answer we want. There are two variants of relational calculus: tuple relational calculus and domain relational calculus. Tuple relational calculus is what Ted Codd originally proposed in his papers in the 1970s. The way to think about it is that we're going to define variables that will be bound to tuples, and we want to find the tuples in a relation that satisfy whatever selection predicate we're applying. In domain relational calculus, instead of thinking about tuples, you think about attributes: you define your query based on finding the actual values within a relation that make up the answer you want. So let's look at an example of what I mean. Say we have a relational algebra query where we're doing a selection on R, and we want to find all the tuples where a_id equals 'a2' and b_id is greater than 102.
In tuple relational calculus, it would look something like this, where we say that we want a tuple that could exist anywhere in the universe, but specifically we want one that is in our relation R and that satisfies our predicate. Again, this is another way of writing the selection operator, but using the syntax and semantics of relational calculus. Now in the case of domain relational calculus, we're not operating over tuples, we're operating over attributes. Here we say that we're looking for a sequence of attribute values, a_id and b_id, such that a tuple with those values exists in relation R and they satisfy the predicate we had before. Again, these produce the same answer in all three cases: tuple relational calculus and domain relational calculus are equivalent to each other and to relational algebra. They're just other ways of specifying what our queries look like. And that's really all I'm going to say about relational calculus; it's just for you to know that it exists. If you ever need to write your own programming language to operate on a relational database, you could look at these things, understand how they work, and derive a programming language from them. For our purposes, from a systems-internals perspective, we only care about relational algebra. Okay. So, as I showed before, there are a bunch of different data models out there. Now that we have a better understanding of what the relational model looks like, I want to spend some time going through some of these other data models, viewed through the lens of how the relational model works. This lets us understand the advantages and disadvantages of each of them.
I specifically want to focus on these four right here, because they're associated with what's known as the NoSQL database movement. NoSQL systems were originally defined as eschewing SQL or the relational model. There are a bunch of other things they're often characterized as doing, like not supporting transactions; we'll cover that later. For now, let's just look at these systems from the standpoint of their data models. The first one is what's called the key-value data model. The way to think about this is as a relation that only has two attributes: a key and a value. The database system can store these in an associative array, or, depending on the programming language you're using, a dictionary or a hash table. Essentially, it maps a key to a value. If we use the artist table from our earlier example, under the key-value data model the name would be the key, and the value would be a composition of the year and the country. The way this would be stored is sort of like this: you have a mapping from the key, which is the name, to the value. Now, in this particular example, the value has two attributes, year and country. But in most key-value database systems, the value is just a byte array, meaning the database system doesn't know anything about what's actually being stored in it. It doesn't know that there are two attributes, that the first one is the year and the second one is the country; it just sees a byte array. What this means is that it's up to the application itself to interpret the contents of the value. In this example, we would have to take our value, split it on the comma, and then convert the first entry to be the year and the second entry to be the country.
The problem with this is that if we end up changing our schema, so that our value now has three attributes, we either need to go back and update the entire database by hand, or we need to modify our application to deal with some values that have two attributes and some values that have three. And on top of that, we also need to do our casting to convert these values to the types that we want. This is actually very reminiscent of the days before the relational model, because we're baking into the application code exactly what our database looks like, its actual physical storage representation. Now, there are some key-value database systems that do support structured values, meaning you can declare that the value is comprised of multiple attributes and the system can then natively operate on them. You would need this in order to do things like secondary indexes. If you wanted to do a lookup to find, in this example, all the artists from the USA, then without values that are interpretable on the server side, you'd have to download the contents of everything and then find the thing you're looking for, whereas in some of these other systems you can index attributes directly inside the values. There are a bunch of key-value database systems out there. Probably the most famous one is Redis, which is often used as a caching server. Amazon's DynamoDB was another famous one that was very influential. RocksDB and LMDB are embedded database systems that expose a key-value API, as is BerkeleyDB. What I'll say is that, if you squint, every database system sort of looks like a key-value store; the difference is whether the value is actually interpretable by the database system or whether that has to happen at the application level.
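Here is a sketch of application-side value interpretation in a key-value store. The store sees only opaque byte strings; the function name, the "year,country" encoding, and the data are all hypothetical, invented for illustration.

```python
# A toy key-value store: keys and values are opaque byte strings.
store = {
    b"Wu-Tang Clan": b"1992,USA",
    b"Ice Cube": b"1986,USA",
}

def get_artist(store, name):
    # All schema knowledge lives here, in application code: split the value
    # on the comma, then cast each field to the type the application expects.
    year, country = store[name.encode()].split(b",")
    return int(year.decode()), country.decode()
```

If the schema grows a third field, every caller of this parsing logic has to change in lockstep with the stored bytes, which is exactly the coupling between application code and physical representation described above.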
For example, MySQL's InnoDB engine is, at a high level, essentially a key-value store, but it knows how to interpret the values without relying on the application to do so. The next data model is the graph data model, where you represent the database as a collection of nodes, edges, and the relationships between them. For each node and edge, you can annotate it with additional metadata or properties that define its contents. So again, using our artist and album tables as an example, we would have a node that represents an artist, in this case the Wu-Tang Clan. We would specify that it's an artist node type, and then we would have separate nodes for the different albums, and then edges between them that define that an artist created a particular album or appeared on an album. There's no standard query language that's widely adopted across all the different graph database systems. SPARQL is probably the oldest and most famous, but a lot of the major graph databases use different languages. There are a couple of graph database systems out there; Neo4j is probably the most famous one among the startups in this area. I don't want to get into the debate right now, but I will say that there's nothing preventing you from storing a graph in the relational model. Strictly from a data model standpoint, the two are essentially equivalent. What makes these graph database systems maybe better at operating on graphs than a relational database is that they provide an API that lets you do common graph operations natively, like walking a graph, for example. It's kind of hard to do that in SQL; you can do it, but it's not trivial.
For most of the major relational database systems out there, there are middleware packages you can install that provide this graph API on top of the relational database, and they'll do all the things that these special-purpose database systems can do for you. Again, it's another debate whether a graph database is a good idea; I just want to show that this data model exists and people are actually using it. The next data model is called the document data model. When I say document, I don't mean a Microsoft Word file or a PDF; think of it as an XML document or a JSON object. A document is a self-contained record that contains a description of all the attributes contained within it and their corresponding values. So again, using the artist table as an example, we could represent a row as a single record in a JSON object like this, where we have the name, the year, and the country, along with their corresponding values. What we can also do in some document database systems is have some documents contain data that other documents don't contain. In this example, the Ice Cube document has a city attribute recording where Ice Cube was born, but my Wu-Tang document doesn't have this information. This shows that, since a document is supposed to be self-contained, you can define exactly what attributes it has, specific to just that one document, and other documents within the same collection or table don't have to have those attributes. This is what's called a schema-less database system: you don't have to define ahead of time exactly what attributes every single record has; it's left to the application developer to decide what to store. Another touted advantage of the document data model is that it supports nesting of documents.
Remember that I said the attribute values in the relational model have to be atomic, primitives or scalars. In the document data model, what I can do is combine together all of the information for a single entity in my database into a single document. To use an example: for the artist Wu-Tang Clan, I can define in my document all the basic attributes I had in the original high-level artist relation, but now I can have inside of it a sub-document with a list of all the albums the band put out, and with each album, another sub-list of all the tracks that appear on that album. The idea here is that if I were doing this in the relational model and I wanted to get all this information, the albums from the Wu-Tang Clan and all the tracks, I would have to either write three separate queries to get them from the different tables, or do a join across three tables. This idea of condensing or collapsing all the information for a particular entity into a single object is called denormalization in the relational model, and we'll cover that two lectures from now. So this is not a new idea; it's been known since the 1970s. Furthermore, document databases aren't really a new concept on their own. Yes, JSON is new, so having a JSON database is somewhat novel, but the idea of storing objects like this is equivalent to storing things as XML, and there were XML databases in the early 2000s, and there were also object-oriented databases that stored objects like this directly in the database as well. I don't want to get into the debate now; we'll cover whether denormalization is a good idea when we talk about normal forms and schema refinement. In some cases it makes sense; in some cases it doesn't.
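The nested-document idea above can be sketched as one self-contained JSON object holding an artist, its albums, and each album's tracks. The field names and data are hypothetical, modeled loosely on the lecture's running example.

```python
import json

# A denormalized document: artist, albums, and tracks collapsed into one object.
doc = json.loads("""
{
  "name": "Wu-Tang Clan",
  "year": 1992,
  "albums": [
    {"name": "Enter the Wu-Tang", "year": 1993,
     "tracks": [{"number": 1, "title": "Bring da Ruckus"}]}
  ]
}
""")

# One lookup walks the nesting; no joins across artist/album/track tables.
first_track = doc["albums"][0]["tracks"][0]["title"]
```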
In this particular example, we talked about how this database has a many-to-many relationship from artists to albums. So if we denormalize everything the way we're doing here, that means we end up duplicating the album information for every single artist that appears on it, and then we get issues like: how do we keep everything in sync? If we have to change the name of an album, how do we make sure we update all our records? There's a whole bunch of those issues that we'll deal with when we talk about denormalization later. I'll also say that although the original relational model definition doesn't allow for this kind of stuff, most relational database systems nowadays, at least the major ones, support storing JSON attributes or XML objects directly inside a relational table, so you can do this sort of thing. In terms of what systems are out there, probably the most famous document database system is MongoDB, which is essentially a JSON database. Couchbase and CouchDB also do JSON, and RavenDB runs on Windows. In terms of XML documents, MarkLogic is one of the major XML database systems; it came out in the early 2000s and is still around today. And as I said, most relational database systems allow you to store documents like this as attributes. All right, so the last data model I want to cover is called the column family. This one is sort of tricky to understand because it's a hybrid data model. The way to think about it is that it's like a key-value store, but the thing the key gets mapped to is a column family. A column family is a collection of rows where each row has one or more column names mapped to values; it's a multi-tier data structure.
So to give an example, say I have two records in my database: the Wu-Tang Clan and Ice Cube. The first part here, for both of these, is the row key, and this is what you use to do the mapping; the value it gets mapped to is the column family. You see that what we have is a set of column names mapped to values. This again looks like the relational data model, but the difference is that, just like in the document data model, where every document within the same table or collection didn't have to have exactly the same schema, in the column family model you have the same benefit. So in the first record I have year and country, but in the second record I have year, country, and city. Because these are separate column families, this is permitted.

As far as I know, the column family data model first came into prominence when Google put out their Bigtable paper in the mid-2000s, and Bigtable was the first system that did something like this. Bigtable was used internally at Google for a long time and wasn't available to the public; they never released it as open source, but now they offer it as a cloud service, so you can actually run on Bigtable and get the column family stuff. It's also used in probably the two most widely used NoSQL systems of this kind, Cassandra and HBase. Cassandra was actually written first by Facebook, and they ended up not using it at all, so they just put it out there, and then people picked it up and started improving it, and it's widely used in a lot of places. HBase is an operational database system on top of HDFS, the Hadoop file system, that uses the column family model. Accumulo is a clone of HBase or Bigtable that was written by the NSA; then they open-sourced it, and a company was founded based on it, which is
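This two-record example can be sketched as nested Python dicts, with a row key mapping to a column family that in turn maps column names to values. The data shown is illustrative only.

```python
# Column family model as nested dicts: row key -> column family,
# column family -> {column name: value}.
music = {
    "wutang": {"year": 1992, "country": "USA"},
    "icecube": {"year": 1986, "country": "USA", "city": "Los Angeles"},
}

# Look up one row by its row key, then one column inside it.
city = music["icecube"]["city"]

# Rows in the same table need not share a schema: the first row simply
# has no "city" column, and that's permitted.
wutang_has_city = "city" in music["wutang"]
```

The lookup always goes key first, then column, which is what makes this a hybrid between a key-value store and a tabular model.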
actually kind of cool. So the major thing I've got to say is that all of these can be represented in the relational data model in one way or another, and a lot of the arguments that the NoSQL movement made about why the relational model is a bad idea and why their approaches are better are, I think, not accurate at all. I think they're making the same mistakes that the early opponents of the relational model made about data independence from the application code. They're also overlooking the freedom of a relational database system to store your data in any physical way that it wants, which can get you the same benefits you're getting from these different data models, but without forcing the application to store the data exactly as the data model specifies. Again, we'll cover that later on when we talk about physical storage. But the nice thing about SQL and the relational data model is that it's very flexible, and you have that data independence that allows you to change one versus the other without having to rewrite your application or reload the database.

I'll also say that a lot of these NoSQL systems are now supporting some dialect of SQL and the relational data model. For example, in the case of Cassandra, the recommended API is now to create tables using CQL, which is basically something that looks like SQL, and they hide from the application programmer exactly how the column family stuff works. So again, the NoSQL guys were in vogue about 10 years ago, but then everyone came back around, recognizing that the relational model was the right way to go, and they're adding support for it.

All right, so to finish up here: the purpose of this lecture was really just to show you guys what relational algebra is at a high level. We defined the different operators that we can use to process queries on a
relational database, and this will be necessary when we start talking about query planning, optimization, and execution in a couple more lectures. Essentially what will happen is we'll take a SQL query as input, that'll get converted to a query plan tree comprised of relational operators, and then we can manipulate that query plan tree to produce an efficient query plan. Relational calculus is only necessary if you want to design your own query language, which for our purposes we don't need to do.

All right, so for next class I'll be covering advanced SQL. The idea here is that rather than teaching you basic SQL queries, for which there are a ton of tutorials on the internet, which a lot of you guys already know, and which are also covered in the textbook, I'd rather take a more accelerated view and cover the more advanced things you can do in SQL that may not be in the textbook, where we can walk through examples and see these kinds of things. And then we'll give some demos in Postgres and see how it actually works.

So as I said, I apologize for not being on campus, but I'm heading back in a couple of days, and from this point on all the lectures will be in class, in person. All right guys, take it easy. I gotta find Joe and get some dinner, and we'll head out. Take care.
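The pipeline described above, where a query becomes a tree of composable relational operators, can be sketched in miniature. This is a toy illustration over lists of dicts, not how any real engine is implemented, and all table and attribute names are invented.

```python
# Toy relational algebra operators over lists of dicts.
def select(rows, pred):           # sigma: keep tuples satisfying a predicate
    return [r for r in rows if pred(r)]

def project(rows, attrs):         # pi: keep only the named attributes
    return [{a: r[a] for a in attrs} for r in rows]

def join(r_rows, s_rows, attr):   # equi-join on one shared attribute
    return [{**r, **s} for r in r_rows for s in s_rows if r[attr] == s[attr]]

artists = [{"artist_id": 1, "name": "Wu-Tang Clan"},
           {"artist_id": 2, "name": "Ice Cube"}]
albums = [{"artist_id": 1, "title": "Enter the Wu-Tang (36 Chambers)"}]

# Roughly: SELECT name, title FROM artists JOIN albums USING (artist_id)
# WHERE name LIKE 'Wu%'. The nesting of calls *is* the query plan tree,
# and an optimizer is free to rearrange it (e.g., push the selection
# below the join) without changing the result.
result = select(
    project(join(artists, albums, "artist_id"), ["name", "title"]),
    lambda r: r["name"].startswith("Wu"),
)
```

Because each operator takes relations in and produces a relation out, the operators compose freely, which is exactly the property the query optimizer exploits.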