 Those of you who chose to come in person instead of watching on TV. Welcome. It's nice to have you sense of solidarity. It's good I know the homeworks due tonight and everybody's working hard. I'm sure so good luck with that And don't forget that you know, you got to get the get in a can tations exactly right this time So leave yourself time At least an hour. I would say to mess around with git and feel like Yeah, you understand how it's working and you're submitting right Maybe peel off one of the team to go look at that before things get urgent I'd hate to see people lose points because they didn't submit properly, right? But it's important and I want to make sure you guys do it correctly and not burden the TAs with 27,000 exceptions. Okay, I Realized at the end of last lecture as we were peeling through the summary that there was a really interesting little tidbit Embedded in the conclusions that somehow didn't make it into the content of the lecture and this last line on this second to last slide We talked about B trees and isams and all that good stuff And then you know one of the things we spent a little bit of time on last time But not a lot of time was how this relates to the previous lecture's discussion of alternative one alternative two and alternative three Indexes so let's take a minute and review those and then I want to just kind of hammer home this point in yellow Okay, and you may see where this is going, but it's worth the review So remember an alternative one index which could be a B tree, but it could be a hash index or something But I'll draw it as if it's a B tree the data the actual data is in the leaves of The tree is In the index okay, and in a B tree that means the leaves of the tree So in our B tree right there's going to be this Tree and then there's the bottom level of the B tree with a linked list of pointers, right? So this is the leaf level of the B plus tree Remembering to be plus tree the data the true data is always in the leaves and in an alternative B plus tree These data items that are in the leaves are actual records in the database These are the primary copies the only copies of the actual full tuple Okay, so if you're looking for students with by GPA you can look them up by GPA But the actual student record is here in the leaf of this GPA B tree All right, so this might be a B tree by GPA, but the full tuple is here in the leaf and only here, okay That's alternative one So if you have another index by say student ID It would not be able to be alternative one. It would be alternative to and it would have pointers into This B tree all right because that's where the tuples are it would have record IDs Which are page number slot number that reference leaves of the B tree? Make sense So that's alternative one alternative to index contains key comma RID pairs In a B tree they would be in the leaf so in the leaf of the B tree in this picture We might have a B tree on Last I don't know email address, okay of a student and then in the leaves of the B tree We'll have the keys that is to say the email addresses and a record ID pointer to somewhere else in The database now in the pictures in the slide the somewhere else was almost always a heap file And I drew it as a big wide rectangle at the bottom of this B tree There's an additional level which is say a heap file here, okay, but it doesn't have to be he finally could be Another index that's alternative one right okay, so these record IDs They'll all go to the same place because the the actual relation is only stored one place that place could be a B tree It could be heap file and then alternative three is a variant of alternative to which is key comma Set of RIDs Which deals with duplicate keys a little bit more compactly, right? Okay, so that's the reminder having seen this what I want to emphasize is this last bullet now If your data entries are data records, which is to say alternative one so your B tree Your data entries are data records. What happens when you split a leaf page the tuples move around Which means that their page ID changes which means that their record ID changes because record ID is page ID comma Slot number, right, so let's remind ourselves of that RID Is of the form page? number Slot right, so if you split a B tree in alternative one things change page numbers Which means the record ID is changed which means all pointers into that alternative one B tree have to be updated upon the change Which is expensive? All right, if you have a lot of alternative to indexes pointing to an alternative one index when you grow that alternative one index You update all the other indexes to fix the record ID pointers, right? It's just a lot more Ios upon update All right, and note also that the insertion you do into that alternative one tree affects not just one record It affects many records because when you split the page many records change record ID And those records let's say they're all people with GPA 2.4 Right the emails for those people are not alphabetically Co-located right so when we go to the email index to update all the pointers to the GPA 2.4 students They're gonna be bouncing around in this email index looking them up, right? And they're gonna be in different places, so you're not gonna get me cash locality in the email index when you update the GPA index and split a page, so it is kind of costly it could be many many random Ios If you change a record ID in a B tree, so it's it's it's a design decision that can bite you Update cost even though it might give you more efficiency if you search into that index very often, right? The benefit of alternative one is you don't have to follow yet another pointer once you get to the bottom of the index But the cost is on splits you have to do more updates to other indexes Got it So all that was you know hiding in the last bullet of a summary slide I apologize Okay, the rest of the summary slide is summary So key compression we learned last time will increase our fan out which could decrease the height of the tree Which is a big win We learned about bulk loading it's much faster than multiple inserts And then to emphasize one more time B plus trees are widely used because of their versatility They handle range queries as well as point lookups And it's worth saying that there really are the workhorse at the bottom of the database if you have the kind of system That does a lot of small lookups Okay, the workloads that that happens for are typically what are often called transactional workloads like people go to a website and they Do a search for a specific say product name or they do a purchase and you have to record their purchase Those are small lookups This is very different from what are called analytic workloads We're like you look at the log of all purchases people have done because you want to build a recommender system And you want to know what people gather what things if you're doing a big aggregation query select star from table Group by this then you look at all the tuples you'll use sequential scans typically But if you have a workload that's mostly like you know update the amount of widgets We have in the inventory with number 472 like a purchasing system You'll typically use bee trees or something like them quite heavily so in these transactional systems Bee trees are a real workhorse of the system so they get tuned a lot and in modern days What that means is that you may find a lot of hackery in the system on the main memory representation of a page So that say binary search on the page goes really fast You may find a lot of hackery that we'll learn about later in the semester in some detail about how to do Concurrency control on the bee tree making sure that multiple people can be wandering in there Some of them doing insertions and deletions while others are doing searches right, so there's a lot of a lot of optimization that will go into your index bee tree package and with Flash technology and with a lot of things being in memory and then worrying about cash Locality on the processor you'll find in the field that these things get to be pretty tightly tuned And we won't get time to cover a lot of these tricks in class One more comment along those lines is that for systems that are geared towards main memory databases where the data really should never go to the disk except for failure recovery There are a bunch of systems now where that's totally feasible because all your data could fit in RAM Bee plus trees get changed around typically in those implementations And they don't look exactly the way we learned them here because some of those overheads about disk accesses Make less sense when you've got an in-memory data structure So keep in mind that the database field and the software artifacts people are building a piece of it is shifting to in-memory implementations And some of the things we learned in textbooks about databases and disk blocks May still make sense in memory and may not and so we won't probably maybe at the end of the semester We'll get to do one lecture on like something fun with modern main memory databases from the last like two three years Probably we won't and I'll just flag this for you So when you go out in the field and you talk to people you'll be aware that your textbook database class is being revisited as we're looking at in-memory databases There's been actually some cool work on this at Microsoft actually so Microsoft research did a lot of interesting work on this That's being rolled into their product into SQL server and some of that is production now and Some impressive things so I can if people are curious Post a piazza that you'd like to see links to research papers and I can point you to some of that stuff Okay, if I don't hear from you, I probably will forget All right So much for Beatrice So we're gonna shift topic fairly significantly now and shift sort of flavor One of the cool things about database as a field as a technical field is it's kind of a microcosm of a lot of computer science Because really if you say well, I'm the part of computer science that worries about data You have to do all the all the computer science You know because like data is everything at some level and so these systems are microcosms of a Technical community that's had to think pretty full-stack about a lot of issues in computing And that means that this class tends to have a lot of different topics that have pretty different fields around them So what we're shifting to now is a conversation about languages All right, we're shifting away from a conversation about performance and implementation We're gonna shift to conversation about domain specific languages and semantics Okay, so it's gonna have a very different flavor. You may like this better. You may like it worse But it's just different. All right to give some context in the context of our system stack You know, we were down a level into files and access methods and we've been working our way up, right? And we started with relational operators like joins and things like that And so we've filled in to some degree an implementation of the whole bottom of the system below the query optimizer We've covered everything. Okay, so that's kind of cool like right now. You could build a single site database Where you have to program it in the iterators Which is sort of like single it's sort of like MapReduce really you've sort of learned The crux of that how to build a iterator based data flow system, right with joins and aggregations and all that What we're gonna do now is present a formal language for those relational operators things like join and group by and things like that That we've learned how to implement will now introduce a language to talk about it. Okay, and you might ask Why would I want to do that? Okay, we'll talk about that a little bit, too So relational query languages were the topic of much study in the 70s I would say and sort of got baked back then although People keep wanting to revisit them in the context of XML or JSON or whatever's the next file format that comes along But the guts of it end up looking a lot like relational languages with some extra secret sauce Okay, so this is the foundations of of the various languages that keep getting invented and reinvented So like if you learn the MongoDB query language Knowing this stuff will help you understand that very quickly It's like it's not like it really changed that much when we moved into JSON objects All right, but query languages. What the heck are they there domain specific languages for dealing with data for manipulating and retrieving data from databases Okay, and after all there is really nothing else to do in computers, but manipulate and retrieve data at some level You can you can disagree with me and we'll see but like you probably don't want to do go back and do your 61b homeworks in sequel That'd be kind of weird But we actually will make you do a little of that in your homework for this class Because it's actually quite doable and the thing about implementing it in Java is okay parallelize that man like I don't know I don't know how to paralyze my 61b implementation of Some data structure quicksword or something like that's right because you wrote it in a sequential language If you wrote it in a relational language, it would paralyze sort of automatically All right, so there is issues of scale that make these languages interesting For things that aren't like employees and departments and students and boats and reservations Okay, so we'll get to that a little bit, but query languages are very expressive actually some of them relational query languages You know come from a foundation in in first-order logic So the guy who invented this stuff was a cranky British mathematician named Ted Cod And he sort of came into this field where IBM had all these products with iSams and stuff And he's like this is all just set theory. It was ridiculous. Let me just write it down And so he invented these little languages that were just based in formal logic, and they're very simple And I think by our standards today. It's like obvious undergraduate material But in 1969 people are like no one could understand that it's got like upside-down a's in it backwards ease That's fine It's like generational right at the time it was perceived as being very mathematical. I think today. It's like it's pretty straightforward Anyway, so strong formal foundation. There's power in that formality though as we'll see So they're simple languages. They're crafted just to do enough for queries and that's an interesting debate over time How much is enough in one of these languages? So we won't get to talk about enough in this class probably, but there's they're designed to be simple small domain specific languages For data that are powerful in the things they can do for data All right The other thing about having formal semantics is they're gonna allow us to do a lot of optimization of your queries And that will come up over and over in the course of the class We'll be spending a fair bit of time on how we translate from these language statements to execution strategies via optimization So I sort of I threw this slide in last night because I was feeling sheepish about shifting you from B trees to relational algebra but You know why bother with this formalism after all we already have physical data flow I taught you about iterators you learn to map reduce in 61 B, which is you know just the way of applying opaque lambdas Right. I don't know what's in your functions, but I know I can map them or reduce them So why do we need anything more than that and at some level? Maybe you don't And in fact about three or four years ago if we had a conversation about map reduce and Hadoop in this very room because it was being Taught at Berkeley or in the industry people would say you don't need anything more than map reduce map reduce Let's me do everything I want look here's word count Here's a machine learning algorithm, and I've implemented seven things in my produce. It's fine But actually even in the three years that have passed most of the industry is saying boy You know finding programmers to write things in map reduce is really expensive and you know what they're really slow When they do it they write all this Java and then we have to debug it Can't we just like get sequel back, please? And so there's a lot of Companies and open source projects that just put SQL on top of map reduce which is what spark sequel which you've been hacking this week Is okay, so in fact, yeah, you often do not want to program at that physical level It's just too detail-oriented for a lot of very straightforward tasks that are well suited by a relational language So the other thing though besides convenience That's really important is this notion of semantic transparency if you write a map function and give me f right you say map And here's f the function. I want you to run. I don't know what f is. I'm not supposed to know what f is right? It's opaque That's fine, and there's a certain cleanliness to that separation of concerns But it makes it very hard for me the system executing your code to optimize anything because I don't know what f does what we get in a language that's More suited to the domain and more semantically rich is we can understand The details of what these expressions are doing and then we can play games under the covers to make them go faster Okay, so having this transparency of what is the user trying to say and not making it opaque But making it very clear in the language allows us especially when you have a little language like a relational language like relational algebra to do pretty rich program analysis and That program analysis will help us with query optimization for sure We're getting the user to specify their intent in a way that the system can understand But there are other topics that that analysis actually helps with that We won't cover in this class that are really super useful in practice and that like the Functional programming and map reduce community is kind of sneaking up on all over again things like data lineage if you get an answer Where did it come from? Why do you have it? So there's an answer to one of your queries Where did it come from well if the query was written down with a whole bunch of black box code in Java? That's a really hard question to answer. Why did I get a 17 in my answer? I don't know we can vote your code man Give me a 17. What more do you want for me? If you say it in a relational language? We might be able to say the following tuples from the database We're put together through joins and selections and projections and aggregation to produce that answer Maybe you want to go look at those tuples or maybe you want to look at the operators along the way that put those Tuples together so we can get pretty rich data lineage by inverting these queries because we understand their semantics Similarly things that we won't be able to talk about yet, but we could talk about later in the semester around Being able to store the outputs of queries So a view is just a query a materialized view is a query whose output you have saved So that later on if you ask the same query, we've already got it answered Well, what do you do with that thing when say someone updates the database now the output of the query should change I want to change that stored output without rerunning the query from scratch. How do I do that? So again, if we understand the query in detail, we can often find a delta query to run That just makes a little bit of work to change the output based on the change to the input Can't do that if it's expressed in a black box function f All right, updateable views. I have the output of a query I want to change the output of the query and then I want you to figure out how to change the database To reflect that the output of this query changed. So how would you make it that the answer to this query is different? I'll give you an example of that I Did a bunch of stuff with my database that they were stock transactions, right? And I lost a bunch of money. I would like to say Make me money on the answer to that thing and have you figure out how to change history? So that you would have chosen better stocks or it would have your stocks would have had better prices or something You want to go back to the source data and change the source data to get the outcome you want it? It's a little bit of a silly example because of course, it's the real world and you can't go back in time But you sort of get the point So anyway, these are topics we won't cover in detail But having a rich formal language enables a host of goodness, okay? So even though this is could be potentially tedious honestly for the next half an hour A lot of good comes out of having a formalism, right? And some people love formalism. Maybe you will find it not tedious but lovely Okay Just to be very clear the standard viewpoint is that query languages are not full-service programming languages They're they're little languages their domain specific languages just for data processing and we don't want a Turing complete language exactly because That would be too hard to analyze I want a little language that we can do good analysis on that's just enough for what I need So typically they're not Turing complete or at least not without calling out to some other language like you can invoke Python from Postgres's sequel, but that's outside the language sort of just a call out, right? And they're not intended typically they weren't that they're viewed database language were traditionally viewed They're not calculation languages. They're not general-purpose computing languages in the sense of I would like to compute some things Their languages for data, right? So that's the traditional viewpoint But the funny thing for the reasons I was intimating earlier is that in recent years Anything you want to compute of interest involves a large data set, right? Almost all of you will have a job in an environment where there'll be a big data set in the background somewhere That's driving that thing even if you're in mobile, right building mobile apps the thing in the clouds got lots of data It's pretty interesting so it's like everything interesting computing involves lots of lots of data and lots of machines and Actually query languages with extensions are quite powerful for expressing algorithms at scale in the way I described like Paralyzing Java code very hard paralyzing sequel quite easy In fact, we already saw how to you know really briefly and we'll go back to it But we talked about how easy it would be to paralyze Hashjoin or partition, you know or the sorting algorithm we learned right and that stuff is it's really almost that easy They're scheduling problems, etc. But it's fundamentally kind of easy. So Things like query languages like MapReduce, which is data flow underneath the query language These are all very valuable when you're trying to scale up And so query languages are kind of having a renaissance as maybe this is a good framework for programming many things Okay, in fact, and this has been a focus of work in my group They could even be an attractive choice for thinking about not even Large-scale analytics jobs like MapReduce, but little asynchronous programming environments like programming user interfaces Which are kind of event driven right or programming little protocols and networks Actually, you can think about those as streams of data going through little joins and stuff And some of these problems turn out to be really pretty and easy So we have a language called bloom which I encourage you to look at for doing distributed systems in a query language Kind of style a much more sort of applied language That's used in industry quite a lot is Rx which came out of Microsoft originally so like your Netflix browser is implemented in Rx What else is there stuff at Facebook and Google? I think that's implemented in Rx as well become pretty popular Microsoft open sourced it and the guru of Rx left Microsoft And it's kind of running around showing it to people so Rx is a language of you know sort of data composition It looks a lot like relational algebra for streams, and it's used for GUI programming largely right now so There's a fun article you can dig up actually in in the CACM, which is the main magazine of computer science It's called my mouse is a database That's by the Rx guy, and it talks about how he converts streams of inputs to the screen Into essentially things you want to query And that's how you program UIs so You're gonna learn about good old-fashioned query languages, but be aware that this may become a bigger part of your future than just databases All right There's two classical languages that Ted Codd invented in the late 60s early 70s one is the relational algebra and the other is the relational calculus and They're different in a way, so the relational algebra is kind of an operational language It tells you how to do a sequence of events very much like iterators and data flow that we've seen all right So it's useful for representing kind of execution plans, but representing their semantics not their detailed implementations The relational calculus is a purely declarative language. It's it's a logic language It actually has for alls and there exists and good stuff like that So it looks a lot like first-order logic and it says nothing about what order to do things in so you describe only what you want You do not say anything about how to compute it, right? So that's what we mean by a declarative language Right and the syntax of the relational calculus Describes the output of the query that's exactly what its meaning is Which is kind of interesting it doesn't describe a computation it describes an output And the relational calculus is essentially what SQL is SQL is Roughly a declarative language along the lines of the relational calculus By way of many English words And some mixed metaphors so but it is kind of the foundation for relational calculus and Ted Codd won the Turing Award back in the 70s for essentially Proposing that the relational model was good for databases defining these two languages and proving that their expressive expressivity is equivalent Let me say that again their expressivity is equivalent anything you can say in the relational algebra It's not Turing complete so you can't say everything but anything you can say in the relational algebra You can say in the relational calculus and vice versa It seems like a pretty academic exercise But the thing that it gives us is it allows us to take SQL queries Which just say what the output should be and don't say how to do it and compile them into relational algebra queries Which are pretty close to query plans So the foundation of how a database system is built where you start with a declarative expression in SQL you compile it into iterators comes from this theorem essentially that they're equivalent So in a classical relational database class, I would teach you both these languages And we would at least make an argument maybe a proof that they're the same I'm not going to teach you the calculus because it's really not very useful You won't pick it up again by nature ever So I won't teach you the calculus I am going to teach you the algebra because it's a sort of cleaner way to think about query plans And of course you'll learn SQL which will be a stand-in. It'll be the declarative language We will learn as SQL so today's the algebra though today. We're gonna learn the algebra so some preliminaries What's a query? Well the inputs to a query are relation instances. Remember this terminology There's a relation which is a description of a table. It's the schema and the instance the instances the contents Okay, so the input to a query is an instance. It's a data. It's a it's a database table with actual data in it particular state of the database, okay The result of a query is also a relation instance, right? So it's a closed language it takes in relation instances and it puts out relation instances And that allows us to compose the operators of the language one on top of the other So the schemas of the input relations are fixed So, you know your query is going to have certain input Relations and they'll have fixed schemas the schemas of the outputs are also well-defined and fixed So it's a strongly typed language in that sense you give it input tables with schemas There's a well-defined output table schema as well And you can figure out the output table schema from the query language syntax and we'll see this All right, this is just the very first place where we have a contrast with what you're used to in map reduce Map reduce right you get keys and values in and you get keys and values out But there's no way to know like what kinds of keys you might ever see what are the types of the values? All that kind of stuff is left up to the programmer It's very free which is nice, but it makes it very hard to analyze here We'll know exactly what the types of the inputs and outputs are now the pure relational algebra Which is what we'll talk about today has set semantics, which is to say in a relation There are no two rows that are exactly the same All right, if you try to put a second copy of a tuple into a table It's like it doesn't matter because you already have it is one way to think of it But there's no duplicates in these tables by contrast SQL has a multi-set semantics Which is like marginally more confusing to think about and we'll just not do that for today Okay, so today we'll stick with the classical relational algebra when we do sequel We'll have to worry about duplicates and duplicate, you know maintaining the right numbers of duplicates through Operations today we don't have to worry about Sometimes actually it's gonna cost us to have to get rid of duplicates, which is gonna be a drag All right, there's only five methods operators whatever you want to call them in the relational algebra Selection projection cross-product set difference and union Each operation returns a relation So you can compose these together you can string together sentences, you know functional expressions of these operators All right really briefly and we'll go through them in detail But selection selects a subset of rows it kind of allows you to make horizontal decisions about rows in the table Projection retains only desired columns, so it's choosing which vertical columns you want Cross-products gonna let us combine two relations and concatenate their results and then set difference in union are your familiar set Operations, but over relations rather than over set items Okay, and let's go through them in detail In this lecture as in the book we will have the very exciting boats database Okay, so the boating club has boats, which is a table that has BID B name and color It has a reservations table with an instance R1 All right, so R1 is the state of the database today. There's two records in it And that's the reservations table essentially and then we'll have two different instances in our examples Just so we can see some variety of the s relation, which is the sailors relation All right, so s1 has three tuples s2 has four tuples It's missing Dustin and it added yuppie and guppy Okay So this will give us a little variety to play with as we go All right first operator projection pi the way you write it is pi subscript List of column names you want to keep so in this case pi sub age of a relation s2 says please produce a Table in the output that has the column age from the input You can have more than one column you could say pi sub s name comma reading All right, and it just basically retains only the attributes of the input that you want that are in that list So the schema of the result is apparent from the syntax, right? It's the column names that are in that list with the same data types as the input columns of those names It's very straightforward. We know exactly what the types are and the names actually of the output columns based on the input columns All right the projection operator however because we said we're good We're not gonna have duplicates if you implement this you have to actually eliminate duplicates because when you do projection you introduce duplicates We'll see this in the next slide Now in practice real systems don't do duplicate elimination Unless you ask them to why would that be? Suppose well, let's come back to that question in a second. Okay, let's look at an example We're gonna do projection on the upper right of s name rating on s2 So here's s2 on the left s ID s name rating age Here's the projection on the upper right of s name comma rating. All right, you'll see we only get those two columns That's name and rating. Here's the projection to age Not only do we only have one column now we only have two rows because we remove the duplicates All right, so the output may be fewer rows than the input because a duplicate elimination Okay, suppose I gave you a table with a million rows and it had a column called gender What would be the projection what would be the size of the relation projected to gender? Maybe two or something depending on kind of your The way you view the world, but it would be a small number. All right Certainly wouldn't be a million unless we're really an advanced age and so I guess what I'm trying to say is that you're gonna do a big change from the input to the output what is Why might you not do that if you're building a system by default? Why might you not do that the question from the previous slide? Yeah, awesome So if you know this if you don't eliminate duplicates Then the the number of rows in the output is equal to the number of rows in the input and that might be useful for a bunch of planning Purposes, maybe you need to figure out how much memory is going to be used by the next operator Which might be a sort and we need to know if it's square root of b big or bigger Actually, it's b squared big or bigger, right? So we can do two past sort So we'd like to know how big the output would be next time and we know for sure if we don't eliminate duplicates It's the same size as the input. That's a great point What might be another reason we might not eliminate duplicates by default? Well, how would we eliminate duplicates? What algorithm might we use? Sorting might use sorting or we might use hashing, right? Both of which are expensive, right big table if to sort it or hash it It's gonna cost you two passes of the table at least okay, so it's expensive So you don't want to eliminate duplicates if the user doesn't care Okay, so by default in most systems you wouldn't implement duplicate elimination unless they ask But in the pure relational algebra you have to implement it okay to get the right answer All right fair enough. I just want you to sort of focus in on that That's it's gonna affect the sizes of the outputs of our table. So projection We know the widths of the table. We know the types. We know the names, but we don't know the height of the table It's smaller than the input, but we don't know how much smaller Okay So that's projection selection selects rows that satisfy some selection condition, which is a Boolean expression Okay, true false expression and the result is a relation with the same schema because we're just picking rows out those rows will Look just the same at the output as they do in the input So the schema of the output is the same as the schema of the input. Do we need to do duplicate elimination for selection? in order to maintain our No duplicates rule anybody How many people think that we need to do duplicate elimination after selection? Please raise your hand How many people think we do not have to do duplicate elimination after section? Maybe it's an obvious point. Okay, fair enough We're in essence throwing things away, right? We're not going to select all of it But the most we can select is everything and everything didn't have duplicates So something less than everything certainly doesn't have duplicates, right? So selection can't generate duplicates Here's an example rating greater than eight. All right It's gonna not choose that row and it's gonna not choose that row and it's gonna choose all the other rows What it what remains from the selection is clearly a subset of what we had what we had had no duplicates Okay, here's a query that takes that selection rating greater than eight and projects it to s name comma rating All right, so it's gonna throw out those columns leaving us with Those two things right and so you can see how a combination of selection and projection can let us zoom in on certain cells So to speak of the table right and put them together With the ability to cross out rows the ability to cross out columns All right union and set difference are these set operators, but they're out of relations now not on sets And so in order to make sure they make sense those relations have to have the same schema more or less Right, you can't take the union of something with 12 fields and another thing with three fields It's like you'll end up with a relation that's kind of got ragged size two poles and it won't be relational anymore It'll be oh my god XML. So we're not gonna do that. Okay, not yet We do an XML if you want but it gets ugly so We're gonna say that two input relations have to be union compatible Which means that they have the same number of fields and the corresponding positions position one of this table position One of that table have the same data type or maybe a union compatible data type So when you get into details in an implementation, you might say fine We can union integers with floats and we'll promote the integers the floats Okay, so you can you can play some games like that, but you can just for simplicity Let's say that the types have to be the same. We don't have to have the same column names All right, that's the one thing will be will be a lenient about All right, so pretty straightforward the union of s1 and s2 here is that Note that s1 has three rows as two has four rows and s1 union s2 has five rows Because of duplicate elimination. So we do have to do duplicate elimination after union, right? set difference as You would expect is going to remove the things from the right-hand side that are in the left-hand side. All right So s1 minus s2 all the things in s1 that are not in s2 s1 is smaller than s2 That's okay, right? There's nothing wrong with that. We're going to take away from s1 those things that are in s2 So the output is going to be a subset of the left-hand side, which is s1, right? Make sense So it's perfectly possible to have the cardinality of the right-hand side of set you set difference be a Bigger and super set of the left-hand side. That's fine Yes Right, so the question was is there some notion of like if something's in the right-hand side Is it somebody you get it like a negative somehow from the left-hand side? No because we're not doing counting Right, so all things can be as present or absent and if something's present in both s1 and s2 Then it's absent in the output if something's present in only s1 that it's in the output if it's present only in s2 It's not negative. It's just not in the output Okay, however when you do multi set union like in sequel then you start to do counting There's three of them here and there's two of them here so the output will have one stuff like that You still don't go negative All right, so we tend not to represent negative information in these in these things But good question. So for now though, we don't have to worry about about counting It's just either there or it's not and to be in the output. It has to be on the left-hand side Here's s2 minus s1. All right, not a lot of symmetry there. That's okay You know, they're just they're different Set difference is a funny beast actually so it turns out like this all obvious right blah blah blah set different set union We knew this you learn this like maybe freshman year. Maybe when you were in elementary school set difference is funny so Almost all the relational algebra in fact all the relational algebra operators except set difference are what's called monotonic What that means is that is the inputs grow The outputs grow another way to say that is if you take a query like over some tables So Q over some tables R1 s1 t1 of that are instances and then you take one of those and you replace it with a bigger instance R2 that has some extra tuples in it The output will be strictly a superset of the output on the smaller input So as we inserted things into R the result of Q got bigger Setwise that's that's the monotonicity property of these operators and that's true for everything except Set difference and set difference if we put more stuff in R1 s1 minus R1 We put more stuff in the right-hand side new tuples. They will knock out some of the tuples on the left-hand side So if you grow R and you choose R2, that's a superset of R1 Then s1 minus R2 is actually a potentially a subset of what it used to be because by adding things that we subtract away The result may have fewer things, right? So this turns out to have all kinds of interesting implications Simple one of this is when you implement set difference It's a blocking operator by which I mean you can't compute the answer to the query till you've seen all the tuples of R So you have to say get next on R s minus R, right? Say get next on R until you've consumed all of R and only then can you start producing tuples of s Just think about it if you have a tuple of s you're like ha ha I have a red pen if there's no red pens in R I can output a red pen When do you know there's no red pens in R only when you've seen all of R? Well, that could be kind of a drag if you don't really know how big R is Right, so what do you do on this side of the the operator? Well, you kind of have to wait until you do all the computing on this side of the operator All right, so you have to do what's called a blocking operator. There's no flow upstream till all of R has been read That's kind of a drag. All right, it gets to be a much bigger drag when we start using these kinds of operations for broader classes of computing like distributed systems or Interactive interfaces where suddenly the world kind of stops while you're waiting for something to come and That's something that's gonna come isn't to scan of a table It's like you're waiting for a network message and you don't know when you're gonna get it So if the ours were streaming in off the internet from agents You know We'd have to wait for them all to send all their stuff and that might take an arbitrary amount of time And we just have to wait Which thinks it turns out this whole notion of non monotonicity is really important And I won't go on much more about it right now But set difference is the only operation that has it and that non monotonicity baked into set difference anytime you see anything in your queries that feels like Blocking or it feels like removing things you're gonna use set difference to say it right. It is our one non monotonic operation cross product We've already started. I drew this first week of class. I think but right It's the Cartesian product of the space of tuples and s and tuples and r all pairs of s1 and r1 How many rows are there in the result queue r1 cross s1? the number of rows in r1 Times the number of rows in s1 right okay So we know the size of the output and we know the result schema the result schema is one field of s1 It's all the fields of s1 concatenated then with all the fields of r1 if it's s cross r Okay, so you take all the columns of s1 and you just can catnate out the columns of r1 and the field names are going to be inherited When possible and we'll see what happens when they're not possible So if you have a naming conflict suppose you're joining two tables and they both use the same column name Then you're gonna have to do something to rename the columns of the output So there's this little annoying a little extra operator for syntax called the renaming operator row where you can take an input Like the input here is here the second argument in this case It's an expression s1 cross r1 and the renaming expression which is the left of this comma gives it output name for the relation output and a Mapping of column positions the column names that you want to change So this is saying column one call it s ad1 column five call it s ad2 Okay, so you can always make sure to impose names when you have ambiguity All right, so here's a cross product of r1 and s1 We can catnate the schemas and we form all pairs of rows And you can see that because they both have s ID. We don't actually have names for the column one and column two three four five So we'll have to use a renaming operator to rename Now intersect we would have expected to see but we don't need intersect in the language intersect is just a Composition of operators. We've already seen okay, but it's handy. So let's define intersect with respect to the other operators So there are a bunch of operators. We like that would they're essentially macros on the basic five They add no computational power to the language. They're just shorthand macros you can express them with the basic ops Intersection, you know what it is right? It takes two input relations, which must be union compatible otherwise it wouldn't make sense and It returns those things that are in both relations, right? So what is our intersect s? How do we express that in terms of? The operators we've seen so far Let me give you a head start Maybe we can make it r minus something because after all it's going to be a subset of r, right? Anybody see what it is. It's our what do we take away from r? So all the things in our except those things that are Not in s right all the things in r except the things in r that are not in s Okay, right, so it's this yeah cute. All right. That's what it is So intersection of two relations pretty straightforward same schema the tuples that are in both right The join is also a compound operator. It's not part of the relational algebra technically All right, it's cross product followed by selection and for some variations of joining We might also make it have a projection for the what's what could call the natural join What's the natural join? We're just going to write the natural join is our bow tie s Our bow tie s which is the natural join is defined as well. Here's a way to think about implementing it This is not a definition. This is operational, but it's fine compute our cross s Select the rows where the attributes appearing in both relations have equal values So they have to actually match on attribute name like s ID from this table and s ID from this table So if you have two columns here that have names that both of them appear here Then you're joining on pairs of like s ID comma last name s ID comma last name All right, so you find all the matching column names and then you look for matching values in the columns with matching column names I'll show you this in a minute. So you select rows where the attributes that are in both relations have equal values It's kind of an implicit equi join on all of the matching column names And then we're gonna not keep pair, you know because we're matching like s ID here equals s ID there There's no point having two s ID columns in the output. They're gonna have the same value So we can drop one of them So we're gonna project all the onto only unique attributes with one calm with one copy of the common attributes This is a stupid algorithm because you form the full cross product before you do the selection We'd much prefer to do hash join, right? This is an equality join So we should do hash join or sort merge join or index that's the loop join or something. This is just a sketch So here's the natural join of r1 and s1 So it joins essentially the s ID equals s ID That's what the natural join is here that the matching column names are s ID so we'll look for matching s IDs and What are we gonna get we're gonna get 22 and the first thing is Dustin and 58 and the first thing is rusty So you should get two rows in the output and don't notice only one copy of the s ID column because we projected it down to just one of them Okay It's natural join the other types of joint theta join instead of having this weird matching name equality You can put any predicate you want there any conjunction or even you can do disjunctions and negations if you want Any Boolean expression over the columns that evaluates to true or false can be your theta All right, and we're just gonna compare up those columns between r and s based on that conjunction and produce the output So here's an example. It's actually the same query as last time Joining on s ID, but we've made it explicit as an explicit theta subscript of the join bow tie And there's the output note that this one doesn't do any automatic projections. You have to ask for them So theta join No free projections involved the result schema is the same as the cross product although of course Because of the theta there may be fewer rows in the output than in the cross product We'll often talk about equi joints, which are theta joints where there's only a quality predicates in conjunction with each other So, you know column one equals column two and Column three equals column three You know, it's that or it's that or so T1 Why do I only want to allow ands in my equi joints? Why not ours? Well, this is a little bit fussy, but if we only have conjunctions Then we can do things like use indexes with composite keys or we can have a hash function on the concatenation of this and this Concatenate those together and hash them and then this and this concatenate those together and hash them and use a hash Join when you're doing conjunction. It's pretty much like doing a single equality It's just that equality has little pieces to it when you do or it's actually an if statement It's like if you match on this you go to the output or if you match on that you go to the output And that's really a union query in essence It's like a join union another join and it doesn't fit into one joint algorithm You have to do the joint kind of twice-ish Okay, so we're only gonna call things an equi joint if they're a single equality predicate or a conjunction of equality predicates Example of an equi joint. Oh No more examples. Sorry, so now let's put together some queries And so these are usually examples of the whole language, I guess Find the names of sailors who've reserved boat one or three actually, let's take a stretch because As I said, this could be a little tedious and then we'll go through some examples So I want to go through a bunch of examples of turning English language queries into relational algebra queries It's gonna exercise our understanding of the concepts and I gotta be honest with you This is an easy thing to put on a test All right Because it's an easy way for us to tell if you're understanding stuff because you can just read off whether you got it right It Unfortunately unlike other programming languages you can't play with this one in an interpreter because I don't have an interpreter for relational algebra to give you And so you just kind of have to make sure you understand it. It's not a very easy thing to try There's no like try relational algebra calm that you can go and test it out in your web browser So I apologize for that, but you should feel responsible for being able to string together these operators If we forced you to program iterators like map reduce style and we gave you a join Reducer let's say and we gave you a selection and a projection and all these things we could give you programming assignments But that seems like a lot of work But that's essentially what we're doing. We're forcing you to program by stringing together these things I'll make one more point just to make this painfully clear when we write something like R1 Equijoint S1 And then let's project down to I don't know R1, you know SID and Select and do a selection with like, you know, SID equals 4 So that's a relational algebra query. You could write this as an expression tree, right? This is what put a parser would do it would kind of look like this Hey, guess what that looks a lot like a query plan doesn't it? So these are pretty close to the query plans with iterators that we had before they really really are okay It's just a you know functional notation or nested notation for trees and The only thing that's kind of missing from the relational algebra are things like sort or Unique where some of the iterators we looked at that have no Meaning to the output of the query in a set oriented sense So sort doesn't change the definition of the set at the output because sets don't have a notion of being sorted Unique doesn't change the definition of a set because sets don't have duplicates, right? So we learned some operators in iterators that are important for implementation But not important for the meaning of the query and they're not in the relational algebra But many of the things that we learned are in the relational algebra and you were essentially composing dataflow plans So if you want you can think about this as like MapReducer spark programming Okay, if you prefer you can think of it as just a formal language. Anyway, you will be responsible to be able to do it All right find names of sailors who have reserved boat number 103. We know the schemas of our tables. Here's one solution Find the reservations for boat 103 from the reserves table. So select BID equals 103 from reserves for those reservations join it with the sailors Now we've got the sailor names that are associated with those reservations and Then project to the sailor names Now note that we needed both tables because the reservations had the boat ID And the sailors had the name of the boat of the sailor, right? So the query asked about things that were in the reservation table and things that were in the sailors table We had to form the join even though the output only has fields from the sailors table, right? Okay, so sometimes you have to go through joins and stuff even though your output won't reference the tables you joined to Here's a different solution same query In this one first we join up all the reservations and sailors So we get a list of all the boat IDs and sailor names and everything that ever were associated through reservations And then we just pick out from that complete set the boat ID equals 103 and then we project down a sailor name All right, it does give the same answer and you should convince yourself that it does Which one do you think would run faster based on what you know about joint algorithms the first one? Why? Got it in the first one the selection on reservations is going to make the inputs of the join smaller Right fewer reservation tuples go into the join We know joins are expensive roughly You know some some multiple of the sizes of their inputs in terms of pages and so making inputs to join smaller seems like a good idea There's a big assumption in that statement though anybody catch the assumption there Well the assumption is that selection is cheap. What if selection was wildly more expensive than join? Then there'd be a bad idea to do it the first way to want to do it the second way, okay, so Most of the time in SQL selections really cheap you look at a tuple It's a map function first of all to be very clear right you look at a tuple You can decide just looking at that whether the selection applies or not so that's good You don't need a whole lot of memory. You don't need to buffer anything. It just flows through you can do your selections And only pass along the things that pass So that's good and in sequel the expressions in the Subscript of the selection are typically simple arithmetic operations However, when we're programming in like toolkit environments like spark or or you know modern databases with user defined functions He postgres is an example where you can plug user code into your selections You could have you know select from reserves where f and f is a python function Maybe that python function like goes and does look ups into you know Google Maps API Perfectly reasonable thing to do. I want to find all people whose address is within 20 miles of a McDonald's But all that mapping stuff is not in the database. It's gonna make API calls out to some remote site That could be super duper slow and then selection becomes way more expensive than join So it is possible that your selections will be expensive in this class. We'll pretend. It's the 90s. All right I actually did my thesis on this topic and that was in the 90s But we'll pretend it's the 90s, and I'm the only person thinking about this We'll assume that selections are cheap Okay, selections are free essentially as the data flows through memory You do a very simple piece of arithmetic and you just pass it along. No. Ios. It's free joins are expensive So we want to do selections before joints good That all makes sense that little rant Find the names of sailors who've reserved a red boat The information about the boat color is only available in the boats table But we need to know about reservations because I want sailors who've reserved the red boats And I need the names of the sailors which is in the sailors table So I'm gonna need sailors and reserves and boats and I'm gonna need to join them all together now. Let's do that We need just the red boats So we could push that selection in early and we'll join that with the reservations to find all reservations for the red boats Then we'll join that with sailors to find all the sailors who reserved red boats And then we'll project it down to sailor name and we get the names of the sailors who have reserved the red boats Okay, just chain in this stuff along Same reasonable Here's a more efficient solution We're gonna push some projections down So we're gonna take that boats table And we're gonna not only limit it to the red boats But we're gonna then throw everything about the boats away except for the boat ID Because that's the only thing we need to pick out the reservations for those boats And then once we get the reservations for the boats computed We're gonna throw away everything but the sailor ID from those reservations because I don't care what date it was or any of those Things I just need to know the sailor ID so I can join it with the sailor table and then the results of that will project down to S name so in some sense we added some projections Another way to look at this is we kind of pushed the projections down at the end We were gonna throw away everything but S name so what I did is I started throwing things away earlier In fact as early as possible What's the benefit here? Let's assume projection is Free or cheap okay, and what is it saving us on? Joint costs again. How does projection save us on joint costs the same number of rows going into the joint after a projection is before Yeah Exactly although you have the same number of rows They may be much skinnier rows after projection you can pack more of them on to a disc buffer paid disc buffer in memory Which means you do fewer Ios to your temp files when you're doing sorting or hashing or what have you right? So you always want to squash those tuples down even if you have the same number of tuples in fact there's the Whole sort of industry segment a bunch of research papers in recent years I want to call column oriented databases you may hear it here this when you go out in the field or column in our databases The big idea there which is pretty small idea It's just kind of they beat it to death was let's do projection as much as possible Like let's store the columns separately on the disc and only read the columns. We care about let's let's have projection go before scan essentially So we've already stored the tables one column separately and for the other So when you do a scan of three columns you just pick up those three columns from the disc You don't read any of the other data So it's less disc I owe and then all the operations along the way are going to try to do things with as few columns as possible All right And when you only have things in single columns you can even compress the columns And so that's the whole trick of column oriented databases is basically push that projection down even into the storage layer Question I'm so glad you asked that question so the question if for those who didn't hear maybe those in TV Are all these operations commutative and the more general version of that question is what's what are the rules for you know Playing with these expressions, and that's why it's a relational algebra actually an algebra Right a modern algebra is a syntax in which there are rules of things like associativity and commutativity of these operators And it's the specification of those rules that defines what you can do in the algebra So we're gonna work on that answer to that question when we talk about query optimization You're getting hints of it now clearly we can do something about moving selection before join clearly we can do something about pushing projection before join But the actual rules for it I'm gonna save for a little bit so we get a little more comfortable with the language and we're starting to talk about query optimization But those are exactly the right questions to ask commutativity and associativity are the basic things we're gonna want to know about And not everything commutes with everything Give you a heads up Find sailors who have reserved a red or a green boat Okay, well we saw how to find sailors who reserved a red boat, right? Find the red boats find the reservations for the red boats. You find the sailors who reserved the red boats Now I want red or green that should be pretty easy, right? I'm just gonna change the selection to say red or green I believe that works. I think it works. See if it works. The slide's gonna say it works So it must be true. Okay, if it's in the slides it must be true So we're gonna just for fun throw in a row here, so we're gonna take the boats table We're gonna do a selection where color equals red or color equals green and we'll give it a name temp boats So now we've got this thing which isn't really a table at all It's an expression called temp boats in relational land. That's usually called a view All right relation. That's not stored as a view. It has a name. It's called temp boats We can talk about it as a schema, but it's not a table. It's a view, right? I think in spark. This is called an RDD. It's a It's a it's a file of data that doesn't really exist and we can generate it when you need it by having its expression and its lineage Okay, so now that we have this view We can use it in a query. We can say find the temp boats and join them with the reserves and the sailors I project down to S names. So those are the red or green boats Either boats that are red or boats that are green join with reserves and sailors get the S name bingo That's gonna get us what we want. That is correct. Now. Let me ask you another question What about sailors who reserved a red boat and a green boat? Wouldn't it be nice if we could just take the previous slide and change the orange to and that would look like this What's wrong with this query? Yeah There are no boats that are both red and green All right, there are lots of boats that are red or green the red boats are red or green and the green boats are red Or green right, but there are no boats that are red and green because it's a relational model And you can only have one value in each color cell We're not allowing you to store multiple colors per cell Okay, so there's there's no like there's a red in there and a green in there. It's either one or the other All right, and so this actually returns nothing This returns the empty set. It's a well-specified query. It just happens to return nothing All right, and so we're gonna have to do something different. This is bad What operation do we want to use it looks a lot like the thing in the circle, but it's bigger Intersection right, so we're gonna do it. Sorry cut you off. We're gonna do a set intersection. Was that a question or an answer? There's an answer. Yeah, good. So we're gonna use intersect to do this. All right, so that's wrong to make that painfully clear So we're gonna identify sailors who preserve red boats sailors who resume green boats and find the intersection Now happily SID is a key for sailors. There's only one row and sailors for each SID So we can find SIDs from both of those two things and just intersect the SIDs There's the red boats temp red. There's the green boats temp green you take these two views and form their intersection and join it with sailors and Then you can project down to S name Right and in both cases the only column we needed was SID from those two things, right? Cool questions All right That's it for today actually so the summary is the relational algebra is this little bitty language It's a lot like the data flow systems that you are using like spark and MapReduce It's a lot like the query plan iterators we describe, but it's semantics. It's high-level language. It doesn't have an implementation It just has a meaning so but it is operational right We nest these things in order and that question that was asked by the gentleman behind the wall is exactly the kinds of questions You can ask about a formal language. What commutes? What associates? What rewriting rules can I use that get kind of obfuscated when you're dealing with an implementation language? So you don't tend to ask those questions of implementations. It's nice to have the semantics written down So it is operational you nest things, but we know the semantics and we can think about those nest things It is a closed set of operators So it's relations in and relations out and you can mix and match And construct these sentences out of it There's five basic operators and a couple natural compound operators and set minus is the only thing that's non-monotonic So I think that's everything you need to know about the relational algebra. You have a big homework due tonight And I guess what you might want to do in the next 15 minutes is I can stick around the TAs can stick around And if folks want to use these 15 minutes to come ask us questions, please do Good luck with the homework