 Okay, let's get started. All right. Today, we will begin our discussion of SQL, the structured query language, sometimes pronounced SQL. You can say it however you like. Just a bit of review from last time. We talked about the relational algebra last time in some detail. And remember, it has a somewhat operational semantics where you compose essentially a tree or a nested description of operators to answer a query. So it's sort of step-by-step. It's just that the steps are these big set-oriented tasks, right? Big disorderly set-oriented tasks. And then the nice thing about the relational algebra is it corresponds quite nicely to our notion of query plans and iterators, but it gives it a formal grounding that we'll be able to use for optimization. We also spoke but did not look at the relational calculus, which has a declarative semantics. So rather than having an order of operations at all, you simply describe in logic what the answer set shall include. I would like all tuples such that, blah, blah, blah. And you don't say anything about how to achieve that result. The sort of algorithmics are left to the computer, and the programmer, if you will, is really just specifying a description of the output in syntax, in logic, which is a crazy audacious thing to do, right? Step back for a minute and imagine that I told you that from now on you will no longer write Java or Scala or Python. You'll simply describe the outputs of your programs. Say, well, that sounds really hard. That seems crazy. But that's actually what SQL is about, for at least a limited set of tasks. That's exactly what we're saying. We're going to say, from now on everything's declarative. We will only describe the outputs of things. We will not talk about how to achieve them. And computers with all their smarts and extra cycles will figure out a good way to go execute the tasks we want them to execute. This is happening increasingly up the language stack, I would say. Declarativity is creeping into your programming languages more and more. And it's sort of the most well-studied and mature example of this is SQL. So relational calculus has this flavor of declarative semantics. These two things are simple and powerful models for query languages, but rather than look at the calculus, we're going to look at SQL as a much more widely used and easier to learn declarative language. A note, and this is review again on expressivity. An important question you should ask with any language is just what is sayable in that language. So in a query language in particular, they're often not touring complete. So you can't say everything you can say in C or assembly language. So what can you say? It's a question of computational complexity. What's the class of algorithms in terms of complexity, typically, that you can say with a query language? SQL is actually a very powerful language. So you can even say exponential things in SQL. But the basic relational algebra, for example, can't even say all of poly time. So depending, maybe if you have subtraction, it can actually. Take that back. Maybe it can say all of poly time. Oh boy. But in any case, it can't say exponential things. So there's a question here of kind of how much you can say. And oftentimes, the way we talk about that is what category of algorithms can we describe. So Cod's theorem said that the algebra and the calculus have equivalent expressive power. What that means for us is that we can take declarative statements a la SQL. And we can compile them down into operational statements a la query plans. So this pointed the way towards being able to take these high-level declarative statements and have a natural implementation for them. SQL is more powerful than the relational calculus. Or the relational algebra, therefore. But it can be captured with a set of operations and a way to string them together that looks a lot like the algebra. The iterators we have in our relational databases look a lot like the relational algebra. So you don't have to go a lot further than the bones of the algebra to get all of SQL. One key difference that we'll see today, there are others. But one key difference is that the SQL, by default, we're going to be dealing with multi sets of records, rather than sets. So each record can appear more than once in a relation in SQL. And we're going to care about that and account for it in all of our operations. And we may not talk about that in a ton of detail today, but it's going to be lurking there in the background. Just to give you a picture of how this stuff works in practice, typically you sit down and you write an SQL query. The query optimizer is going to go off and generate potentially many relational algebra expressions, query plans, if you will, that could execute that query. And they're all equivalent. So all of them will produce exactly the same answer, module of city order of the output. And so the goal of the optimizer is to pick the cheapest one. And later on in the semester we'll talk about how it goes about this process. But I want you to be very conscious of the fact that there's a large number of possible execution strategies for your simple SQL query. And you don't have to think about which of them is best. Inside of SQL there are sort of two sub languages. There's what's called the data definition language, the DDL. And that's for doing things like setting up tables and modifying them. Creating the schemas for tables and altering the schemas for tables, defining constraints on those tables, which we'll talk about a bit today and much more later this semester. But the juicy bit, the part that we're going to look at today is the data manipulation language, the DML, where you can write queries. The piece of SQL that we'll be interested in today. The database management system is responsible for the efficient evaluation of these declarative queries. And the key to being able to do this is having the semantics for the queries and Cod's theorem pointed the way towards the idea that you could unambiguously map from a declarative query to one of many possible operational plans. The optimizer is free to reorder the operations in these plans without affecting the query answer. And the rules for those optimizations, which one of you asked about last time, those rules will come clear over the course of the next few weeks. Okay, so today we're going to talk about SQL and we'll continue with this on Thursday. It's going to take at least two lectures to get through all the material in SQL that I want to get through. So we're going to do more than just sort of standard stuff. Why teach SQL? Well, it's the most widely used query language, other than, say, search, then free text search. So it's not just the most widely used relational query language. It's really the most widely used query language. It's well standardized and has been for a long time. So the core of it is something that is shared by most database implementations. Although most systems also add some extra bells and whistles, which are idiosyncratic. Postgres is an example of that. And then there's some recent systems, things like Spark SQL and Hive, actually, in the big data space, where the SQL isn't quite compliant because those systems are actually just so young that they haven't been pressured by their customers to comply with standards. I fully expect that they will become closer to the SQL standard over time. So learning SQL is a very, very safe bet. In this class, we're going to study the basic constructs. We're not going to learn every corner of this language. It's a very big language. Ironically, it was designed to be a very little language. And the guys at IBM who designed it, I know them, and they said that they're sort of horrified at what happened to it. It was supposed to be something that, like on the garage floor, could go get some parts from. It's supposed to be easy to read and easy to type. And it's turned into something that's really quite powerful. What you'll see even today, even on the subset we're going to look at, which leaves out all sorts of esoterica, like how do you deal with geographic data and how do you add user-defined functions and so on and so on. What you'll see today is that there's a bunch of subtleties that arise through things like duplicates and multisets and things like set minus, actually, and various forms of that, which complicate the language in subtle ways that make it a little testy. So we'll have a look at it today. All right, here's our little example database. We're going to have three tables, and they've got instances. So these relations actually have data in them. And I went ahead and typed these into Postgres last night, so we'll be able to run some queries against this stuff ourselves. Sailors, boats, reserves, all right, give me and make sure I didn't screw up last night. Oh, come on, full screen for me, please. This is going to be annoying if I have to do this every time. Is that it? No, that's the wrong window, that's why. Oh, come on. Here's my Postgres window. And by the way, you know, Postgres is set up for you inside your VMs, and I definitely encourage you to go ahead and try this stuff out at home. Everything in the slides as of last night pretty much cuts and pastes directly, and it works correctly. So here we go. Here's our tables, right, backslash d, and the Postgres prompt gives tables. And then if we want to look at one of them, you know, let's make sure our data's there. There's boats and sailors. And let's make sure there's some reservations. Awesome, okay. The reservations are in the future, which is kind of interesting. That's an accident. Okay, so here we go. So the DDL to create these tables, we're just going to learn enough to get you dangerous on the DDL side. It's not very interesting anyway. We're going to be able to create a table called Sailors. You give a list of columns and types. Types come after the column names. So SID is an integer. S name is a character array of width 20. For reasons that are somewhat historical and somewhat for performance, it's sometimes good to fix the width of your fields, a maximum width, just so things are more compressed on the disk. If you just say text all the time, instead of character with how long it is, then the system will not be able to have fixed length tuples and pack things as tightly. The one thing that you should note here, this DDL with real is probably wrong. That should probably be float. But beyond that, primary key SID is an important detail. So we're going to have what's called a primary key for this table. We'll talk about this more later. What we're saying is the SID column contains a unique identifier for every row, and there can be no duplicates in the SID column. That's what that primary key syntax says. And as we talked about previously, you can have multiple columns in the primary key, like last name, comma first name could be a primary key. So the primary key is actually a list of columns. Here's the boats table. It's got a primary key called boat ID or BID. And then here's the reserves table, and it's got a primary key, interestingly enough, which is all three columns. So each sailor, each boat, each day, that's unique. So a sailor can only reserve a given boat on a given day once. There are no duplicates in this table, is what that's saying. And you can set that up in the system. It will guarantee that these constraints about primary keys remain true. If you try to insert a duplicate, it will reject your transaction. Okay. All right. So you can think of these identifiers in this little schema here as pointing to each other. So the BID of reservations will point to the BID for boats. The SID for reservations points to the SID in sailors. Right? We'd like to formalize that with the system. We'd like these pointers to always be true. So we're going to tell the system about these in the DDL for the reserves table. We're going to say that SID in the reserves table is a reference to sailors. We call that a foreign key. And we're going to say that BID in the reserves table is a reference to boats. So the syntax is foreign key instead of columns, references table. Now, notice that we didn't say which columns in the sailors table it's referencing, or which columns in the boats table it's referencing. So that's the syntax here. When you say references table you have to have something that is the same type and the same number of columns as the primary key of the table you're referencing. Another way to think about this is how do you address the sailors table? How do you identify a unique sailor by SID? Because that's the primary key. So references to sailors must also reference SID. Think of primary keys as addresses if you want. The types of those addresses have to be used elsewhere. Okay, so there we have a table and by the way now the reserves table can only contain reservations for real boats that exist in the boats table and real sailors that exist in the sailors table. And this will be guaranteed again by the system if you try to mess with those references the system won't let you. So here's our first little query that we've done this in single table SQL earlier in the semester but methodically today with the book. So find all 18-year-old sailors select star from sailors as S where S dot age equals 27. Sure we could go do this in Postgres if we want but let's maybe not right now should we? No it's too hard to copy paste from these slides. We're pretty clear what's going to come out of that it's going to be Nancy and if you didn't want to get all the columns out then we've seen this. Moving on to join queries now and I've been doing this on the board but we haven't actually had time to study it we're going to talk about how to express joins a bunch of different ways today. Here's a reasonable way to express a join we're going to select S dot S name from sailors as S reserves as R so we're going to take the S name out of sailors where S dot SID equals R dot SID and that's our join predicate it's going to say we're going to match up things in the cross product if the SID of sailors equals the SID of reserves and R dot BID equals 102 so I want the name of sailors who reserved boat 102 and this one maybe real quickly we'll just run this in Postgres and there's Fred and Jim of both reserves 102 let's see if we believe that Fred is SID 1 Jim is SID 2 oh yeah oh it looks great over here though sorry about that rather than mess with the display which has been a challenge all along let's just do this let me shrink this this should be reliable from here on I think for the rest of the lecture good there's Fred and Jim there are the reservations at the top Fred is sailor 1 Jim is sailor 2 they've both got 102 so this is great okay going back to the slides generally speaking we've seen this before but here's some of the naming conventions from the book your basic SQL query is this block select distinct target list which is this list of expressions over the attributes of the table and the relation list the relation list is a list of relation names possibly containing these range variables which are the variables we put after the S clause so sailors as S I will point out the word as is not necessary actually in that particular location you can say select star from sailors S where S dot something equals the distinct clause obviously is optional it's whether or not at the end we remove duplicates and the qualification in the where clause as you know can be pretty much any boolean expression you like over the range variables okay so these are the syntax for a given query like this are pretty straightforward and they kind of work inside out so you start with essentially the from clause and the first thing you want to think about is the cross product of all tables and then over that cross product we evaluate the where clause to check conditions and discard tuples that do not pass those conditions and then the select clause removes any unwanted fields and computes any expressions that you want to compute the distinct clause which is optional now obviously this is a lousy way to execute these queries and a query optimizer will use join algorithms to do something better the more general flow though with group by and having kind of looks like this think about the from clause then the where clause then the select list and then think about group by and having and distinct so generally speaking when I read an SQL query I start sort of along these lines I look at the from clause and the where clause first third and I'll say that when I was doing these solutions to your next homework last night this is exactly how I wrote them I said select and then I hit return in my editor and started doing the from and the where and then it went back up and did the select and the group by's and all that stuff later so this might be a useful flow to think about when you're thinking about your queries find the cross product apply the selections project away any columns that aren't needed keeping only those remaining right having form your groups and aggregate so that the aggregates in the select list use having to eliminate any groups that don't qualify use distinct to eliminate any duplicates in the output and with that let's start looking at some queries we did this last time in relational algebra today we're going to do it in SQL and they're going to get progressively more interesting as we go find sailors who've reserved at least one boat well it's going to be a select from where block okay and it's going to involve the sailors table and the reserves table right right now what are we going to join on SID right there's an SID in sailors there's an SID in reserves it's that key foreign key relationship right where reserves references the primary key of sailors right and so what we've got out now are pairs of sailors and their reservations clearly any sailor that has a reservation has reserved a boat so we're kind of done right we just need those SIDs that list of SIDs are sailors who have reserved a boat at least once maybe more than one would distinct make a difference in this query if we said select distinct shall we try it I'm actually not sure in the database we have whether it'll make a difference or not but let's try it so let me copy the query oh this is the one that doesn't copy nicely alright we won't do that one this is the only thing that I formatted to copy because I wanted to do this silly animation do duplicates make a difference anybody could we possibly have duplicate sailor IDs in this query good so sailors dot SID is a primary key so each sailor can only appear once that might lead us to believe there's no duplicates that's good thinking do you all agree no duplicates possible in this query I see some heads shaking no you want to fill us in while you're shaking your head no right we're going to join up that sailors SID with one of possibly many reservations for that sailor leading to multiple records in the output in essence when you do a key foreign key join the table with the foreign keys that is the reserves table in this case is going to dictate how many copies of each key appear there's going to be one output for every reservation in this case because the sailors are just a look up in essence you look up the detail per sailor but the reservations could have duplicates on the SID so this could have duplicates and distinct would indeed make a difference and I very strongly encourage you to go play with this in the database and insert reservations for multiple sailors and you'll see that you get multiple SIDs out for those sailors okay a note about range variables we actually need them sometimes they're not just for convenience because sometimes you want to join a table to itself and then you need to be able to have two different loop variables if you want to think of it that way they're called range variables for the table so here's an example what are we doing we're joining sailors to sailors one version of sailors is called X the other is called Y where X dot age is greater than Y dot age and we want the name of the X sailor and the age of the X sailor the name of the Y sailor and the age of the Y sailor what's going on here what is this these are pairs of sailors where one is older than the other right it's sort of the dominance of sailors and the output of the query well this query we can do in the database so let's just have a look at it here we go right Jim is older than Fred Jim is older than Nancy and Nancy is older than Fred okay that's all perfectly reasonable but the point of this all was sometimes you do self joins and in your homework you will definitely be doing some self joins very natural thing to do particularly when you try to find relationships between objects within a single table alright in some sense what you're doing in all these cases is forming an edge in a graph right this is the older than tree if you will right or dag partial order so we're forming an edge between one sailor and another sailor where one is older than the other okay you can put arithmetic in your SQL both in the select clause and in the where clause alright and if you're bored and you don't have a calculator handy and you need to add some stuff up SQL actually allows you to just type at it without a table name come on where to go it's truly annoying okay well here it is I don't know how it chooses whether it's in mirrored or not but you can say select 100 times 3 and it'll give you the answer alright you can say select mmm I don't know there's a whole bunch of functions there's a whole expression language but it's also a sign of 1.2 etc okay so if you don't have a table you just get a default one row table as your input and it'll essentially generate you an output and you can do this in the where clause as well you can say select star from sailors where you know age times 2 equals 4 and I'm sure we'll have nothing in it alright you get the idea so SQL kind of predates the widespread adoption of regular expression standards so it has its own annoying pattern language and a comparator called like okay so this is sailors whose name is like bob or borb or anything else you can put between two B's okay underscore is like the dot in regexes it's any one character percent is like the star in regexes zero or more arbitrary characters and other syntax in most systems for using regexes instead of like patterns okay and Postgres has that most systems do I forget what it's called you can look it up so for regexes to this weird SQL like you should do that and frankly regexes are better like is pretty limited but it's there you can do string comparisons alright now we may remember this query from relational algebra alright let's look at a little Boolean logic find SIDs of sailors who preserved a red or a green boat so here's one way to do it select from boats and reserves join on boat ID so for every reservation look up the appropriate boat and boat.color equals red or boat.color equals green so this should work let's just double check that it does work if I can copy and paste there's zero rows that's rather disappointing in the sample data there's no answers but it is the right query alright here's another way to express the same query find the red boats, find the green boats and union them and the union syntax and SQL is just you put a SQL block above it and SQL block below it note by default union in SQL removes duplicates if you want to preserve the duplicates like there's two here and there's five here and you want seven in the output you say union all ALL by default it will remove duplicates find SIDs of sailors who've reserved a red and a green boat you remember this from the relational algebra expression right so this is going to find all the reservations it's going to look up the boats and then for each boat if its color is red and its color is green it will be in the output except there is no such boat no boat can be both red and green so instead you want to do the intersect query right that's equivalent and this will get you sailors who've reserved a red boat sailors who've reserved a green boat and the intersection of their SIDs there's yet another way to do this in SQL we could do it with a self-join alright so let's see how this looks what I want is pairs of reservations R1 and R2 such that R1 is for a red boat R2 is for a green boat and both reservations are from the same sailor right so that first clause makes sure that both reservations are from the same sailor the second and third clause is in the where just connect the boat details to the reservations b1.bid equals R1.bid and the last clause checks that the left hand boat b1 is red and the right hand boat is green and then we got an output which could have had the pairs of red greens for each sailor but we just projected it down to SID and we just got out the sailor IDs so this query in some sense does a little bit of extra work it actually pairs up those those red green pairs that are evidence of such sailors but then we throw those pairs away and we just keep the SID yeah actually this will do a bunch of duplicates right anytime there's a sailor who has let's say a sailor let's say bob rented one red boat let's say bob rented two red boats on two different occasions and three green boats how many times will bob appear in the output six times with every red green pair that's evidence that he rented both so good question this will definitely have lots of duplicates it will have kind of almost per sailor cross ID style duplicates cross product style duplicates excuse me make sense so if he had two green reservations three red reservations he's got six pairs of red green and again I encourage you to punch in some sample data into Postgres and give it a try there's nothing like hands on experience alright not in kind of detail so find SIDs of sailors who have not reserved a boat let's take all the sailors that have a minus which an SQL is called accept alright and we'll find the sailors that do have reservations make sense now accept like union doesn't care about duplicates it removes them alright you'll notice that SIDs only appear once in the top but we said earlier that that bottom query could generate duplicates in the SID if there's multiple reserves because we just said accept it doesn't matter it doesn't matter how many duplicates there are on the right hand side if there's only one of each on either side and you keep things from the left hand side where there's no matches on the right there is an accept all in SQL so you can take this query and you can replace it with accept all alright and I would encourage you to try that and see what happens populate it with different amounts make sure you have different amounts of reservations and different duplicates on the outside and see what happens so the multi-set cardinality how many copies there are of a record in these multi sets is something that's going to appear in all your SQL queries here's another way to say this sort of notion of things being in stuff so we talked about intersection now let's talk about a predicate called in there's a different construct in SQL typically called a sub query so we got an outer query which scans the sailors table and for each tuple in that sailors table we're going to run a sub query that's going to look up some stuff and then we're going to compare to see if the outer tuple is in the results of the sub query okay so for each sailor we can think about running a query on reserves for boat ID equals one or two and we'll output those sailors who've reserved one or two that way so the sub query is not joined in the sub query is checked for every tuple of the sailors table and in this case you can imagine that the smart execution of this executes the sub query exactly once and stores the output and then runs the outer query and looks in the output of the sub query for matches and you can do the same thing with not in which is a little bit harder to say with a join in fact I defy you to say it with a join okay so now we can find sailors where SAD is not in these are sailors who have not reserved boat 103 remember what I said about the relational algebra that set difference was the only non-monotonic operation the only thing where when a new tuple comes in the answer could actually get smaller so sailors who have not reserved a boat is non-monotonic if you get more reservations then you get fewer sailors in the output right and so not in is clearly not doable with just joins and selects and projects it needs to be some kind of set minus right so not in and accept are very related to each other they're both non-monotonic set minusy kinds of things in on the other hand usually can be rewritten as a join because it's truly monotonic right add thing more things to reserves you get more answers add more things to sailors you get more answers now here's a nested query that's a little more puzzling so here what we're gonna do is we're gonna it's a first of all another clause instead of saying in you can say exists okay so you can say select star from sailors where there exists a tuple in some query and if that query is non-empty then the outer tuple goes to the output so specifically I want sailors who reserved boat number 102 so let's look at this select S name from sailors where there exists something in the sub query what's the sub query? select star from reserves where reserved up boat ID equals 102 and look there's a reference to the outer query comparing against this inner queries tuples so in this case because of this reference to S within the sub query the sub query is kind of different for every tuple of the outer query you have to pass in the SID sort of like a function call into that sub query that sub query is parameterized by S.SID which means that you do really have to run it once per tuple of sailors assuming every tuple has a different SID which it does okay so this is not a case where you can pre-compute the sub query and just check its answers here you're really running the sub query once per tuple of the outer table does that remind you of a join algorithm we learned? for every tuple of sailors scan through all the reserves looking for a match what algorithm is that? that's a nested loop-striant it's a naive nested loop-striant it's a per tuple nested loop-striant this is a terrible execution strategy so a good database system will look at this thing and go I bet I can rewrite that as a join and use hash-striant or sort merge or at least page-oriented nested loops but the way it's actually expressed in SQL even though SQL is sort of supposed to be declarative it has a flavor that seems to imply that the binding values of the variable s.sid are delivered once per tuple of the outer table so most systems will rewrite this query into a join one way or another but here's a detail that you have to think about how many duplicates are in the output of this query well it's a query on the sailors table the sub query is just a condition in the where clause so the sailors table for every tuple of sailors it's either in the answer or it's not in the answer alright so there's going to introduce no duplicate sailors in the output so each s name will appear as many times as it appears in sailors alright so interestingly this exists clause doesn't actually generate an output tuple per match it just says are there any matches does a match exist and then an output is only one so the duplicate semantics for this is different than they would have been if it was a join if it was a join we would have had one copy for every reservation instead here we only have one copy for every sailor so unfortunately when you rewrite these things you do have to worry about the duplicates and you know respect what the user asks for so systems that rewrite this under the covers have to be really careful to make sure that the duplicate semantics are preserved alright a couple more set comparison operations to think about we've seen in and exists in the examples in the previous slides and let me like right up on the board the template of these things so the sub query template is you know select stuff from tables where you know some tuple variable so let's say some tables and one of them is as t where t dot you know value comparison sub query so this comparison is a comparison between a value something of type like integer or float or string and a set of tuples that's the type of these comparisons so the comparisons can be things like in value in table and that's true the table had better have only one column of the right data type right so if this is an integer this must be better be a table of one column of type integer same with not exist so there's not exist there's not an as well as exists an in and then exists doesn't have this at all right where exists it doesn't actually need an outer value here because it's not a binary it's a unary operator you can also say operator any and operator all so like equals any t dot value equals any value in the sub query which is a lot like exists but if you had greater than any it would be different than exists so you can have any boolean operator there so let's look at an example of that find sailors whose rating is greater than that of some sailor called fred find sailors whose rating is greater than any rating of a sailor called fred everybody good alright we're about to do a more interesting one so let's take a break everybody stretch hello alright so here's a hard query there's actually something I didn't bother to teach you cause it's annoying that we're going to see an sql but in relational algebra there's actually an operator called this it's a macro so it actually isn't required in the algebra but it's a weird little macro called division so relational division you can find all the things such that well I'm going to get this wrong off the top of my head this is a relational division query though and what we're going to do is we're going to see it in sql and you have to block it out in pieces to get it right seems like a harmless little query find sailors who have reserved all boats alright so sort of sailors divided by boats find sailors who have reserved all boats so here's how it's going to work this is actually the answer if a boat not exists let's go through what's going on here find sailors s that's the outermost query where there does not exist a boat b that is missing a reserve's triple showing that s reserved b it's a double negation and this is actually how you define division to find the sailors who have reserved all boats you find the sailors such that there is no boat b that they haven't reserved this comes up like this is the thing you might have to do now and again alright anyway this is fun if we did relational division you'd say wow yeah I've seen that one before but anyway here's how you do it in sql alright at this point we're going to sort of drop a little bit off of the edge of the relational unordered semantics it's sort of great a lot of the time that relational query languages are about sets and multisets because it means we can parallelize stuff and we can do things out of core because we don't care what order things happen in right so the order of the set at the output doesn't matter so if we want to like spill some tuples to disk and deal with them later that's fine because it doesn't matter what order things happen in if we want to split things up and run them on multiple machines those machines can run independently at whatever speed they want because it doesn't matter what order the answers happen in all that stuff is great every once in a while though you want to ask a question that's fundamentally about order like I want the most popular thing alright sometimes called the argmax query I want the thing that is the maximum alright so I want the sailor with the highest rating then you can worry about the ties thing before and you can do it in sql it's just kind of obnoxious so you can say I want the sailor such that their rating is greater than or equal to the ratings of all other sailors alright that is in fact the sailor with maximal rating here's another way to say it using an aggregate I want to find the sailor whose rating equals the max of ratings alright notice this is weird this is actually a use of equality with the subquery s that rating equals subquery so from a typed perspective this doesn't seem right this is a integer equals set of integers a really set of integer tuples seems broken and this only works because it's a max query with no group by so it has exactly one row in the output and so really it's just a tuple containing an integer which you might as well just unwrap and treat as an integer ok the group by in that select clause on the interior in the subquery this would be illegal and it would get rejected by the parser but because you can tell from the syntax of that subquery that returns exactly one row with exactly one column of the right type this equality is actually legal alright but here's the way you really want to run this query which is cheating from a pure relational context but it's the obvious thing to do if you're like a normal person which is find all the sailors sort them by descending rating and give me just the first one so this works this limit clause is there in sql it works it's supposed to be sort of just for display in particular you can't put a limit in a subquery ok so if you actually for some reason needed to find the maximal sailor and then do something with it you can't do that I don't believe in a subquery this is really only at the outermost level and it's supposed to be just for display because after all everything's just a set so order by should only be at the outermost level ok but this is the obvious thing you want to do lots of times you will very likely use limit clauses all over the place when you write queries in the real world sorry which subquery do you want to run top right select max s2 dot rating from sailors you can do that that query? make sense let's look at the sailors table to make sure we believe that 8 is the biggest thing in the rating column right are we good ok so this limit clause is a little bit of a hack in some sense it's not really robust you can't nest it inside of other queries ok here's another thing that's complicated about sql that kind of makes your head scratch a lot and it has everything to do with horrible stuff like is an in query the same as a join query nulls cause a lot of headaches but it was considered a useful thing and it's in many of your programming languages as well for every data type in sql which means I don't know it's uninterpreted so it doesn't necessarily mean missing it doesn't necessarily mean it's not any given value you can think of it as a value that hasn't been filled in yet if you want so there's a special null value in every data type for this but it makes it annoying when you're doing querying in a bunch of ways so first of all in your where clause if you want to check for nulls you have to say where value is null and where value is not null equality predicates will talk about in a minute but you can't say equals null cause null is not a value null is a missing value so you can't say equals null assume that you have a rating that actually is null so let's try this out if you have a predicate rating is greater than 8 what do you think should be going on here if I do that then I get the big one so let's insert into sailors values um s84 we will name this sailor papai and we'll give him a rating of null and we'll give him an age of 100 is it old sailor ok we want to know the question if rating greater than 8 applies to papai where a rating is greater than 8 how many of you would say that papai should be in the answer how many of you would say that papai should not be in the answer ok papai not in the answer nobody in the answer in fact ok so far so good null is not greater than 8 does that mean that null is less than or equal to 8 so you think it should be right if he's not in this query he should be in its compliment man he's not there either where's papai right there's a little weird this should cause you some pause because it's going to screw up a lot of your thinking about about negating predicates and doing de Morgan's laws and stuff like that right not exists something for all the negation of that thing may not work with nulls let's try it let's say we're rating equals null alright null is a value so that's syntactically legal but nobody's got the value null because null is not a value hahaha alright his rating is null I don't know what value it is it's not null it's also neither greater than or less than 8 so it's just unknown it's a little weird and it affects not only comparisons it affects other things in the language too so let's keep going with this and we can play with it as we go so rating greater than null false what about and or and not oh man let's draw some truth tables and truth table you guys have done truth tables I assume right except we've got a three valued logic now we got true we got false and we got null we got true and we got false we got null what's true and true what's true and false what's true and null let's hold that thought for a minute what's false and true false and false is false what's okay let's face it what's true and null it's null because look I don't know what it is if it's true then it's true so this guy turns out to be true then it's true and if this guy turns out to be false then it's false so it's null okay what about null and false it's certainly false it doesn't matter what this value is right because anything in false is false and null so this is false too and null and null is clearly null all right now you have to do this for or and not right not too hard okay what about sum what is the sum of ratings well I actually got bit by this doing either prepping for your homework or prepping for my talk at this conference this week but this is kind of a gotcha so select sum of ratings you know what's going to happen here right this should be easy from sailors what's the sum well 7, 2, 8 and null what are those add up to kind of seem to add up to null right that would make some sense but it would be too irritating if we did that because aggregations you could have 10 million rows in your table some of which people were too lazy to answer the value right you'd still like to see the sum for the rest of them thank you very much okay what about the average well actually okay select let's try this count star from ratings what's count star what should the answer to this be do you remember what's in table here's what's in the table oh no from sailors sorry here we go alright what do you believe the answer to this is how many people say 4 how many people say 3 4 cool good you're very smart what if I say rating how many people say 4 how many people say 3 haha okay you're beginning to see a pattern here so what would the average be the average is just going to ignore it doesn't affect the count or the sum so it won't affect the average right it's like the null was never there and that is the sequel semantics for nulls and aggregation you eliminate them before you do any aggregation you just take them out of consideration and the rule so that's the rule for aggregation you ignore the nulls before you aggregate the rule for where clauses for selects are you carry all your nulls through all your logic very carefully all your n's and your or's nested through your expression and then at the very very top if the final evaluation of the expression tree is where null we can just type it select star from sailors where null it's neither true nor false but it is a boolean value alright so null at the outermost part of the clause is treated as false so in particular you could say you know where rating well here's another one rating equals null there's no rating that equals null so here let's say where null and true right null and true equals null yeah alright but if you say select null and true it gives you back a null there's a row there alright it's got a null in it so it's in the select clause it'll pass you back this null but if you say where null it throws away tuples alright so the value null is fine it's just that if you put it in a where clause at the very top it evaluates to false alright this causes no end of difficulties when you are trying to do things like turn correlated subqueries into joins which we won't do in this class but that was my first job after college it was a big hassle we wrote a paper about it it was fun okay let's look at joins there's different kinds of joins you may have heard about this or asked about this at some point there's inner joins there's three flavors of outer joins and there's natural joins alright which I didn't even put on this slide so we'll just go through them the inner join is the join you're used to it's the theta join alright it's the default it's the kind of join we've always talked about okay here's three different queries that are the same the first one is in the syntax we usually use where the from clause just lists the table names and the join predicate is listed in the where clause the second syntax if you prefer you can say sort of infix notation sort of where you say s inner join r on and then you give the join predicate which is in the on clause of that so it's a infix to say inner join but it's postfix to say what it's on it's a syntax some people like it it's fine so it's equivalent to the previous one and then this third thing well we saw this in SQL right natural join it's gonna it's gonna happily form a where clause for us of all the matching column names and it's gonna throw away that extra copy of the column name so let's just do this in postgres just to see what it looks like real quick so here's the first version of the of this thing right traditional here's the second version with an inner join syntax alright and here's the third version with a natural join syntax alright they all give the same answer that last one though if we said select star oops I didn't want to do that I can't do that hmm it's not gonna work there we go let me try this one more time copy paste paste thank you go to the beginning no alright I might be able to do this it's too much work psqlcs186 select star from sailors natural join notice that sid only appears once as we talked about with the algebra sid is in both tables but there's really no reason to keep two copies of it in the output so you get all the columns select star but sid only appears once whereas if we did it select star with the inner join syntax you'd get sid twice so natural join is kind of handy if you happen to know the column names I don't recommend using it in your code though because it's not very robust to schema changes if somebody adds a new column to one of those tables it happens to match the column name in the other table kaboom your queries are going to start behaving really weird so I don't encourage you to use natural join I've actually never seen it used in the field but it's in the language alright here's some examples I don't know that we need to go through them this is the traditional inner join the interesting thing to do now though is to talk about outer joints so the left outer join is sort of the easy one to start understanding so what's going to happen is when we say a query like this we want sailors I give it a new table name which we're not going to be able to use unfortunately sailors s left outer join reserves are what I want is that if there's a tuple if there's a sailor there's a tuple in the sailors table that has no reservations there's no matches to it I still want it in the output I want to see that sailors id I want to see their name now they haven't reserved anything so you can't give me a boat for the tuples on the left hand side of a left outer join that is to say sailors here if there's matches it'll show you the columns from the sailors table and nulls for the columns from the other table okay so here's an example if we join these two tables together and I accidentally blew these tables away before class so I can't run this query for you but if you join these guys together you'll see that Dustin and Bob has a reservation but Lubber has no reservation and so the output's going to look like that yes well I didn't understand that I'm sorry let's try again okay or try it out you can insert some data in your database hold that thought you can do the same thing the other way right outer join if for some reason the thing you want to preserve is on the right hand side of your from clause infix notation that's fine here here if there are boats that have never been reserved it'll output those boats with null reservations right so boat ID 102 was never reserved boat ID 104 was never reserved so they're preserved in the output and as you might expect there's something called the full outer join which preserves on both sides of the join if there's no matches and so in this case we'll get the same answer we got the right outer join because there actually aren't any reservations excuse me there aren't any reservations for a non-existent boat if there had been reservations for a non-existent boat we would have the reservation ID and empty and nulls for the boat fields but we can't have reservations for non-existent boats because we declared referential integrity on this table we said if a reservation references a boat that boat must exist okay so in this particular example schema there's no way to have a reservation that doesn't match but in general the full outer join would preserve from both sides good outer joins are super common when you're exploring data like you have a bunch of information about sales transactions let's say and you want to find out for each one the information about the customer that goes with that transaction but maybe your customers tables missing some stuff that was in your sales transactions doesn't mean you don't want to see the sales transactions that means you have to say I don't know about that customer so it's very common when you're looking up a foreign key to do an outer join with that foreign key table because if you don't find a match in that foreign key table you still want to see the stuff you're looking up it's also very common when you're building web forms to be putting together data from multiple places wanting to make sure that the cells that don't have matches at least show up in the user's eyeballs at the output so a lot of times when you're doing a display or data exploration outer joins are very natural the other thing I'll point out is that inner joins are kind of nefarious they drop data sort of silently if you expect that for every tuple here there's going to be a match over here and there's not you might do stuff wrong like count up the results of the join or sum up some things in the join not realizing that you lost data over here so outer joins are really quite useful and I encourage you to think about in almost every situation that outer join isn't what you want alright now you sit down and you write all these queries and sometimes they're long and you might want to reuse them so there's a way to give queries names much like you give a block of code a function name in a traditional language you can give a block of SQL a view name they're called views in SQL so you can say create view you give it a name as and then you give it the select statement it's simpler because you can reuse code you wrote before and you can have modularity you can have little blocks of code that are encapsulated in a name and you can stop thinking about them they're often used for security because the authorization model that is to say who gets to access what in SQL is based on table names so and I'll show you an example of this in a minute an important note is that views by default are not materialized that means they're not stored in a functional language a function would be evaluated only at the time of use so the contents of a view are not stored if they are stored it's called a materialized view and some systems will let you declare materialized views that's got its own wrinkles that we won't talk about right now but for now assume that views are not materialized so here's an example of a view called reds it's taking from the boats table and the reserves table all reservations for red boats and then it's counting up the number of reservations for each red boat so you can think of it as the count of reservations per red boat some statistics about the red boats and on the next slide for whatever reason I changed it to a red count I apologize about that so it's up there at the top should really be called reds I guess here's an example of the output and then you can reuse it in queries you can take this reds on the red count table and you can join it against boats you can match on boat ID here and you can look for counts less than 10 so here I want to find boats that are red and have fewer than 10 reservations so I can use the red count for that what do I want to say before we go on am I going to talk about this later I think I am yeah I'll talk about it in a minute good here's another thing you can do for encapsulation if you don't want to define a view a view after all has a name and it lives around forever and other people could potentially look at it if you don't give it security maybe you're just going to use the query just once but you still want a little modularity you want to kind of tuck it aside from the rest of your query block you can actually put your view in line in the from clause so here's an example it's like it's a query it's in red and we're going to name it reds so it's the as clause here is giving it a range variable name and also defining a schema for it so you can have column names for it otherwise the second column of this which is count star might not be defined so you can throw queries into the from clause kind of on the fly if you want just put an as after it and then you can reference them below you can see in the where clause we reference reds.bid good? there is a third way to have these queries defined separately it's the with clause and it goes before your select so this is the typical it's also called a common table expression I'm not exactly sure why and so you say with you give it a schema as you give it a query and then you can say select a scoped view definition it doesn't persist outside the scope of this query but in the scope of this query that thing is called reds and with clause is actually a little more readable than this stuff I find these things are really hard to read because you get all that SQL and that's inside your SQL here it's sort of a little more linear you define your sort of views temporary views if you want and then you do your query with clause is also we won't do this exactly in this class recursively which is really confusing and so I won't talk about that right now but you can actually have a with clause define a table and then reference that table again in the select clause recursively so I won't talk about that right now but it's pretty fun maybe we'll do that as an advanced topic lecture okay but this is just the simple syntax for temporary views essentially you'll want to do this in your homework you can make you write a whole bunch of SQL in your next homework and some of it's going to be complex and you're going to want to use things like with clause to make it more modular alright SQL has a very simple access control policy which is part of the DDL which you can grant privileges to objects so in SQL everybody when you log into Postgres you are a user alright and there are other users in the system and you can grant them access to stuff and the privileges they can have and the objects they can have are pretty simple so the objects can be tables or views that's it those are really the only things of substance in the system so you can give people privileges on tables or privileges on views and the privileges can be select insert, delete references are they allowed to make a foreign key reference to this table that we promise to keep good alright or all which is all the above so the notion of grant is revoke okay so you guys revoke permission privileges that you've given before the with grant option is transitive you grant them this permission and you grant them the permission to grant the permission okay so that they can do it users can be users there's also a notion of groups in SQL much like in Unix so there's a way to give access to groups this seems kind of clumsy and arguably it is kind of clumsy the thing to keep in mind is that views are actually very powerful like with a view remember from the lecture on relational algebra here's a big table right it's got a million rows and 27 columns using a view I can boil it down to just one cell right I can create a view that selects for this row and projects to only this cell and then that's the only thing you get to see so I can get pretty fine-grained access control with views down to the cell level views that do things like summarize the data for example I could let you see the average grade on homework 2 for all students and I could grant access to that to all of you but you wouldn't have access to each other's grades you don't only have access to the average right so I can actually define access control on aggregations of data even though I don't define access control on the data you have to be pretty careful about that if you're trying to preserve privacy there's been a lot of fuffle about this over time but the standard example is suppose that I gave you access to certain data grouped by a whole bunch of things and then counted and the certain data was occurrences of aids because this I think example happened so occurrences of some disease that you might not want people to know you have for insurance reasons or whatever like county and if you can select this data set down to find me there's only one person in this county there's one occurrence of aids in this county and there's only one person in the data set in that county then you know who is the person who has the disease so you have to be a little bit careful that people can't figure out individual information from statistical information which in general is a hard thing to protect against that was a client sub the general point though with SQL is you can define these privileges on any arbitrary SQL you like including aggregations ok a couple more important useful topics that we'll do today before we get into the juicy stuff on Thursday constraints and then embedding SQL in programming languages ok there's the opportunity in SQL to define what are called integrity constraints on your data integrity constraint is a Boolean expression that every legal instance of the relation must satisfy which means that if you try to insert or delete or update the database in a way that violates the integrity constraint the database will reject your insert, delete or update so the database is going to promise to maintain these constraints for you ok so these are often very useful for ensuring application semantics like let's make sure that the sailor ID is a real key because if SAD is not a key a whole bunch of my app logic will break I mean I'm assuming this is the essentially the name or the unique address of the sailor and if it's not unique anymore things are going to kind of blow up so that key constraint that we put in that primary key clause that we put in that create table statement is a form of integrity constraint similarly the referential integrity constraint well we're going to do that in a second sorry there are some differences like the sailor name has to be a string alright let's make sure that's always true or age has to be less than 200 you can have constraints like that as well so you can have constraints based on some values that you understand in the domain so the types of the constraints that we'll look at the types of integrity constraints we'll look at domain constraints which are that the field values must be of the right type so when you create a schema if you have a column of type integer that's called a domain constraint the domain of that column is integer the database will always enforce that for you so SQL schemas kind of come with domain constraints primary key constraints we just talked about will be enforced by the database foreign key constraints are also enforced so if you say reserves has a foreign key to sailors that means you cannot have a reservation for a sailor that doesn't exist and the database will enforce that foreign key constraint from reserves to sailors that we saw early on in the first few slides of today's lecture here's the reserves table it's got an SID and a BID and a date here's the sailors table it's got an SID and a name logically when we said there was a foreign key from reserves to sailors and this was the foreign key what we were saying was that for every cell in the SID column here there's a valid sailor this is the primary key of sailors this is a foreign key to it and referential integrity is this foreign key constraint that this reference this foreign key must point to a valid sailor okay and then you can have general constraints which you can just define anything really and I'll show you that in a minute before we look at the examples where do these integrity constraints come from they come from the programmer trying to capture the semantics of the real world so the reason you built a database damn it was because you wanted to keep track of some stuff and that stuff mattered to you somewhere somehow okay and when you wrote down what the stuff looks like hopefully if you were careful you wrote down some rules about the stuff like your bank balance cannot be less than zero why? because that's how the bank works okay you cannot have two biological mothers why? because that doesn't happen in nature does the computer care? no the computer doesn't care you have to tell the computer if you give the computer a data set a person with two mothers what's the computer supposed to do well if you don't tell the computer that sounds fine if you tell the computer then it should do something about it so the note is we can check these integrity constraints on an instance I give you a database I give you an integrity constraint you can tell me is this database legal or not is there a person in here with two biological mothers if so it's illegal but you can't infer an integrity constraint by looking at an instance suppose I give you a database instance and everybody's first name is Bob that does not mean that first name shall always be Bob okay just because I give you a table where something is true doesn't mean it must be true integrity constraints have to be written down by a person with intent we don't mind them out of data you can mind them out of data which is interesting but generally speaking not correct it might be worth doing because you're curious what's in this data but that doesn't mean it's what you want okay key and foreign key integrity constraints are the most common ones you'll use and then more general integrity constraints will be supported too putting in your primary keys is super important and useful so you really don't want to get duplicates where you don't want them in your tables strongly encourage you to put in primary key constraints foreign key constraints are nice too sometimes they're a pain because the database enforces them when you don't really want it to enforce them you'd be perfectly happy to have a dangling reference in essence and you're annoyed that the database is not letting you okay so keys key constraints keys are a way to associate tuples in different relations right so here's enrolled in students it's a different example the enrolled table we have SIDs in the students table we have primary keys that are SIDs okay and the enrollment records are pointing in essence to the students table right that's a primary key that's a foreign key alright primary keys now let's be a little careful and we'll review this again next time but let's end with a little definition a set of fields we'll call it a super key if no two distinct tuples can have the same values in all its fields so any collection of fields that can serve as a unique identifier for a row is a super key now a set of fields is a key if it's a minimal super key just to say it's a super key and no subset of the fields is a super key you need every one of those fields to make it be a super key then it's a key not just a super key so super keys are kind of sloppy keys are you can't take anything away from okay can you have more than one key for a relation? I'm sure you can so for example in a students database we might have SID and email and we might have you know a pretty reasonable view of the world it might be used by two people so those might both be keys and one of them will just get chosen arbitrarily by the database administrator to be the primary key and they'll name it as such they'll say primary key SID the other key is referred to as a candidate key it's a key it's not just a super key but it's also a real key it's minimal it's just not the primary key it's another one of the candidate keys good key even first name last name is that a good choice for a key? it's not right there's other Joseph Hellersteins in the world and I'd hate to prevent them from taking my class for example there's very few though it's only like 180 people with the name Hellerstein in America I looked it up the other day it's exciting the set SID GPA you might want to be a super key it probably is a super key because SID is a key so SID GPA is a super key but SID all by itself is also a super key and it's a key because it's minimal make sense? alright I think we're out of time for today we'll come back on Thursday just a note I will not be here I'm going to be at the Strata big data conference down in San Jose Professor Franklin is going to sub for me he knows SQL pretty well so he should be able to pick up from here if you have questions and you feel that he didn't answer them satisfactorily let me know or let TAs know and we'll sort you out I will see you next week homework is being passed out tonight it will exercise your SQL talk to you soon