 So it's Wednesday night, the fight's over, Matt's banged up, so is the other guy. I decided that rather than me filming the second lecture all over again in my motel room, that it would be better just to show you guys what I did in 2018, because that was in front of a live audience and I felt that I could do a better job than what I would do by myself. So with that, that's why I'm re-showing the 2018 lecture here for the second lecture and then we'll be heading back to CMU and then starting Wednesday next week we'll have live lectures again, so take care. Today's lecture is on Advanced SQL and by advanced I mean going beyond what you may or may already know about basic SQL, right? It's 2018, SQL was invented in 1973, I imagine most of you have seen some SQL throughout your life, so I don't feel it's necessary to teach you the basics of it, so I wanna spend time actually talking about the more complicated or interesting things you can do with SQL, right? So just to pick up what we left last class, we were discussing the relational model and relational algebra and we sort of mentioned that with relational algebra, the goal was sort of at a high level to describe what the answer we wanted that the data system would compute rather than the exact steps of actually how to do it, so the way to think about this is like say we want to sort our data, if we have to tell the data system exactly what to do, we have to provide it with the quick sort of bubble sort algorithm, but with a high level language or declarative language we just say, hey, we want you to sort this, we don't care how you actually do it, but this is the answer that we want and we'll see this throughout today's lecture and then going forward throughout later in the semester, this is one of the advantages of using something like SQL or declarative languages, we don't have to tell the data system exactly how to do things, it can figure it out on its own and that frees it up to figure out what the most optimal way it is to actually execute the query that you wanna execute based on the data that you have and the hardware that's available to you. So we'll see this later on when we talk about the query optimizer, but the query optimizer is this complicated piece of machinery inside our database system that's gonna take our SQL query and convert it to the most efficient plan and we'll cover how they actually do this later on, but I will say is that if you get involved in doing research on query optimization or just help working on query optimizers, you can get a job yesterday because this is the one thing that all my friends at Davis companies email me about open over again, do I have any students that do query optimization? We've had some and they all go off and do awesome jobs. This is the one thing that everybody wants because it's really hard to do. So that's not the focus here, just in the back of your mind, we'll see examples as we go along where a query optimizer could try out different things to try to come up with different plans. We'll discuss how you actually do query optimization later on. So the history of SQL, as I said, goes back into the early 1970s. So you may notice that I say SQL, some people say SQL. The part of the reason I say SQL is because not that I was alive in the 1970s, but the original name of the language was actually spelled out S-E-Q-U-E-L, SQL, and this was invented by IBM as part of the System R project. So it stood for the structured English query language. So if you remember from the Ted Codd paper, I said that Ted Codd was a mathematician. He devised the relational algebra and relational model, but he didn't actually define the programming language you would use to write queries on it, and you can't write queries using relational algebra. It's very difficult to write that in the keyboard to do that. He did later define or came up with his own query language called Alpha, but that was much later in 1970. So back then when people said, hey, there's this relational model idea, we should actually try to build a system to do this. People had to come up with their own language that could implement relational calculus or relational algebra. So at IBM, they came up with SQL. This was part of the System R project, which is one of the first relational database systems that people were trying to build in the 1970s. The other major one was Ingress that came out of Berkeley. So you've ever heard of Postgres? Postgres was invented by the same guy that did Ingress. So it's called Postgres as in Post-Ingress, the thing that came after Ingress. So the Ingress guys had their own language called Quell, and this was developed by one of my advisors, Mike Sternbreaker. He claims it was much better than SQL and the IBM guys didn't know what they were doing. Of course, most of you have probably never heard of Quell, right? So it didn't actually win IBM won. So back then, again, it was spelled out S-E-Q-E-L. IBM later got sued for, I think, copyright infringement or trademark infringement. There was some guy in England that had the term SQL spelt out in the full English word for his programming language. So it is shortened it to be SQL. So what happened was the reason why we used SQL today, because IBM is not, back then IBM isn't what it is now, right? Everyone thinks the big tech companies like Microsoft, Amazon, and Google. Back in the 1970s and 1980s, IBM was the juggernaut. So essentially whatever IBM did or said this is the way we're gonna do it, that ended up being the standard. So when IBM first released their first commercial relational database system, DB2, so they never actually released system R, it's just a research prototype. But then they finally made DB2, released that. DB2 supported SQL, so that essentially became the standard. And the reason why Oracle sort of took off and got as big as it is today is they were sort of copying what IBM was doing in the 1970s in more ways than one. We can talk about that later. And they had SQL, so when IBM came out with DB2 and had SQL, Oracle was at the right place at the right time, said we support SQL too. So it became an anti-standard in 1986 and became an international standard in 1987. And now the short version just means the structured query language. So SQL, even though it's from the 1970s, it's not a dead language. It's not certainly static. It's sort of like C++, they keep releasing new specifications every so often. It's the same thing in SQL. Every so often there's a new specification where they add in new features, new functionalities to the basic language. So the latest standard is defined in SQL 2016. And you can see over the years, as they add new versions, they add new features. So in 2016, they added JSON polymorphic tables. They add XML stuff, and then 2003, 1999 added regex and triggers. Typically what happens is there's a standards body that the members are all from the major database companies. And the major database companies come up with their own proprietary features and extensions, and then they go in the standards body and push to try to get their version of certain functionality as part of the standard, right? So this is, although there is a SQL standard, nobody actually follows it to the T, right? Because everyone sort of has their own proprietary things that got invented before the standard came outside. This is how you should do things. So if you're gonna claim that your database system supports SQL, the bare minimum you need to have is actually what is defined in the SQL 92 standard. So this is what the basic SQL that we know about today, select, insert, update, delete, create tables, transactions, things like that. All that's defined in SQL 92. So again, if someone says their database system supports SQL, chances are they really mean this. And then the more advanced databases, both in open source and the commercial ones, they have a bunch more features from the newer standards. And then there's this great website here. It's a bit dated now, but it should take a look at. There's some random dude who basically looked at sort of the top four, top five database systems and just looked to see how they differ on various SQL features or their various SQL functionalities. And we'll see this as we go through today. There'll be some examples where the standard says one thing, different database systems do other things. It's usually my SQL and this is just gonna, even though there's a standard, nobody actually follows it exactly. There's no database system that I'm aware of that would claim that they're certified for SQL 2016. They have bits and pieces of things. So SQL itself technically is not a single language. It's sort of a collection of things. And in particular, it's a collection of a DML, DDL, and DCL commands. So DML would be the data-manipulation language would be the commands of like insert, update, delete, selects. Like the things that actually manipulate the data that you're gonna store in your database. The DDL is the way you create tables or define schemas to actually store things. And then the DCL is the way you sort of do security authorization to grant who's allowed to read what data. There's a bunch of other things like how to define views, how to define integrity constraints, referential constraints, as well as transactions. These are all part of the umbrella of SQL, right? And within that, there's these different categories of commands. So the one important thing that I wanna point out here, and we'll see this throughout today's lecture, is unlike in relational algebra, which is based on set theory or sets, SQL is actually based on bag algebra. So the way to sort of think about this, you can have lists, sets, or bags. So a list can have duplicates, but there's a defined order. If I push something to my list, that's its position in that list. A set is unordered, meaning the elements don't have a position. But you can't have duplicates. If I try to insert the same thing into a set, the old one gets overwritten. A bag has neither a set position or ordering, but it also allows for duplicates, right? And we'll see why we have to do this as we go along. Because if we wanna actually define order on our elements, or if you wanna make sure that we don't have duplicates, essentially the database system has to do extra work to provide that for you. And so the idea is that only if you explicitly ask the database system to provide you ordering or provide you to remove duplicates, it won't actually do it. And this will actually make things be more efficient. All right, so the outline for today is we're gonna cover aggregations, group buys, a bunch of operations on strings, dates, and times. Then we'll have to do output control. And then the more complicated things will be nested queries, common table expressions, and window functions. So for homework one, you'll need to use all of these except for the window functions, right? Cuz you're gonna be doing homework one on SQLite. And only actually the latest version of SQLite as of last week, just added support for window functions. But everything else, SQLite should be able to support. Okay, all right, so for this, we're gonna use a sample database, comprised of three tables. So it's sort of a mock university. So we're gonna have a student table where students have student IDs, names, logs, and GPAs. We'll have a course table with course IDs and name. And then we'll have an enrolled table where we have a sort of foreign key reference from the student table and the course table. And along with the grade that the student got in the class, all right? We'll use this as a running example as we go along. All right, so the first thing we talk about are aggregations, right? And these are pretty simple to understand. It's basically a function that you define in the output list of your select statement that's gonna take as input multiple tuples, a set of tuples. And it's gonna compute some kind of aggregation on top of that and produce a single result, right? So the SQL92 standard defines average min, max, sum, and count. And again, think of this as like, in case of count, you're gonna take a bag of tuples as the input. And you're gonna count the number of tuples there are. And you're gonna produce a single output that has that count, right? So again, so this is what's in the basic standard. The later version of the standard and in other database systems, they'll have other things like median mode, standard deviation, right? They'll have different kind of aggregates and some of them actually allow you to find your own aggregations as well. So let's look at an example like this, right? So say we want to just count the number of students in the student table where the login ends with at CS. I'll cover what like is means later on, but essentially it's just looking for a wild card here, right? So the first thing, the most important thing to remember about aggregations is that the aggregation can only appear in the output list of the select statement, right? So I have it here in this as saying this is what I want to produce at the output. I can't have it in these other parts here, right? Cuz it doesn't really make sense, right? Cuz you're computing aggregation after you've sort of applied a filter to figure out what tuples actually match your wear clause. But the same point out here is that in the case of count, then we just want to count the number of tuples. The login field here doesn't actually mean anything, right? Cuz we're just counting the number of tuples. It doesn't matter whether what the login actually is at the point after we do the filtering. So we can rewrite this just to have a star, right? The star is a special keyword in SQL that basically says all attributes for the tuple. We can go even further and say we can actually replace the star with a one, right? Count the number of tuples by just adding one every single time, right? So this is a good example where we have three different queries that all produce are semantically the same to produce the same result. But the databases could choose different flavors or different variations of this in order to derive that answer. And some of them may have different performance differences. This one's pretty simple. So most of the systems we smart to realize I don't need to copy around the login here, I can just replace that with a one, right? Most of them will do that. But for more complicated things, it may not always work. We actually can combine the multiple aggregates together in a single query. So say for this one, we want to get the number of students and their average GPA where the login ends with at CS, right? So now you see I've combined average GPA and count in my output list for my select statement. And then it'll produce my result like that, right? Pretty straightforward. I can also add the distinct keyword to tell it to only count the distinct elements or values of attributes for my tuples. So this is saying count the number of students with unique logins from the student table where the login ends with at CS, right? And so I have the distinct side of account. Now this one's sort of nonsensical too in some ways because presumably no two students can have the same login account, otherwise you'd have problems. But in other cases, you can apply the same kind of thing for other scenarios and it would work the way you want it to work. And in this case here, we produce the same result. So the one thing that now you want to maybe want to try to start doing is now that I'm starting doing aggregations, I want to get additional information about my data outside of just what I'm computing in my aggregate, right? So say that I want to get the average GPA of the students that are rolled in the course and I want to know what that course ID was, right? So in this case here, I've now added the course ID to my output list outside of my aggregation, right? Do you ever think I guess what would happen here? Would this work or not? Raise your hand if you think it would work. Raise your hand if you think it wouldn't work. A few of you, why? Correct, yeah, so what he said is there's not a single course ID for all my tuples that I'm computing my average on, right? It's all the students are taking all the different classes. What course ID do I actually put as the output, right? It's the SQL standards say this is actually undefined and most systems you actually will get an error with this. We can actually test this, right? So in this, here we go, I have three terminal setup, right? So this is running a machine back in my office. I have three panels. The top one is Postgres, the bottom one or the middle one is my SQL and the bottom one is SQLite. So it's much easier for me to type from this machine here so I can log into that, right? Okay, so the query was we wanted to select, where was it? Let's get the average GPA of students enrolled in each course, right? So select average SGPA, E course ID from enrolled as E, student as S, where E student ID equals S student ID, right? So Postgres says you can't do this because as he said, the course ID, and I'm highlighting here so you can't see it. So the course ID is not defined, it's not part of the aggregation, right? So it doesn't know which course ID you actually want. So now if we go down and try this in my SQL. My SQL gave us an answer, right? But is that correct? No, right, because what course ID did it pick? It picked a random one, right? And then now we can try in SQLite, same thing. SQLite gave us a different course ID. So you see they both computed the correct average, but they chose a different course ID, right? And just because I know that my SQL guys watch these videos and complain, I will say that. So this is running my SQL, so my SQL traditionally allowed you to do lucy PC things like this, right? So this would be running at what they would call traditional mode. But you can set the SQL mode to be more strict. So now if I run that same query, it throws the same error that Postgres did. So by default, these MySQL 5.7 would now throw errors. In older versions, they didn't do that, right? Okay, so the way to fix this is to do GroupBy. So with GroupBy, what's gonna happen is now we're gonna find how we want to essentially bucket together the tuples in our output based on one attribute, and then now we can then compute the aggregation on the tuples in each bucket, right? So again, what I wanted to do was get the average GPA per course. In my GroupBy clause, I add the course ID. What'll happen is, when I first execute the query, this is essentially what I see after the join. But now with the GroupBy clause, I'll combine together the tuples based on the course ID, and for each of these, now I'll compute my aggregate, right? And now I can group them by course ID and that corresponds to the buckets I generated from before, right? So we have to use GroupBy if we want to extract information about the aggregations we're computing, right? And so just like before, anytime we have a attribute that we want to appear in our output select clause, it has to appear in the GroupBy. So here, the student name, we're trying to throw that in our output clause. It won't work because it has to be in our GroupBy clause. This one is sort of, again, nonsensical. It's technically correct SQL, it'll produce the result, but from a human standpoint, we know it doesn't mean anything. What does it mean to group by course ID and then the student name to compute the average GPA, right? The student's GPA is going to be their GPA. Okay, so now you may think, all right, well, maybe I want to do some additional filtering on my aggregations so that I don't look at, I don't produce every single result as part of the output. So you may think that you want to put it, use the aggregations in your wear clause, so here now I'm computing the average GPA, but I'm adding in my wear clause something that says filter out anything where the GPA is not greater than 3.9. This is only trying to look at the courses where the average GPA is greater than 3.9, right? So this doesn't work because we can't access anything in our aggregations in our wear clause because we don't have them yet in our wear clause, right? So again, the way to think about this is the wear clause is filtering the tuples as we go along, and after we do our filtering, then we can actually then compute our aggregation. So we can't compute, we can't use an aggregation to filter tuples because we haven't computed the aggregation yet, right? So the way to get around this is use the having clause. It's essentially like a wear clause, but now you can reference anything in your output list. So I've taken my average GPA and I've alisted it to AVG GPA, and I can reference that down here. And this will produce the answer that I want, right? Compute aggregation is just like before, and then it applies additional filtering on it. So now, not in this example, but there's other examples where you can do certain optimizations in the system based on what you know is in your having clause, right? And this is the great thing about declarative language. You know everything ahead of time, what the programmer, the person wants in their answer, so you can use hints about what you know is going to come later in the query plan to help you do certain refinements or reduce the amount of work you have to do as you run the query. So instead of using average GPA, say I wanted to do a count where the number of tuples are less than, for each group, less than some value. So as I'm going along, as I'm computing my aggregate, if I then recognize that, oh, my having clause says, filter anything less than 10 tuples, if I now hit tuple number 11 for a group, I know that I never need to count anything else for that tuples that come later because it'll never pass that having clause. So I can just throw away those tuples as I go along because it's wasted work, right? If you're writing this in a procedural language like Python, where you had to write these steps by yourself, the data system's not going to know what's going to come later on because it can't peek ahead, right? But in a declarative language like SQL, we can do that. So the next thing we want to talk about is how to handle strings. So this is a table that sort of summarizes the different variations of how strings are handled in different database systems. So the SQL standards best says that all strings, like so barchars, chars, text fields, they had to be case sensitive. And the way you declare them is with single quotes. And most systems follow this, right? The red herrings of the oddballs are both MySQL and SQLite. So in SQLite, those strings are case sensitive, but you can have both single and double quotes to demarc them. In MySQL, it's case insensitive, and you can use either single or double quotes. Now, my problem is that when I first started using data systems back in like 2000, I was using MySQL 3, and it's ingrained in my memory just to use double quotes because that's what we used back then. So I'm always had to correct myself every time I switch over to another data system to always go back and use single quotes. MySQL again has the mode you can say to follow the standard a little bit better, and they then enforce the single quote requirement. But I think by default, you don't get that, right? So in the SQL standards, say you want to take the name of Kanye and have it match with Kanye with mixed case. In the SQL standard, you have to uppercase, use the upper function to make this match work. In MySQL, you don't have to do that, right? Because all the strings are matching or case insensitive. Now, I showed this before how to do light clause, but thankfully it allows you to do the first string matching, right? So light is basically doing, trying to match some wild card in your string with another string. So for whatever reason, instead of using star, like most other things, they use the percent sign. So percent sign means any one or more strings, and then the underscore means exactly match one character, right? So say I want to get all the course IDs from the roll table that start with 15, so 15, 445, 721. I would use the percent sign like that and say if I want to get all the students where the login ends with at C and then some wild card, I would use a single underscore for that. So the SQL standard also defines a bunch of string functions. Again, the standard things you would expect, like substring, upper, lower, trim. These are all part of the SQL 92 standards and most systems will have this. But of course, there's gonna be a bunch of other stuff that are proprietary in each different database system. So the thing to point out, unlike aggregation functions with string functions, and mathematical functions and date functions, they can appear anywhere in your queries. So they can be in the select output list, they can be in your predicates. They don't always have to be in the select portion. They can be in your having clauses. So here we have the first query is gonna take the name and take a substring, just get the first five characters. And then the second one we did is we showed before of doing matching where the uppercase version of the student name begins with K, K, a, n. So again, string functions are for the most part, the basic ones will be pretty much standardized across all systems. Where things get weird as when you want to start doing concatenation. So the SQL standard says that you use two double bars to concatenate two strings together and most systems follow this. In my opinion, I would say Postgres and Oracle are probably followed the SQL standard the best set of all the systems. SQL Server and DB2 are probably next. SQL Lite then probably my SQL is the worst. So you want to say to concatenate the student name and add CS to it. Use double bar in the SQL standard in SQL Server from Microsoft. Use the plus sign in my SQL. They don't have the plus sign. They don't have the double bar. They only have this concat function where you basically define as the input parameters the things you want to concatenate together. They have something else that's kind of weird too where you can actually have, if you don't put anything between two string literals, that'll concatenate them together. So to show you what I mean by that, so here we have my SQL. So I can do something like this, an, dy, space, pavlo, right? Nobody else does this as far as I know. Let me double check that before I lie to you. Let's try it in, yeah, Postgres doesn't like it. SQL Server doesn't like it either, right? Only my SQL does that. All right, so now where things get really, really bad is when you have date and time functions, right? So a date basically just records the time stamp without the time. Time stamp is the time. Now they vary on the granularity of the time they may be tracking. Sometimes it's in seconds, sometimes it's in milliseconds or something even more fine-grained than that. But where things get wonky is how you actually start manipulating them and extracting them, extracting information from them, right? And so the syntax for all these different systems can vary wildly. So for this example, what we want to do is I want to show you how to do what I consider a seemingly easy function or easy operation. Get the number of days from today to since the beginning of the year. And we'll see how to try to do this on these three different systems, okay? So the first thing is that we want to get maybe what the current time is. Yeah, I just killed my Wi-Fi back, okay. All right, so the SQL standard defines a now function. Can everyone see that or no? In the back guys, can you see it? Okay, cool. So you call now and it gives you the current timestamp. So we can do that in MySQL and we can do this in SQL Lite. But you can't do it in SQL Lite, all right? So there's another way to do this. They have another function called a current timestamp, right? Except Postgres doesn't have it. But they have the keyword current timestamp. MySQL has the function and it has the keyword, all right? SQL Lite, who says it has the function? Raise your hand. So who says it has the keyword? All right, it's mixed, all right. They don't have the function, they have the keyword, all right? Again, simple thing like what's the current time? It varies, all right? So now again, what we want to do is we want to count the number days since the beginning of the year, all right? So as a building block, what we can try to do is maybe just get what the number days is from today. So in this query, what I'm doing is I'm taking a date. Listen, here's a string, right? So I'm taking the string for today's date and converting it into a date. And then I have the extract function, which will extract out the day field of the date, right? So as expected, you get 29, all right? We can try that in the other guys. They give you 29, they don't have it, all right? So, but now we can see that, all right, well, if I can cast a string for the date into a date, what happens if I just subtract them, all right? So I'll take today's date and subtract it from the date from the beginning of the year. And in Postgres, we get it, 240 days, right? My SQL, 728. I don't know what that means, right? Like, it's not like 240 times 2 or some, you know, some multiple of it, right? It's a number, right? It ran, but it's not what we want, right? So now let's try it in SQLite, SQLite gives us zero, okay? So it works in Postgres, so now we gotta figure out how to do this in my SQL. Well, so what we can do is we try to extract the day from, so maybe we take what we, that 728 we had before, right? And we extract the day from it, see what that means, right? It came back with 28, so that's not really what we want. So the way to do it actually is a bit complicated. What we're gonna do is we're going to convert the dates into Unix timestamp, which is the number of seconds since the Unix to epoch, which is like some January 1st, 1970, right? And now we're gonna have the number of seconds from the current date, the number of seconds is beginning of the day, we subtract them, right? And that gives us the number of seconds between now and the beginning of the year. And then we're gonna divide that by 60 seconds times 60 minutes times 24 days, right, we got 240, all right? Turns out though, after I did this the first time, there's actually an even easier way in my SQL. They have a simple function called date div, and that produces the same answer, all right? SQLite, let me try to make this a little bigger, there you go. In SQLite, they don't have date div, you can't subtract dates with each other. The way I figured out to do it was convert the current timestamp into the Julian calendar, which is the number of days since 437 BC, or 4,370 BC, and so you subtract the number of days since that time, the number of days beginning of the year, and you get roughly 240, right? And we can cast it to an int, and we get 240, right? So these are three super, super widely used database systems that all differ on some basic functionality of doing date and time, right? Actually, a quick show of hands, who think of these three ones, which one do you think is the most popular, the most widely deployed database system? Raise your hand if you think my SQL, about a quarter. Raise your hand if you think Postgres, even less. Raise your hand if you think SQLite, even less. The answer is SQLite. So a few years ago, we had Richard Hipp, the inventor of SQLite. So SQLite is amazing. It's written by three dudes, right? He approximates that it's been deployed on 10 billion devices. Everyone here who has a cell phone, that is in a flip phone or an old person phone, is running SQLite on it right now, right? A lot of desktop applications, like Photoshop and Illustrator, they run SQLite on the inside, right? SQLite is everywhere. He said that also too, every single AOL CD, if you know what that is, back in the early 2000s when the internet was sort of new in the US, there was a company called America Online, and they would mail everyone CDs for 10 hours free on the internet, right? Every CD that they mailed out, hundreds of millions of them, had SQLite running on it, right? SQLite is the most widely deployed data system that's everywhere. And here's the most crazy part, it's public domain. He gives it away for free, right? I mean, Postgres and MySQL are open source, but it's MySQL. Oracle owns it, Oracle owns the copyright. There's no copyright on SQLite, right? It's an amazing piece of software, okay? All right, so again, the main takeaway here was that simple things are hard to do, because there's no standard way to do things, even though there is a standard specification. All right, so now maybe what we want to do is instead of having the, you know, in my examples, I had the terminal open. Every time I ran a query, the output got printed back to me in my terminal. But maybe what you want to do is keep all the data you generate from a query, keep that inside the database system so you can use it in subsequent queries, right? This is way more efficient if your query is generating a lot of output. You don't have everything sent down to your laptop, and then push it back up to do more queries on it. So you can do output redirection to tell it to say, hey, don't print it out to me, write it out to this location. So one thing you can do is you can take the output of a query and you can write it into another table, right? And so in the SQL standard you can use into, and this actually will create the table for you on the fly. So whatever produces the output of the query, right? Again, this is declarative. The data system already knows what the schema is of the table, so it knows in this case here what the type is of this output is going to be. So it knows how to define a table that has that, that can handle those types, right? So into basically takes the output of select and writes it into a table. In my SQL, you have to use the create table, and then inside of it you define your select statement. You can also have it output data into an existing table, and for this you use insert into, which looks a lot like the create into, the create table from before, and then instead of having the value clause, you actually just have a select statement just to tell it, hey, get this data and write it into here. So then the important thing about this example versus the previous slide, this is about writing tables into, tuples into tables that already exist. So that means that whatever the type, the number of attributes and their types that are produced by the select statement, whatever table you're writing it into has to match that, right? If the select statement has four columns and your table you're writing into has three, the data system will throw an error and says, I can't write into that because the attributes don't match up. Now where things get weird is when you start having constraints that are on the table you're trying to write into and the select statement actually violates those constraints. So let's say that I have a primary key on my table that says I can't have any duplicate student IDs and then my insert query here tries to start inserting duplicates. Some data system will throw an error immediately as soon as it sees the duplicate and no tuples get written. Some of them will just keep going and ignore the ones that failed. Some of them will insert the ones that succeeded and just ignore the ones that failed or maybe just crash right away, right? So again the SQL standard says this is the syntax to use but how the systems actually implement it will vary widely. In addition to output redirection we can also do output control. To memorize said at the beginning, SQL is based on bag algebra, meaning it's unordered. But there's many times where you want the ordering in your output clause. And so to do this you add the order by clause. So you're just basically specifying how to sort the results that are being generated by this query. So in this example here I'm going to sort the tuple, sort the tuples from the enroll table based on their grade. And by default, even though I don't specify whether I want ascending or descending, the default in SQL is that you get ascending. You get output like this. But I also can add additional attributes to my order by clause to do more complicated things. So in this case here I want to do my order by by grade in descending order and then after that I sort them based on the student ID in ascending order, right? And I would get output like this. Now the important thing to point out here also too is that unlike in the group by clause where any attribute that I wanted my output list had to appear in the group by clause in order by you don't have that restriction, right? It actually goes both ways. In this case here I'm sorting by the grade but the grade isn't part of the output, right? It doesn't matter, right? It knows how to find the data that it needs as it's processing the query and do whatever sorting operation that you want to do on it. I can also do more complicated things. I can put any arbitrary expression in my order by clause as well. So I can do order by one plus one, right? It knows how to handle that. Another common thing you want to do is limit the number of tuples that are produced in your output, right? And of course there's a limit clause for this. You basically specify, it says, for all the results of my query only provide me back some number of them, like 10, right? And this is very common for things like, say search results, you'll see it show just 10 and then you click the next button to go see the next 10. This is being controlled by limit and actually offset. So limit just says, limit the number of tuples. Offset is going to tell you that at what offset of the number of the tuples you're producing as your output should you skip before you start figuring out how many you should limit. Again, if you think of a webpage to show you 10 results, you click next and see the next 10, they're using a limit with an offset to make that work. Now, because it's unsorted again, there's no guarantee that when you click next, if you're using an offset, you're going to see, you're definitely going to see different tuples because that's another invocation of the query and the results may be different in different order of the second time. So in this case here you would combine that with an order by clause so that you're guaranteed to go see the first 10 followed by the second 10 and so forth, right? And there's optimizations you can apply whether or not you have an order by clause if you have a limit clause. So if I don't have an order by clause and I have a limit, I know that as soon as I see 10 tuples, I'm done. I don't need to go look at anything else. If you have an order by, then you have to see everything ahead of time to sort it, then you can apply your limit in your offset. All right, so again, I consider that somewhat basic SQL. In previous years, I actually skipped all this but I think it's important maybe to go over it a little bit because you have to understand this for the homework, but now we want to talk about the more complicated things. So the first thing we want to talk about is nested queries. So the way to do that nested query is basically allowing you to specify queries inside of queries, right? And you can take the output of one query and use that as the input of another query, right? So a simple example like this. I'm doing a select. I want to get all the names of the students that are enrolled in these one course. So I have my outer query is defined based on the student table. And then inside of that, I have an inner query that's going to get the student IDs from the enrolled table, right? So we could write this as a join. This is sort of another way to actually do this. And in actuality, when it comes time to actually implement this inside the system, most query optimizers will try to rewrite this as a join, right? Because the worst way to actually execute this is to essentially have two for loops where you loop over every single tuple in the student table. So for every single tuple, you then re-invoke the same query over and over and over again. My SQL used to do this. The more primitive database systems used to do this. But the right way to do this is actually just rewrite this as a join, right? So you can sort of think of this inner query as like a function that can produce some set of tuples as your output, and then you can apply whatever predicate you want on that in the outer query. So we'll walk through a bunch of examples of this and see how it works. So the first example you want to do is maybe you want to get the names of all the students that exist that are enrolled in 15.445. So the way to construct a nested query is I think it's always important to start maybe with the outer query and think about what's the actual answer you want to produce. What are the actual attributes you want to see? And then you worry about how you're actually going to filter them and get what you want. So the outer query, we know that we want the name from the student table and then where we get those values we'll figure it out. So in the inner query we can write it in as English as the student ID and the set of people that take 15.445, right? So for that part we know how to write that query pretty easily, right? We just filter out all the tuples from the enroll table where the course ID equals 15.445. So now the question is how do we combine them with the outer query with the inner query. Again we can rewrite this very easily but for our purposes here we want to see how to do it as a nested function. So for this we can use the in operator and we do a matching on the student student ID. So the way to now read this is that for every single student in the student table so this first student ID is matching up with the student table I want to see whether there's a match of that student ID in the set of all student IDs that are in the enroll table that take the course 15.445. So we execute the inner query we produce the set of all student IDs and then for every single tuple in the outer query we check to see whether it exists in that set. Right? This shows the stupid way to execute this is for every single tuple in the outer query we execute the inner query over and over again. Right? That's stupid because we only need to produce once and then we can reuse it for every single tuple in the outer query. So I showed how to in that example I use in. There's other operators you can use. So all basically says that every single tuple that's in my inner query must satisfy my predicate. Any means at least one of them needs to. So in is essentially the same thing as equals any. Is there any tuple that it equals what my predicate tuple attribute is checking and then exists as at least one rows return. I don't care what it matches I just care whether a tuple got produced as a result. So we rewrite our example before getting all the students in 15.445 like this instead of using in we say equals any so now you read this as the student ID from the student table check to see whether it equals any tuple that exists in the set of student IDs that are produced by from the enroll table. Right now the inner the nested queries don't have to only appear in the where clause they actually can appear anywhere. So I can rewrite the same query like this where now I have my nested query in the output of the select statement. So this is a good example this is essentially reversing what I did before in terms of what tables we're going to access. So now the way to read this is that for every single tuple in the enroll table where the course ID equals 15.445 I'm going to then do a matchup in the student table where the student IDs are the same. There's essentially doing a join inside my output of my select statement because now this student ID is being referenced here from the student table and that student ID is being referenced in there. This is another good example again essentially reversing the order of how I process my tables and they produce the exact same result they may have different performance characteristics based on what my data actually looks like. It may be the case that this is actually faster to go through the enroll table first instead of the student table and therefore we can rewrite it if you wanted to choose one versus the other and a good optimizer could do this for you. Okay? Alright let's look at something even more complicated now. Say we want to find the student record with the highest ID that's enrolled in at least one course. It seems pretty simple but let's think about how you actually do this. So the first approximation would be something like this. Select the max student ID along with the student name from the join enrolled and student table where the student ID in the enroll table equals the student ID in the student table. Will this work? Why? Why not? There's an aggregation function without a group I and we're referencing a column that's not in the aggregation. Alright? So again the SQL standard says that this shouldn't work. Let's find out. Alright so again we have I gotta just connect. Postgres at the top. So we run our query here so the standard it says exactly as they said. So you have student name appears but it's not a part of a group I so you can't use it. Alright? In MySQL, same thing it produces that error but if we now run it in what they call traditional mode so older versions of MySQL would do this now we run this query and we get an answer. Right? We have the max student ID and then we say that the name of that student alright? Let's try it in SQLite wrong one. It produces an answer it also produces 53.688 as the max student ID but it says that that student belongs to Justin Bieber right? And actually I don't know what the right answer is. Let's see here. Yeah so Justin Bieber is the right one whereas MySQL make sure I'm giving you the same data Tupac is actually the lowest so it got it completely wrong. Right? So this doesn't work in the SQL standard it runs in SQLite and in MySQL if we turn off that strict mode thing. So the way we can do this is a nested function is again let's build it constructively so we know that we want the student ID and the name is the output but it's the where clause that we have to figure out and this one basically says that we want to get a matching tuple that is greater than every other student ID that's in our table. Right? So we know the interquery should be basically the student ID from the enrol table we can be more sophisticated maybe put it distinct there but it's all the same but now we need to figure out how to match the student ID from the student and the outer query the student ID from the interquery and for this we can use greater than equal to all. It has to be greater than equal to because we want to make sure that we match ourselves and the student that actually is that does actually have the highest one. Right? We can rewrite this in other ways as expected, right? We can rewrite it with the in clause like that and actually compute the max student ID and the interquery right now basically this is saying match the student ID that is the max student ID produced from the enrol table. Right? We can go even further we can rewrite the inner guy to be like this. We can rank them into sending order and then do a limit one. So some systems will actually just rewrite this one to be a max anyway just scan everything, keep track of which one is the max and then produce as the final output don't even bother doing sorting. Right? Again same query rewritten in different ways. One more example we're going to find all the courses that have no students enrolled in them so we take our outer query as a select on the courses and then we know our inner query basically says we're going to find where they have no tuples in the enrol table. So for this one we want to use not exist basically says we don't want to match anything in our inner query and all we need to do that for the inner query is just just grab every single tuple and here now we're actually matching up the course ID and the inner query with the course ID from the outer query. So you can only do this in one direction in the outer query. If you're in the outer query you can't reference the inner query unless you pipe it out or redirect it to a table. So any questions about nested queries? Again they're very powerful. Many times you cannot write what you want to write in a single query without using nested queries. Yes? Can you think of an inner query as a nested for loop? Her question is can you think of an inner query as a nested for loop? Yes, but no. So we'll see for loop has a notion of like ordering. It's really a set. So all those operators like in exist, any those are just trying to say for the entire set of tuples that are in the inner query you can't check to see where any of them matches. You're not really iterating over every single one. You can think of the outer query as a for loop because you're iterating every single tuple but then the set portion, the evaluation of the inner query is always at a bag or set level. Makes sense? Okay, window functions. So window functions are I would say they're new but they're like 15, 10 years old now so they're not brand new. I don't support them but the major ones do. So a window function is sort of like an aggregation where you're going to compute some function on tuples but rather than doing it on a subset of the tuples and collapsing them down into a single result you sort of do this in a incremental fashion or in a moving fashion and then you still produce the tuple as the output but along with the value that it produced from the window function. Right? So the basic syntax is like this you have the function name and then you have an over clause. So the function name will be our aggregation functions and other special window functions we have which I'll show in the next slide and then the over clause defines how we actually want to slice up the data. This is sort of like combining together the aggregation and the group by but in a single clause. So the function is like the aggregation function like the group by. So the aggregation function will be all the things in the SQL standard that we talked about before min, max, average, count, sum the special window functions can do things like introduce a row number to the current row so to keep track of the tuples as it's being output and then marks them with what order they're coming out and then rank would be the order of the position of a tuple if we're doing sorting. Right? So say what I want to do is I want to go do a select over the enroll table and I want to produce all the tuples as my output but I want to just mark them with the row number of the output. So I have my row number function and then for my over clause I just leave that blank. And then what I end up with is a result that looks like this. I have all the data that I have before but now I have this special column here row num that's just again the order of the tuple that it was produced in the output. So sort of like I compute my entire query and then I do my window function to go over the results and then I add in whatever the computation that I want to generate. So aggregations we can combine things together or group them together this is what the over key word does for us. So for this we would use partition by to specify how we want to group things. So in this query here I'm doing the same thing as before where I want to combine them together based on the I want to generate the row number for how they produce in their output but then I'm going to group them together based on the course ID. So my output would look like this. And so now again it looks just like the aggregation where now I'm grouped together based on the order. Pretty simple. Let's look at something more complicated. So I can also order by these I can order them by partitioning them and this is essentially defining how we want to do our ordering to produce our output. So we do this ordering then we compute whatever the window function is that we want to compute on them. So in this case here this is essentially going to do the same thing I did in the previous slide where it's going to more or less group them based on the course ID but this is doing this by using ordering rather than partitioning. So let's say we want to compute we want to find the student with the highest grade of students. So what we have here is now we have nested query. So in the outer query we're just going to well in the outer query in the inner query we're going to produce some table result but now this is a good example where we're having a nested query inside of the from clause. So I have from here and then I'm taking the output of this inner query and I'm going to map that into a sort of a temporary virtual table called ranking that only exists for this tuple. So instead of writing it to a table that exists or writing it to a temporary table I don't want to say in memory because it may actually go out the disk but a temporary tuple table for this query that then gets discarded when the query is over. So in the inner query what we're going to do is we're going to go over the enrolled table and for every single tuple we're going to split them up based on the course ID that's the partition clause and then we're going to sort them by their grade in ascending order and then what we produce the window function we invoke is rank which is the order that they exist in the sort of ranking and then we take that output write it out to the ranking table and then in the outer query we can then do additional filtering based on their rank so this is only going to produce the tuples that are ranked first so this is finding the students with the highest grade for each course so the thing to point out here is I'm referencing in my outer query this rank attribute here which actually doesn't really exist in the database so it only exists within this query I'm seeing a lot of blank faces so let's maybe pop open the database and see what we can do. So for this Postgres is the only one that actually supports this my sql8 supports window functions and the newer version of sql8 supports window functions but for simplicity we'll just do in Postgres so this is the query that we have before let's break it up and just do the inner query first and that way it'll be sort of easier to understand so we'll just go back and run this so again, what this is going to do is going to go over the Enroll table it's going to partition each record based on the course ID so you see that here where 5445 appears first then 721 then 826 and then now within each of these partitions it's then going to sort the tuples based on the grade in ascending order in ascending order so B comes for C A comes for C and then there's nobody else taking 826 so the B is by itself so that's how we got that output in the form that we defined so now the rank function is going to be computed based on where each tuple appears in the sorted output list so in this case here this first tuple within this partition this guy came first so he gets rank one this guy came second so he gets rank two right if I change this now to be in descending order it still does the same thing but I could do an order by again and now I want to go in descending order this is doing the order by this did the order by after I did my window function so it doesn't really make sense but the rank is different than the run number because the run number says where do you appear in the output the rank is where do you appear in the sorted ordering okay yes say it again sorry he says if you have a rank without an order by we'll just return a random order well you'll still get it ordered by what you'll happen is you'll still get them probably split by partitions doesn't have to be because it's unsorted but that's what you'll get but now here yeah so there in this case here there is no sort ordering anymore so everyone has the rank of one right but it still grouped them up together based on what I'd find as my partition and that's more of an artifact of how the database system actually executed the query rather than the semantics of the query itself his question is what does the rank function do the rank is the rank of the rank function produces the rank of the sort order right so if I change my window function to be row number I have my partition but it's one to one to one right because within each partition that's what I appear so I remove actually the partition by then it should go through one through five right like that if I change this now to rank everything should be one one one right because there is no sort order everyone's first yes no question is the point is to do rank without an order by yes because there's no sort order there's no ranking alright cool alright so the last thing we want to talk about is ctes complex table expressions so this is probably again my this is I find this very interesting this is probably the one more complicated things that you can do in SQL and it's going to look a lot like nested queries in that you're taking the output of a query using as the input for another query but we'll see in a second what you can do with ctes that you can't do in nested queries yes so her question is in the case of this particular query or just in general so her question is is there a particular advantage of using rank with a window function over using a group by so again if you use a group by that's not going to generate the same you're not going to get the tuples as part of the output anymore right so if I go back um so say I do this one here right row number by partition by so if I go back and I say I want to compute now like the the max grade um from enrolled group by course ID like I no longer see what the original tuples were because they get collapsed together in the aggregate function the window function still produces all the tuples that were in your output but I can now see them as I can still see them as the output so if I go back here and if I this may not work but let's see so maybe I try um max grade partition by course ID yeah so here that didn't quite work either what's that it's picking oh you know what it might be now they didn't like it there's the rank um oh yeah because I'm an idiot sorry yes it did work what I wanted was min yeah so the highest grade 15,445 is was a B right so I still see my original tuples I still see one student got a C one student got a B but the max for that group was B so I didn't lose the original tuples now you and your application code have to derive meanings from this output to get you know get the one feel that you want but this allows you to again still keep the tuples uh in a way you can't with a group by that's a good question yes this question is can you use row number after you use an order by so something like this right and then maybe order by oh yeah there yeah sorry that's why there's the as rank so let's do this row num oh I'm an idiot sorry no one can see this okay you yeah sorry my screen shows one thing this shows another and that shows this okay we're good right now if we remove the partition by and now we can maybe put descending right so now we'll get five four three two one right actually replace the replace the rank with the row number what do you mean let's do this online because I want to get to CTEs because you need it for the homework before we keep going okay we can try it afterwards okay alright so CTEs the way basically it's going to work is that we introduced this with clause this with clause is like a query that's going to actually before your regular query so we have with and we're going to define the name of our CTE and then we have our as clause and whatever is inside of the parentheses after the as the output of that query would then get sort of mapped to the name of our CTE and then in the select query that comes below it can reference it just as if it was a an existing table right so in this case here what will happen is I generate a CTE that invokes the query select one right selects don't have to have a from clause this just outputs this produces a single tuple with a single attribute with a value one and then this other query at the bottom to select star on it and there's outputs that single tuple so I can start doing more complicated things output columns from the queries inside of the CTE to given names and then I can reference them by that name down below in my statement at the bottom so this CTE produces a single tuple with two attributes with the values one and two and then these get mapped to the name column one column two and then in the select statement below I can just add them together pretty straightforward so let's go back and try to do that example we did before where we want to find the name of the student with the highest student ID that's enrolled in least one course so for this I have my CTE and inside of that I'll do my max on the student ID on the enroll table again that produces one tuple with that max student ID and then down below I can do a join where I'm going to reference the CTE that was generated above me and I just combined together the max ID that came out of this which is defined here with my student ID and then that produces that produces the one tuple that I want right again this is another example of how to do the same query at a high level semantically the same query just written in different ways so now you may be saying how is this any different than a nested query the answer is you can do recursion in a common table expression that you can't do in a nested query so bear with me here but this query is going to produce a sequence of numbers from one to ten like a for loop right so now we have with recursive we have the recursive keyword and then inside of our CTE definition we're going to do a union between a single query that produces the value one a single tuple with a single attribute one and then we're going to then union that with another query that actually references ourselves so in here we're going to vote on our own CTE and then we're going to take whatever the counter is that's produced by this and add one to it and produce that as the output right and we keep running this until our where clause actually gets tripped up where we try to go above ten and then now we stop producing tuples so now we have the answer that we want that we can then reference below in our CTE yes it's basically a union right it's union with duplicates sorry union without duplicates union with duplicates union without all strips it keeps duplicates union without all removes them might be the way we can test that though okay so let's do this in Postgres so this is the one query this is the query that I showed you right and we produce the list of tuples with a single value one to ten let me try to kill this is that better but be careful about CTEs with a cursive because again because you can essentially have infinite loops so in this query here I no longer have that counter where it's greater than ten produce check any tuples where they're less than ten this will run actually forever but what I did first was I defined a total Postgres to again I'm highlighting here you can't see it a total Postgres that any query that runs longer than ten seconds automatically kill it so when I ran this query now it no longer has the where clause to prevent it from going looking at tuples beyond ten and so essentially it runs forever Postgres recognizes that we're stuck in a query that's taking too long and it goes ahead and automatically kills it yes so so select one just does this right so now if I do select one union all select one right I get like that so if I think I remove the all yeah you get it removes duplicates right so union all will give me everything so essentially what's happening now is I'm calling my CTE say whatever tuples you have in the current value take the output and then add one to it here the plus one there it's invoking on on our source CTE called source for every single tuple in there take its output add one to it so in order to get that tuple it has to go then invoke the CTE which then produces back one yeah yeah it won't let you invoke it because it knows you're trying to reference yourself so this question is what happens if you try to if you remove the recursive clause again it's sequel it's declarative we know everything you're trying to do so you're trying to access a CTE table that's defined by yourself and it doesn't let you do that the recursive let you do that so let's try it now also with um let's try it by removing the union all and see what happens the all part yeah it still produces the correct answer so you don't need the union all right so in this example here again I'm synthetically generating a table that has a single value of one and then I invoke a query to get that tuple uh and add one to it but then it generates a new tuple which I can then invoke again and add one to that and I keep doing that until I don't produce any more matches right because the recursion ends because I've reached my limit of ten yes so this question is why why are there not duplicates because the first time you invoke it you had one the second time you invoke it you would have one and two um let's let's do this offline let's walk through it because again we're out of time right yes quick this question is do people use recursive CTEs is that very common yes absolutely yes so um this actually finishes up very nicely so the the second point I'm trying to make here is that in an ideal scenario you always want to have to be able to compute an entire query without having to bring any data locally or without having to go back and forth right so it also provides by running as a single query although the complexity it makes it harder to do query optimization but if you tell the data system everything you're going to want to do with this sort of piece of data then it can do a global optimization on that so by having a CTE you're not having to have additional logic be somewhere else outside the data system or go back and forth you're saying here's everything I want to do so CTEs are actually very common especially in newer applications recursive CTEs maybe less so but definitely CTEs but it's another way to write a nested query all right the other major thing that I want to point out is that again the language is from 1973 or 1974 but it's still widely commonly used it's being updated all the time and learning SQL is important because you're going to see this again throughout your entire life pretty much every single system database system that with some minor exceptions is going to support some variant of SQL CTEs