 Before we begin, any questions about the logistics of the course or anything that we discussed last class? Again, I'll talk about homework one at the end of the class, and then it'll be posted immediately after the lecture is over so everyone can start at the same time, okay? All right, so today's lecture is on advanced SQL. And by advanced, I mean going beyond what you may already know about basic SQL, right? It's 2018, SQL was invented in 1973. I imagine most of you have seen some SQL throughout your life, so I don't feel it's necessary to teach you the basics of it, so I want to spend time actually talking about the more complicated or interesting things you can do with SQL, right? So just to pick up what we left last class, we were discussing the relational model and relational algebra, and we sort of mentioned that with relational algebra, the goal was sort of at a high level to describe what the answer we wanted that the data system would compute rather than the exact steps of actually how to do it. So the way to think about this is say we want to sort our data. To tell the data system exactly what to do, we have to provide it with the quick sort or bubble sort algorithm. But with a high level language or declarative language, we just say, hey, we want you to sort this. We don't care how you actually do it, but this is the answer that we want. And we'll see this throughout today's lecture and then going forward throughout later in the semester. This is one of the advantages of using something like SQL to our declarative languages. We don't have to tell the data system exactly how to do things. It can figure it out on its own and that frees it up to figure out what the most optimal way it is to actually execute the query that you want to execute based on the data that you have and the hardware that's available to you. So we'll see this later on when we talk about the query optimizer, but the query optimizer is this complicated piece of machinery inside a database system that's going to take our SQL query and convert it to the most efficient plan. And we'll cover how they actually do this later on. But I will say is that if you get involved in doing research on query optimization or just help working on query optimizers, you can get a job yesterday. Because this is the one thing that all my friends at Davis Companies email me about open over again, do I have any students to do query optimization? We've had some and they all go off and do awesome jobs. This is the one thing that everybody wants because it's really hard to do. So that's not the focus here, right? Just in the back of your mind we'll see examples as we go along where a query optimizer could try out different things to try to come up with different plans. We'll discuss how you actually do query optimization later on. So the history of SQL, as I said, goes back into the early 1970s. So you may notice that I say SQL, some people say SQL, right? The part of the reason I say SQL is because not that I was alive in the 1970s, but the original name of the language was actually spelled out S-E-Q-U-E-L, SQL, right? And this was invented by IBM as part of the system R project, right? So it stood for the structured English query language. So if you remember from the Ted Codd paper, I said that Ted Codd was a mathematician, right? He devised the relational algebra and relational model, but he didn't actually define the programming language you would use to write queries on it, right? You can't write queries using relational algebra, right? It's sort of, it's very difficult to write that in the keyboard to do that. He did later define or came up with his own query language called Alpha, but that was much later in the 1970s. So back then when people said, hey, there's this relational model idea, we should actually try to build a system to do this. People had to come up with their own language that could implement relational calculus or relational algebra. So at IBM, they came up with SQL. This was part of the system R project, which was sort of one of the first relational database systems that people were trying to build in the 1970s. The other major one was Ingress that came out of Berkeley. So you've ever heard of Postgres, right? Postgres was invented by the same guy that did Ingress. So it's called Postgres as in Post-Ingress, the thing that came after Ingress. So the Ingress guys had their own language called Quell, and this was developed by one of my advisors, Mike Stonebreaker. He claims that it was much better than SQL, and the IBM guys didn't know what they were doing. Of course, most of you have probably never heard of Quell, right? So it didn't actually win IBM 1. So back then, again, it was spelled out S-E-Q-E-L. IBM later got sued for, I think, copyright infringement or trademark infringement. There was some guy in England that had the term SQL spelled out in the full English word for his programming language. So it is shortened it to be S-Q-E-L. So what happened was, the reason why we use SQL today, because IBM is not, back then IBM isn't what it is now, right? Everyone thinks the big tech companies like Microsoft, Amazon, and Google. Back in the 1970s and 1980s, IBM was the juggernaut. So essentially, whatever IBM did or said this is the way we're gonna do it, that ended up being the standard. So when IBM first released their first commercial relational database system, DB2, so they never actually released system R, it's just sort of a research prototype, but then they finally made DB2, released that. DB2 supported SQL, so that essentially became the standard. And the reason why Oracle sort of took off and got as big as it is today is they were sort of copying what IBM was doing in the 1970s in more ways than one, we could talk about that later. And they had SQL, so when IBM came out with DB2 and they had SQL, Oracle was at the right place at the right time said we support SQL too. So it became an anti-standard in 1986 and became an international standard in 1987, and now the short version just means the structured query language. So SQL, even though it's from the 1970s, it's not a dead language. It's not sort of static. It's sort of like C++, they keep releasing new specifications every so often. It's the same thing in SQL. Every so often there's a new specification where they add in new features and new functionalities to the basic language. So the latest standard is defined in SQL 2016, and you can see over the years, as they add new versions, they add new features. So in 2016 they added JSON polymorphic tables. They add XML stuff in 2003, 1999 added regex and triggers. Typically what happens is there's a standards body that the members are all from the major database companies. And the major database companies come up with their own proprietary features and extensions and then they go in the standards body and push to try to get their version of certain functionality as part of the standard, right? So this is, although there is a SQL standard, nobody actually follows it to the T, right? Because everyone sort of has their own proprietary things that got invented before the standard came out and said this is how you should do things. So if you're going to claim that your database system supports SQL, the bare minimum you need to have is actually what is defined in the SQL 92 standard. So this is what the basic SQL that we know about today, select, insert, update, delete, create tables, transactions, things like that. All that's defined in SQL 92. So again, if someone says their database system supports SQL, chances are they really mean this. And then the more advanced databases, both in open source and the commercial ones, they have a bunch of more features from the newer standards. And then there's this great website here, it's a bit dated now, but it's interesting to go look at. There's some random dude who basically looked at sort of the top four, top five database systems and just looked to see how they differ on various SQL features and their various SQL functionalities. And we'll see this as we go through today. There'll be some examples where the standard says one thing, different database systems do other things. It's usually my SQL and even though there's a standard, nobody actually follows it exactly. There's no database system that I'm aware of that would claim that they're certified for SQL 2016. They have bits and pieces of things. So SQL itself technically is not a single language, it's sort of a collection of things. And in particular, it's a collection of a DML, DDL, and DCL commands. So the data manipulation language would be the commands of like insert, update, delete, selects, the things that actually manipulate the data that you're gonna store in your database. The DDL is the way you create tables or define schemas to actually store things. And then the DCL is the way you sort of do security authorization to grant who is allowed to read what data. So it's a bunch of other things like how to define views, how to define integrity constraints, referential constraints, as well as transactions. These are all part of the umbrella of SQL, right? And within that there's these different categories of commands. So the one important thing that I wanna point out here, and we'll see this throughout today's lecture, is unlike in relational algebra, which is based on set theory or sets, SQL is actually based on bag algebra. So the way to sort of think about this, you can have lists, sets, or bags. So a list can have duplicates, but there's a defined order. So if I push something to my list, that's its position in that list. A set is unordered, meaning the elements don't have a position, but you can't have duplicates. If I try to insert the same thing into a set, it just gets over, the old one gets overwritten. A bag has neither a set position or ordering, but it also allows for duplicates, right? And we'll see why we have to do this as we go along. Because if we wanna actually define order on our elements, or if you wanna make sure that we don't have duplicates, essentially the database system has to do extra work to provide that for you. And so the idea is that only if you explicitly ask the database system to provide you ordering or provide you to remove duplicates, it won't actually do it. And this will actually make things be more efficient. All right, so the outline for today is we're gonna cover aggregations, group buys, a bunch of operations on strings, dates, and times. Then we'll have to do output control. And then the more complicated things will be nested queries, comment table expressions, and window functions. So for homework one, you'll need to use all of these except for the window functions, right? Because you're gonna be doing homework one on SQLite, and only actually the latest version of SQLite as of last week, just added support for window functions. But everything else, SQLite should be able to support. Okay, all right, so for this, we're gonna use a sample database, comprised of three tables. So it's a mock university. So we're gonna have a student table where students have student IDs, names, logs, and GPAs. We'll have a course table with course IDs and name. And then we'll have an enrolled table where we have a foreign key reference from the student table and the course table. And along with the grade that the student got in the class, all right? We'll use this as a running example as we go along. All right, so the first thing we talk about are aggregations, right? And these are pretty simple to understand. It's basically a function that you define in the output list of your select statement that's gonna take as input multiple tuples, a set of tuples, and it's gonna compute some kind of aggregation on top of that and produce a single result, right? So the SQL 92 standard defines average min, max, sum, and count. And again, think of this as like, in case of count, you're gonna take a bag of tuples as the input, and you're gonna count the number of tuples there are. And you're gonna produce a single output that has that count, right? So again, so this is what's in the basic standard. The later version of the standard and in other database systems, they'll have other things like median mode standard deviation, right? They'll have different kind of aggregates and some of them actually allow you to find your own aggregations as well. So let's look at an example like this, right? So say we want to just count the number of students in the student table, where the login ends with at CS. I'll cover what like is means later on, but essentially it's just looking for a wild card here, right? So the first thing, the most important thing to remember about aggregations is that the aggregation can only appear in the output list of the select statement, right? So I have it here in this, the same, this is what I want to produce at the output. I can't have it in these other parts here, right? Cuz it doesn't really make sense, right? Cuz you're computing aggregation after you've sort of applied a filter to figure out what tuples actually match your wear clause. But the thing to point out here is that in the case of count, again, we just wanna count the number of tuples. The login field here doesn't actually mean anything, right? Cuz we're just counting the number of tuples. It doesn't matter what the login actually is at the point after we do the filtering. So we can rewrite this just to have a star, right? The star is a special keyword in SQL that basically says all attributes for the tuple. We can go even further and say, we can actually replace the star with a one, right? Count the number of tuples by just adding one every single time, right? So this is a good example where we have three different queries that all produce, are semantically the same, they produce the same result. But the data system could choose different flavors or different variations of this in order to derive that answer. And some of them may have different performance differences. This one's pretty simple. So most data systems would be smart to realize, I don't need to copy around the login here. I can just replace that with a one, right? Most of them will do that. But for more complicated things, it may not always work. We actually can combine multiple aggregates together in a single query. So say for this one, we want to get the number of students and or average GPA where the login ends with at CS, right? So now you see I've combined average GPA and count in my output list for my select statement. And then it'll produce my result like that, right? Pretty straightforward. I can also add the distinct keyword to tell it to only count the distinct elements or values of attributes for my tuples. So this is saying count the number of students with unique logins from the student table where the login ends with at CS, right? And so I have the distinct side of account. Now this one's sort of nonsensical too in some ways because presumably no two students can have the same login account, otherwise you'd have problems. But in other cases, you can apply the same kind of thing for other scenarios and it would work the way you want it to work, right? And in this case here, we produce the same result. So the one thing that now you want to maybe want to try to start doing is now that I'm starting doing aggregations, I want to get additional information about my data outside of just what I'm computing in my aggregate. Right? So say that I want to get the average GPA of the students that are enrolled in the course and I want to know what that course ID was, right? So in this case here, I've now added the course ID to my output list outside of my aggregation, right? Do you ever think I guess what would happen here? Would this work or not? Raise your hand if you think it would work. Raise your hand if you think it wouldn't work. A few of you. Why? It's single CIDs that hold. Correct, yeah. So what he said is there's not a single course ID for all my tuples that I'm computing my average on, right? It's all the students are taking all the different classes. What course ID do we actually put as the output, right? It's the SQL standard to say this is actually undefined and most systems you actually will get an error with this. We can actually test this, right? So in this, here we go, I have three terminal set up, right? So this is running a machine back in my office. I have three panels. The top one is Postgres. The bottom one or the middle one is my SQL and the bottom one is SQLite. So it's much easier for me to type from this machine here. So I can log into that, right? Okay, so the query was we wanted to select, where was it? Let's get the average GPA of students enrolled in each course, right? So select average SGPA, e-course ID from enrolled as e, student as s, where e-student ID equals s-student ID, right? So Postgres says you can't do this because as he said, the course ID, and I'm highlighting here, so you can't see it. So the course ID is not defined, it's not part of the aggregation, right? So it doesn't know which course ID you actually want. So now if we go down and try this in my SQL, my SQL gave us an answer, right? But is that correct? No, right? Because what course ID did it pick? It picked a random one, right? And then now we can try in SQLite, same thing. SQLite gave us a different course ID. So you see they both computed the correct average, but they chose a different course ID, right? And just because I know that my SQL guys watch these videos and complain. I will say that, so this is running my SQL, so my SQL traditionally allowed you to do lucy PC things like this, right? So this would be running at what they would call traditional mode. But you can set the SQL mode to be more strict. So now if I run that same query, it throws the same error that Postgres did. So by default, these MySQL 5.7 would now throw errors. In older versions, they didn't do that, right? Okay, so the way to fix this is to do group by. So with group by what's gonna happen is, now we're gonna define how we wanna essentially bucket together the tuples in our output based on one attribute. And then now we can then compute the aggregation on the tuples in each bucket, right? So again, what I wanted to do was get the average GPA per course. If I add now in my group by clause, I add the course ID. What'll happen is, when I first execute the query, this is essentially what I see after the join. But now with the group by clause, I'll combine together the tuples based on the course ID and for each of these, now I'll compute my aggregate, right? And now I can group them by course ID and that corresponds to the buckets I generated from before, right? So we have to use group by if we wanna extract information about the aggregations we're computing, right? And so just like before, anytime we have a attribute that we wanna appear in our output select clause, it has to appear in the group by. So here, the student name, we're trying to throw that in our output clause. It won't work because it has to be in our group by clause. This one is sort of, again, nonsensical. It's technically correct SQL, it'll produce the result. But from a human standpoint, we know it doesn't mean anything. What does it mean to group by course ID and then the student name to compute the average GPA, right? The student's GPA is gonna be their GPA. Okay, so now you may think, all right, well, maybe I wanna do some additional filtering on my aggregations so that I don't produce every single result as part of the output. So you may think that you wanna put it, use the aggregations in your where clause. So here now I'm computing average GPA, but I'm adding in my where clause something that says filter out anything where the GPA is not greater than 3.9. This is only trying to look at the courses where the average GPA is greater than 3.9. Right, so this doesn't work because we can't access anything in our aggregations in our where clause because we don't have them yet in our where clause, right? So again, the way to think about this is the where clause is filtering tuples as we go along. And after we do our filtering, then we can actually then compute our aggregation. So we can't compute, we can't use an aggregation to filter tuples because we haven't computed the aggregation yet, right? So the way to get around this is use the having clause. It's essentially like a where clause, but now you can reference anything in your output list. So I've taken my average GPA and I've alized it to AVG GPA and I can reference that down here. And this will produce the answer that I want, right? Compute the aggregation just like before and then it applies additional filtering on it. So now, not in this example, but there's other examples where you can do certain optimizations in the system based on what you know is in your having clause. Again, this is the great thing about declarative language. You know everything ahead of time, what the programmer, the person wants in their answer. So you can use hints about what you know is going to come later in the query plan to help you do certain refinements or reduce the amount of work you have to do as you run the query. So instead of using average GPA, say I wanted to do account where the number of tuples are less than some, for each group, less than some value. So as I'm going along, as I'm computing my aggregate, if I then recognize that, my having clause says filter anything less than 10 tuples, if I now hit tuple number 11 for a group, I know that I never need to count anything else for that tuples that come later, because it'll never pass that having clause. So I can just throw away this tuples as I go along because it's wasted work. If you're writing this in a procedural language like Python, where you had to write these steps by yourself, the data system's not going to know what's going to come later on, because it can't peek ahead. But in a declarative language like SQL, we can do that. So the next thing we want to talk about is how to handle strings. So this is a table that sort of summarizes the different variations of how strings are handled in different database systems. So the SQL standards best says that all strings like so barchars, chars, text fields, they had to be case sensitive. And the way you declare them is with single quotes. And most systems follow this, right? The red herrings or the odd balls are both MySQL and SQLite. So in SQLite, the strings are case sensitive, but you can have both single and double quotes to demarc them. In MySQL, it's case insensitive, and you can use either single or double quotes. Now, my problem is that when I first started using database systems back in like 2000, I was using MySQL 3, and it's ingrained in my memory just to use double quotes because that's what we use back then. So I'm always had to correct myself every time I switch over to another database system to always go back and use single quotes. MySQL, again, has the mode you can say to follow the standard a little bit better. And they then enforce the single quote requirement. But I think by default, you don't get that, right? So in the SQL standards, say you want to take the name of Kanye and have it match with Kanye with mixed case. In the SQL standard, you have to uppercase, use the upper function to make this match work. In MySQL, you don't have to do that, right? Because all the strings are matching or case insensitive. Now, I showed this before how to do light clause. But thankfully, it allows you to look for string matching, right? So light is basically doing, trying to match some wild card in your string with another string. So for whatever reason, instead of using star, like most of the things, they use the percent sign. So percent sign means any one or more strings. And then the underscore means exactly match one character, right? So say I want to get all the course IDs from the roll table that start with 15, so 15, 445, 721. I would use the percent sign like that. And say if I want to get all the students where the login ends with at C and then some wild card, I would use a single underscore for that. So the SQL standard also defines a bunch of string functions. Again, the standard things you would expect, like substring, upper, lower, trim. These are all part of the SQL92 standards and most systems will have this. But of course, there's gonna be a bunch of other stuff that are proprietary in each different database system. So the thing to point out, unlike aggregation functions with string functions, and mathematical functions and date functions, they can appear anywhere in your queries. So they can be in the select output list. It can be in your predicates. They don't always have to be in the select portion. They can be in your having clauses. So here we have the first query is gonna take the name and take a substring, just get the first five characters. And then the second one we did is we showed before of doing matching where the uppercase version of the student name begins with K, K, a, n. So again, string functions are for the most part, the basic ones will be pretty much standardized across all systems. Where things get weird as when you want to start doing concatenation. So the SQL standard says that you use two double bars to concatenate two strings together, and most systems follow this. In my opinion, I would say Postgres and Oracle are probably followed the SQL standard the best set of all the systems. SQL Server and DB2 are probably next. SQLite, then probably my SQL is the worst. So you want to concatenate the student name and add CS to it. Use double bar in the SQL standard. In SQL Server from Microsoft, you use the plus sign. In my SQL, they don't have the plus sign, they don't have the double bar. And you only have this concat function where you basically define as the input parameters the things you want to concatenate together. They have something else that's kind of weird too, where you can actually have, if you don't put anything between two string literals, that'll concatenate them together. So to show you what I mean by that, so here we have my SQL. So I can do something like this, an, dy, space, p, a, v, o, right? Nobody else does this, as far as I know. Let me double check that before I lie to you. Let's try it in, yeah. Postgres doesn't like it. SQL Server doesn't like it either, right? Only my SQL does that. So now where things get really, really bad is when you have date and time functions, right? So a date basically just records the time stamp without the time. Time stamp is the time. Now they vary on the granularity of the time they may be tracking. Sometimes it's in seconds, sometimes it's in milliseconds or something even more fine-grained than that. But where things get wonky is how you actually start manipulating them and extracting them, extracting information from them, right? And so the syntax for all these different systems can vary wildly. So for this example we wanna do is I wanna show you how to do what I consider a seemingly easy function or easy operation. Get the number of days from today to, since the beginning of the year. And we'll see how to try to do this on these three different systems, okay? So the first thing is that we wanna get maybe what the current time is. No, I just killed my Wi-Fi back, okay. All right, so the SQL standard defines a now function. Can everyone see that or no? In the back guys, can you see it? Okay, cool. So you call now and it gives you the current timestamp. So we can do that in my SQL, and we can do this in SQL Lite. But you can't do it in SQL Lite, all right? So there's another way to do this. They have another function called current timestamp, right? Except Postgres doesn't have it. But they have the keyword, current timestamp. My SQL has the function, and it has the keyword, all right? SQL Lite, who says it has the function? Raise your hand. Some, who says it has the keyword? All right, it's mixed, all right. They don't have the function, they have the keyword, all right? Again, simple thing like what's the current time? It varies. All right, so now again, what we wanna do is we wanna count the number of days since the beginning of the year, all right? So as a building block, what we can try to do is maybe just get what the number of days is from today. So in this query, what I'm doing is I'm taking a date, let's say here's a string, right? So I'm taking the string for today's date and converting it into a date, and then I have the extract function, which will extract out the day field of the date, right? So as expected, you get 29, all right? We can try that in the other guys. It gives you 29, they don't have it, all right? So, but now we can see that, all right, well, if I can cast a string for the date into a date, what happens if I just subtract them, all right? So I'll take today's date and subtract it from the date from the beginning of the year, and in Postgres we get it, 240 days, right? My SQL, 728, I don't know what that means, right? It's not like 240 times two or some multiple of it, right? It's a number, right? It ran, but it's not what we want, right? So now let's try it in SQLite, SQLite gives us zero, okay? So it works in Postgres, so now we gotta figure out how to do this in MySQL. Well, so what we can do is we try to extract the day from, so maybe we take what we, that 728 we have before, right? And we extract the day from, see what that means, right? It came back with 28, so that's not really what we want. So the way to do it actually is a bit complicated. What we're gonna do is we're gonna convert the dates into Unix timestamp, which is the number of seconds since the Unix to Epoch, which is like some January 1st, 1970, right? And now we're gonna have the number of seconds from the current date, number of seconds from the beginning of the day, we subtract them, right? And that gives us the number of seconds between now and the beginning of the year. And then we're gonna divide that by 60 seconds times 60 minutes times 24 days, right, we got 240, all right? Turns out though, after I did this the first time, there's actually an even easier way in MySQL. They have a simple function called date div, and that produces the same answer, all right? SQLite, let me try to make this a little bigger, there you go. In SQLite, they don't have date div. You can't subtract dates with each other. The way I figured out to do it was convert the current timestamp to the Julian calendar, which is the number of days since 437 BC, or 4,370 BC. And so you subtract the number of days since that time, it's the number of days beginning of the year, and you get roughly 240, right? And we can cast it to an int and we get 240, right? So these are three super widely used database systems that all differ on some basic functionality of doing date and time, right? Actually, a quick show of hands. Who think of these three ones, which one do you think is the most popular, most widely deployed database system? Raise your hand if you think MySQL, about a quarter. Raise your hand if you think Postgres, even less. Raise your hand if you think SQLite, even less. The answer is SQLite. So a few years ago we had Richard Hipp, the inventor of SQLite. So SQLite is amazing. It's written by three dudes, right? He approximates that it's been deployed on 10 billion devices. Everyone here who has the cell phone, right, isn't a flip phone or an old person phone, is running SQLite on it right now, right? A lot of desktop applications, like Photoshop and Illustrator, they run SQLite on the inside, right? SQLite is everywhere. He said that also too, every single AOL CD, if you know what that is, all right, back in the early 2000s when the internet was sort of new in the US, there was a company called America Online, and they would mail everyone CDs for 10 hours for you on the internet, right? Every CD that they mailed out, hundreds of millions of them had SQLite running on it, right? SQLite is the most widely deployed database system that's everywhere, and here's the most crazy part. It's public domain. He gives it away for free, right? I mean Postgres and MySQL are open source, but like MySQL, Oracle owns it. Oracle owns the copyright. There's no copyright on SQLite, right? It's an amazing piece of software, okay. All right, so again, the main takeaway here was that simple things are hard to do, because there's no standard way to do things, even though there is a standard specification. All right, so now maybe what we wanna do is instead of having the, you know, in my examples, I had the terminal open. Every time I ran a query, the output got printed back to me in my terminal, but maybe what you wanna do is keep all the data you generate from a query, keep that inside the database system so you can use it in subsequent queries, right? This is way more efficient if your query's generating a lot of output. You don't have everything send down to your laptop and then push it back up to do more queries on it. So you can do output redirection to tell it to say, hey, don't print it out to me, write it out to this location. So one thing you need to do is you can take the output of query and you can write it into another table, right? And so in the SQL standard, you can use into, and this actually will create the table for you on the fly. So whatever produces the output of the query, right? Again, this is declarative. The data system already knows what the schema is of the table, so it knows in this case here what the type is of this output's gonna be. So it knows how to define a table that has that, that can handle those types, right? So intu basically takes the output of select and writes it into a table. In my SQL, you have to use the create table and then inside of it, you define your select statement. You can also have it output data into an existing table. And for this, you use insert into, which looks a lot like the create table from before. And then instead of having the value clause, you actually just have a select statement just to tell it, hey, get this data and write it into here. So the important thing about this example versus the previous slide, this is about writing tables into, tuples into tables that already exist. So that means that whatever the type, the number attributes and their types that are produced by the select statement, whatever table you're writing it into has to match that, right? If the select statement has four columns and your table you're writing into has three, the day system will throw an error and says I can't write it into that because the attributes don't match up. Now where things get weird is when you start having constraints that are on the table you're trying to write into and the select statement actually violates those constraints. So let's say that I have a primary key on my table that says I can't have any duplicate student IDs and then my insert query here tries to start inserting duplicates. Some database system will throw an error immediately as soon as it sees a duplicate and no tuples get written. Some of them will just keep going and ignore the ones that failed. Other ones will insert the ones that succeeded and just ignore the ones that failed or maybe just crash right away, right? So again the SQL standard says this is the syntax to use but how the systems actually implement it will vary widely. In addition to output redirection we can also do output control. So remember I said at the beginning SQL is based on bag algebra, meaning it's unordered. But there's many times where you want the ordering in your output clause. And so to do this you add the order by clause. So you're basically specifying how to sort the results that are being generated by this query. So in this example here I'm gonna sort the tuple, sorry the tuples from the enroll table based on their grade. And by default even though I don't specify whether I want ascending or descending the default in SQL is that you get ascending. You get output like this. But I also can add additional attributes to my order by clause to do more complicated things. So in this case here I wanna do my order by by grade in descending order and then after that I sort them based on the student ID in ascending order. And I would get output like this. Now the important thing to point out here also too is that unlike in the group by clause where any attribute that I wanted my output list had to appear in the group by clause in order by you don't have that restriction. It actually goes both ways. In this case here I'm sorting by the grade but the grade isn't part of the output. It doesn't matter. It knows how to find the data that it needs as it's processing the query and do whatever sorting operation that you wanna do on it. I can also do more complicated things. I can put any arbitrary expression in my order by clause as well. So I can do order by one plus one, right? And that's still valid. It knows how to handle that. Another common thing you wanna do is limit the number tuples that are produced in your output, right? And of course there's a limit clause for this. You basically specify, it says I, for all the results of my query only provide me back some number of them, like 10, right? And this is very common for things like, if you say like search results, you'll see it show just 10 and then you click the next button to go to the next 10, right? This is being controlled by limit and actually offset. So limit just says limit the number of tuples. Offset is gonna tell you that at what offset of the number of the tuples you're producing as your output should you skip before you start figuring out how many you should limit, right? Again, if you think of a webpage to show you 10 results, you click next and see the next 10, they're using a limit with an offset to make that work. Now, because it's unsorted again, there's no guarantee that when you click next, if you're using an offset, you're gonna see, you're definitely gonna see different tuples because that's another invocation of the query and the results may be different in different order of the second time. So in this case here, you would combine that with an order by clause so that you're guaranteed to go, see the first 10 followed by the second 10 and so forth, right? And there's optimizations you can apply whether or not you have an order by clause if you have a limit clause. So if I don't have an order by clause and I have a limit, I know that as soon as I see 10 tuples, I'm done. I don't need to go look at anything else. If you have an order by, then you have to see everything ahead of time to sort it, then you can apply your limit in your offset. All right, so again, I consider that somewhat basic SQL. In previous years, I actually skipped all this, but I think it's important maybe to go over it a little bit because you have to understand this for the homework. But now we wanna talk about the more complicated things. This is what I would consider advanced SQL. So the first thing we wanna talk about is nested queries. So the way to think about nested queries is basically allowing you to specify queries inside of queries, right? And you can take the output of one query and use that as the input of another query, right? So a simple example like this, I'm doing a select, I wanna get all the names of the students that are enrolled in at least one course. So I have my outer query is defined based on the student table and then inside of that, I have an inner query that's gonna get the student IDs from the role table, right? So we could write this as a join, this is sort of another way to actually do this. And in actuality, when it comes time to actually implement this inside the system, most query optimizers will try to rewrite this as a join, right? Because the worst way to actually execute this is to essentially have two for loops where you loop over every single tuple in the student table and for every single tuple, you then re-invoke the same query over and over and over again. My SQL used to do this, more primitive database systems used to do this, but the right way to do is actually just rewrite this as a join, right? So you can sort of think of this inner query as like a function that's gonna produce some set of tuples as your output and then you can apply whatever predicate you want on that in the outer query. So we'll walk through a bunch of examples of this and see how it works. So the first example you wanna do is maybe you wanna get the names of all the students that exist that are enrolled in 15, 445. So the way to construct a nested query is I think it's always important to start maybe with the outer query and think about what's the actual answer you wanna produce? What are the actual attributes you wanna see? And then you worry about how you're actually gonna filter them and get what you want. So the outer query we know that we want the name from the student table and then where we get those values we'll figure it out. So in the inner query, we can write it in as English as the student ID and the set of people that take 15, 445, right? So for that part, we know how to write that query pretty easily, right? We just filter out all the tuples from the enroll table where the course ID equals 15, 445. So now the question is how do we combine them with the outer query with the inner query? Again, we can rewrite this very easily with a join but for our purposes here, we wanna see how to do it as a nested function. So for this, we can use the in operator and we do a matching on the student student ID. So the way to now read this is that for every single student in the student table, so this first student ID is matching up with the student table, I wanna see whether there's a match of that student ID in the set of all student IDs that are in the enroll table that take the course 15, 445. So we execute the inner query, we produce the set of all student IDs and then for every single tuple in the outer query, we check to see whether it exists in that set, right? And this shows why I was saying before, the stupid way to execute this is for every single tuple in the outer query, we execute the inner query over and over again, right? That's stupid because we only need to produce the inner query once and then we can reuse it for every single tuple in the outer query. So I showed how to, in that example I use in, there's other operators you can use, so all basically says that every single tuple that's in my inner query must satisfy my predicate, any means at least one of them needs to, right? So in is essentially the same thing as equals any, right? Is there any tuple that it equals what my predicate or what my tuple attribute is checking? And then exists as at least one row is returned. I don't care what it matches, I just care what I see whether a tuple got produced as the result. So we can rewrite our example before of getting all the students in 15, 445 like this, right, instead of using in, we say equals any, so now you read this as the student ID from the student table, check to see whether it equals any tuple that exists in the set of student IDs that are produced by from the enroll table, right? Now the inner, the nested queries don't have to only appear in the where clause, they actually can appear anywhere, right, so I can rewrite the same query like this, where now I have my nested query in the output of the select statement. So this is a good example, this is essentially reversing what I did before in terms of what tables we're gonna access. So now the way to read this is that for every single tuple in the enroll table, where the course ID equals 15, 445, I'm gonna then do a matchup in the student table where the student IDs are the same, right? This is essentially doing a join inside my output of my select statement, right? Because now this student ID is being referenced here from the student table and that student ID is being referenced in there, right? This is another good example, again, I'm essentially reversing the order of how I process my tables, right? And they produce the exact same result that they may have different performance characteristics based on what my data actually looks like. It may be the case that this is actually faster to go through the enroll table first instead of the student table and therefore we rewrite it if we wanted to choose one versus the other. And a good optimizer could do this for you, okay? All right, let's look at something that will be more complicated now. All right, say we wanna find the student record with the highest ID that's enrolled in at least one course, right? Seems pretty simple, but let's think about how we actually do this. So the first approximation would be something like this. Select the max student ID along with the student name from the join enrolled and student table where the student ID in the enroll table equals the student ID in the student table. Will this work? Why? Why not? Yeah, there's an aggregation function without a group I and we're referencing a column that's not in the aggregation, right? So again, the SQL standard says that this shouldn't work. Let's find out. All right, so again we have, I gotta disconnect. Postgres at the top. So we run our query here. Again, Postgres, and here's the standard. It says exactly as they said. So you have student name appears, but it's not a part of a group I so you can't use it, right? In MySQL, same thing, it produces that error, but if we now run it in what they call traditional mode, right, so older versions of MySQL would do this. Now we run this query and we get an answer, right? We have the max student ID and then we say that the name of that student is Tupac, right? Let's try it in SQLite. Oh, wrong one. It produced an answer. It also produces 53.688 as the max student ID, but it says that that student belongs to Justin Bieber, right? And actually, I don't know what the right answer is. Let's see here. Yeah, so Justin Bieber is the right one, right? Whereas MySQL, make sure I'm giving you the same data, Tupac is actually the lowest, right? So it got it completely wrong, right? So the, right, so this doesn't work in the SQL standard. It runs in SQLite and MySQL if we turn off that strict mode thing. So the way we can do this is a nested function is again, let's build it constructively. So we know that we want the student ID and the name as the output, but it's the where clause that we have to figure out and this one basically says that we want to get a matching tuple that is greater than every other student ID that's in our table, right? So we know the interquery should be basically the student ID from the enrol table. We can be more sophisticated, maybe put it distinct there, but it's all the same. But now we need to figure out how to match the student ID from the student and the outer query, the student ID from the interquery. And for this, we can use greater than equal to all, right, has to be greater than equal to because we're gonna make sure that we match ourselves and the student that actually is, that does actually have the highest one, right? We can rewrite this in other ways, as expected, right? We can rewrite it with the in clause like that and actually compute the max student ID and the interquery. Right now, basically it's the saying, match the student ID that is the max student ID produced from the enrol table, right? We can go even further. We can rewrite the inner guy to be like this, right? We do an order by the student ID and just rank them into sending order and then do a limit one. So some systems will actually just rewrite this one to be a max anyway, just scan everything, keep track of which one's the max and then produce as the final output. Don't even bother doing sorting, right? Again, saying query rewritten in different ways. One more example. We're gonna find all the courses that have no students enrolled in them. So we take our outer query as a select on the courses and then we know our inner query basically says we're gonna find where they have no tuples in the enrol table. So for this one, we want to use not exists. Basically says we don't wanna match anything in our inner query and all we need to do that for the inner query is just grab every single tuple and here now we're actually matching up the course ID in the inner query with the course ID from the outer query. So you can only do this in one direction. So if you have the inner query, you can reference the outer query. If you're in the outer query, you can't reference the inner query, right? Unless you pipe it out or redirect it to a table. So any questions about nested queries? Again, they're very powerful. Many times you cannot write what you wanna write in a single query without using nested queries. Yes. So this is a thing of inner queries as like a nested query. Her question is, can you think of an inner query as a nested for loop? Yes, but no. So we'll see for loop has a notion of like ordering. It's really a set, right? So all those operators like in exists any, those are just trying to say for the entire set of tuples that are in the inner query, check to see whether they're any of them matches. You're not really iterating over every single one. You can think of the outer query as a for loop because you're iterating every single tuple, but then the set portion, the evaluation of the inner query is always at sort of a bag or set level. Makes sense? Okay, window functions. So window functions are, I would say they're new but they're like 15, 10 years old now so they're not brand new. A lot of systems don't support them but the major ones do. So a window function is sort of like an aggregation where you're gonna compute some function on tuples but rather than doing it on a subset of the tuples and collapsing them down into a single result, you sort of do this in a incremental fashion or in a moving fashion and then you still produce the tuple as the output but along with the value that it produced from the window function, right? So the basic syntax is like this. You have the function name and then you have an over clause. So the function name will be our aggregation functions and other special window functions we have which I'll show in the next slide. And then the over clause defines how we actually wanna slice up the data, right? This is sort of like combining together the aggregation and the group by but in a single clause. So the function is like the aggregation function, the over is like the group by. So the aggregation function should be all the things in the SQL standard that we talked about before, min, max, average, count, sum. The special window functions could do things like introduce a row number to the current row. So to keep track of the tuples as it's being output and then marks them with what order they're coming out. And then rank would be the order of the position of a tuple if we're doing sorting, right? So say what I wanna do is I wanna go do a select over the enroll table and I wanna produce all the tuples as my output but I wanna just mark them with the row number of the output, right? So I have my row number function and then for my over clause, I just leave that blank. And then what I end up with is a result that looks like this, right? I have all the data that I had before but now I have this special column here, row num, that's just again the order of the tuple that it was produced in the output. So sort of like I compute my entire query and then I do my window function to go over the results and then I add in whatever the computation that I wanna generate. So just like a group or so aggregations, we can combine things together or group them together. This is what the over key word does for us. So for this, we would use partition by to specify how we wanna group things, right? So in this query here, I'm doing the same thing as before where I want to combine them together based on the, I wanna generate the row number for how they produce in their output but then I'm gonna group them together based on the course ID, right? So my output would look like this. And so now again, it looks just like the aggregation where now I'm grouped together based on the order, right? Pretty simple. So let's look at something more complicated. So I can also order by these, I can order them instead of partitioning them and this is essentially defining how we want to do our ordering to produce our outputs. If we do this ordering, then we compute whatever this, the window function is that we wanna compute on that. So in this case here, this is essentially gonna do the same thing I did in the previous slide where it's gonna more or less group them based on the course ID but this is doing this by using ordering rather than partitioning. All right, so let's say we wanna compute, we wanna find the student with the highest grade for each course. So what we have here is now we have a nested query. So in the outer query, we're just going to, well, in the outer query is the start of error. In the inner query, we're gonna produce some table result but now this is a good example where we're having a nested query inside of the from clause. So I have from here and then I'm taking the output of this inner query and I'm gonna map that into a sort of a temporary or virtual table called ranking that only exists for this tuple, right? So instead of writing it to a table that already exists or writing it to a temporary table, I don't wanna say in memory because it may actually go out the disk but a temporary tuple table for this query that then gets discarded when the query is over. So in the inner query, what we're gonna do is we're gonna go over the enrolled table and for every single tuple, we're gonna split them up based on the course ID, that's the partition clause and then we're gonna sort them by their grade in ascending order. And then what we produce, the window function we invoke is rank which is the order that they exist in the sort of ranking. And then we take that output, write it out to the ranking table and then in the outer query, we can then do additional filtering based on their rank. So this is only can produce the tuples that are ranked first, right? So this is finding the students with the highest grade for each course, right? So the thing to point out here is I have, I'm referencing in my outer query this rank attribute here which actually doesn't really exist in the database. Again, it only exists within this query. I'm seeing a lot of blank faces so let's maybe pop open the database and see what we can do. All right, so for this, so Postgres is the only one that actually supports this. My SQL 8 supports window functions and the newer version of SQL Lite supports window functions but for simplicity, we'll just do in Postgres. All right, so this is the query that we have before, right, let's break it up and just do the inner query first and that way it'll be sort of easier to understand, right? So we'll just go back and run this. So again, what this is gonna do is going to go over the enrolled table. It's going to partition each record based on the course ID. So you see that here where 15, 445 appears first, then 721, then 826. And then now within each of these partitions it's then going to sort the tuples based on their grade in ascending order, or sorry, in ascending order. So B becomes four C, A becomes four C and then there's nobody else taking 826 so the B is by itself, right? So that's how we got that output in the form that we defined. So now the rank function is gonna be computed based on where each tuple appears in the sorted output list. So in this case here, this first tuple within this partition, this guy came first so he gets rank one, this guy came second so he gets rank two, right? If I change this now to be in descending order, it still does the same thing, but I could do an order by again and now I want to go in descending order, right? Well, this is doing the order by, this did the order by after I did my window function so it doesn't really make sense, but the main thing, the rank is different than the row number because the row number says where do you appear in the output? The rank is where do you appear in the sorted ordering? Okay? Yes? Say it again, sorry? He says if you have a rank without an order by, we'll just return a random order. Well, you'll still get it ordered by, what you'll happen is you'll still get them probably split by partitions. It doesn't have to be because it's unsorted, but that's what you'll get. But now here, yeah, so in this case here, there is no sort ordering anymore so everyone has the rank of one, right? But it's still grouped them up together based on what I'd find as my partition and that's more of an artifact of how the database system actually executed the query rather than the semantics of the query itself. His question is what does the rank function do? The rank is the rank of, the rank function produces the rank of the sort order. So if I change my window function to be row number, okay, I have my partition, but it's one, two, one, two, one, right? Because within each partition, that's what I would appear. So if I remove actually the partition by, then it should go through one through five, right? Like that. If I change this now to rank, everything should be one, one, one, right? Because there is no sort order, everyone's first. Yes? Is it pointless to do a rank without an order by? Yes, because there's no sort order, there's no ranking. All right, cool. All right, so the last thing we wanna talk about is CTEs, complex table expressions. So this is probably, again, this is, I find this very interesting, this is probably one of the more complicated things that you can do in SQL. And it's gonna look a lot like nested queries in that you're taking the output of a query and using it as the input for another query, but we'll see in a second what you can do with CTEs that you can't do in nested queries. Yes? So her question is, in the case of this, is it this particular query or does it in general? So her question is, is there a particular advantage of using rank with a window function over using a group by? So again, if you use a group by, that's not gonna generate the same, you're not gonna get the tuples as part of the output anymore, right? So if I go back, so they say I do this one here, right? Row number by partition by. So if I go back and I say I wanna compute now like the max grade from enrolled group by course ID, like I no longer see what the original tuples were because they get collapsed together in the aggregate function. The window function still produces all the tuples that were in your output, but I can now see them as, I can still see them as the output. So if I go back here, and this may not work, but let's see. So maybe I try max grade partition by course ID. Yeah, so here, that didn't quite work either. What's that? Oh, you know what, it might be, no, they didn't like it. There's the rank. Oh yeah, because I'm an idiot, sorry, yes. It did work. What I wanted was min. Yeah, so the highest grade of 15, 445 was a B, right? So I still see my original tuples. I still see one student got a C, one student got a B, but the max for that group was B. So I didn't lose the original tuples. Now, you and your application code have to derive meanings from this output to get the one field that you want, but this allows you to, again, still keep the tuples in a way you can't with a group by. That's a good question. Yes, this question is, can you use row number after you use an order by? So something like this, right? And then maybe order by, oh yeah, yeah, sorry, that's why. There's the as rank. So let's do this, row num. Oh, I'm an idiot, sorry, no one can see this. Oh, can you? Yeah, sorry. My screen shows one thing, this shows another, and that shows this, okay. We're good, all right. Now, if we remove the partition by, and now we can maybe put descending, right? So now we'll get five, four, three, two, one, right? It's actually a thing about just replacing the rank. We replaced the rank with the row number, what do you mean? Yeah, so like you can say like, over thing, there should be like a partition by, and we're about, we're already collected, and then we would just change that to the number. Let's do this online, because I want to get to CTEs, because you need it for the homework before we keep going, okay? We can try it afterwards. Okay. All right, so CTEs, the way it basically is going to work is that, you're gonna have, we introduced this with clause. This with clause is like a query that's going to execute before your regular query. So we have with, and we're defining the name of our CTE, and then we have our as clause, and whatever is inside of the parentheses after the as, the output of that query would then get sort of mapped to the name of our CTE, and then in the select query that comes below it, can reference it just as if it was an existing table, right? So in this case here, what will happen is I generate a CTE that invokes the query select one, right? Selects don't have to have a from clause, this produces a single tuple with a single attribute with the value one, and then this other query at the bottom to select star on it, and this outputs that single tuple. So I can start doing more complicated things. I can now bind the name or the output columns from the queries inside of the CTE to given names, and then I can reference them by that name down below in my statement at the bottom. So this CTE produces a single tuple with two attributes with the values one and two, and then these get mapped to the name column one column two, and then in the select statement below, I can just add them together, right? Pretty straightforward. So let's go back and try to do that example we did before, where we wanna find the name of the student with the highest student ID that's enrolled in least one course. So for this, I have my CTE, and inside of that, I'll do my max on the student ID on the enroll table. Again, that produces one tuple with that max student ID, and then down below, I can do a join where I'm gonna reference the CTE that was generated above me, and I just combine together the max ID that came out of this, which is defined here with my student ID, and then that produces the one tuple that I want. Right, again, this is another example of how to do the same query, at a high level, semantically, the same query, just written in different ways. So now you may be saying, well, how is this any different than a nested query, right? The answer is, you can do recursion in a common table expression that you can't do in a nested query. So bear with me here, but this query is gonna produce a sequence of numbers from one to 10, like a for loop, right? So now we have, with recursive, we add the recursive keyword, and then inside of our CTE definition, we're gonna do a union between a single query that produces the value one, a single tuple with a single attribute one, and then we're gonna then union that with another query that actually references ourselves. So seeing here, we're gonna vote on our own CTE, and then we're gonna take whatever the counter is that's produced by this and add one to it and produce that as the output, right? And we keep running this until our where clause actually gets tripped up where we try to go above 10, and then now we stop producing tuples, so now we have the answer that we want, that we can then reference below in our CTE. Yes? It's basically a union, right? It's union with duplicates. Union by default, sorry, sorry, it's union without duplicates, union with duplicates, union without all strips, keeps duplicates, union without all removes them. Might be the way that we can test that though. Okay, so let's do this in Postgres. So for the, right, so this is the one query that, this is the query that I showed you, right, and we produce the list of tuples, right, with a single value one to 10, right? Let me try to kill this. All right, is that better? All right, but you'd be careful about CTEs with the cursive because, again, because you can essentially have infinite loops. So in this query here, I no longer have that counter where it's greater than 10 or produce, check any tuples where they're less than 10. This will run actually forever. But what I did first was I defined, I told Postgres to, again, I'm highlighting here, you can't see it. I told Postgres that any query that runs longer than 10 seconds, automatically kill it. So when I ran this query now, it no longer has the where clause to prevent it from going, looking at tuples beyond 10. And so essentially it runs forever. Postgres recognizes that we're stuck in a query that's taking too long and it goes ahead and automatically kills it. Yes? So select one just does this, right? So now if I do select one union, all, select one, right, I get like that. So if I think I remove the all, yeah, you get, it removes duplicates, right? So union all will give me everything. So essentially what's happening now is I'm calling my CTE and say whatever tuples you have in the current value, take the output and then add one to it here, right? The plus one there. So this is invoking on our source CTE called source. For every single tuple in there, take its output, add one to it. So in order to get that tuple, it has to go then invoke the CTE which then produces back one. Okay, I'm going to put it in here and then take that out. Yeah, yeah. What will the other thing do is write the, it won't let you invoke it because it knows you're trying to reference yourself. So this question is what happens if you try to remove the recursive clause? Again, it's SQL, it's declarative. We know everything you're trying to do. So you're trying to access a CTE table that's defined by yourself and it doesn't let you do that. Procursive lets you do that. So let's try it now also with, let's try it by removing the union all. Let's see what happens. The all part. It still produces the correct answer. Yes, you guys, you don't need the union all, right? So in this example here, again, I'm synthetically generating a table that has a single value of one and then I invoke a query to get that tuple and add one to it. But then it generates a new tuple which then I can then invoke again and add one to that. And I keep doing that until I don't produce any more matches, right? Because essentially the recursion ends because I've reached my limit of 10. Yes. Why do you think there is a CTE table that doesn't do anything at all? Why do you think there is a CTE table that doesn't do anything at all? The first thing that you're doing is you have to line the tuple and then you have to do anything at first. The first thing you've got to line the tuple and then you have to produce the next one. So this question is, why are there not duplicates? Because the first time you invoke it, you have one. The second time you invoke it, you would have one and two. Let's do this offline. Let's walk through it. Because again, we're out of time. All right, yes, quick. Is this a call to use CTEs? Do people use CTEs supposedly? This question is, do people use recursive CTEs? Is that very common? Yes, absolutely, yes. So this actually finishes up very nicely. So the second point I'm trying to make here is that in an ideal scenario, you always want to be able to compute an entire query without having to bring any data locally or without having to go back and forth. So it also provides, by running as a single query, although the complexity, it makes it harder to do query optimization, but if you tell the data system everything you want to do with this sort of piece of data, then it can do a global optimization on that. So by having a CTE, you're not having to have additional logic be somewhere else outside the data system or go back and forth. You're saying, here's everything I want to do. So CTEs are actually very common, especially in newer applications. Recursive CTEs, maybe less so, but definitely CTEs. But it's another way to write a nested query. All right, the other major thing I want to point out is that, again, the language is from 1973 or 1974, but it's still widely commonly used. It's being updated all the time, and learning SQL is important because you're going to see this again throughout your entire life. Pretty much every single system, database system that with some minor exceptions is going to support some variant of SQL. Okay, so real quick, homework one. Immediately after class, I will upload it and I'll be available. Basically, we're giving you a sample dataset of this bike share application from San Francisco. It has a few number of tables on it, and we're asking you to write queries on them. So for this, you can do everything on SQLite. Everybody's laptop should already have SQLite installed, otherwise it's free, it's not hard to get. So you write all your queries with SQLite locally and then you upload them in the grayscope and it'll run SQLite there and do a diff to check whether you have the result. Okay? And it'll be due on September 10th at midnight. All right guys, have a good weekend. I'll see you Wednesday next week. Monday is a holiday. That's my favorite all-brad. All right. Yes, it's the SD Cricut, I-D-E-S. I make a mess unless I can do it like a Gio. Ice cube with the G to the E to the Tio. Now here it comes, Duke. I play the game where there's no rules. Homies on the cusley, I'm a foocuss, I drink brook. With the bus a cap on the ice, bro. Bushwick on the goal with a blow to the ice. Here I come, Willie D, that's me. Rolling with fifth one, stop pulling. By the 12-pack case, I'm a four. Six-pack, 48, gets the real bounce. I drink brook, but yo, I drink it by the 12 ounce. They say bill makes you fat. But saying eyes is straight, so it really don't matter.