 All right, so in today's class, I want to start talking about what we call advanced SQL. All right, again, the first homework assignment requires you to do, you know, write SQL queries. But given that SQL is 40 years old, there's a ton of tutorials on the Internet, there's the textbook, and a lot of you actually already have SQL experience. I'd rather not spend today to teach you the basics of SQL. I want to teach you the more complicated things you can do with it, the more interesting things, and then, you know, have you learned to sort of understand the power of what SQL can do. So before I begin with that, I just want to make two quick announcements. If you're interested in database research, which is my area, there's two additional opportunities you can have this semester to check things out, to learn what's going on in sort of the state-of-the-art database systems. This course is sort of designed as a classic database system course, meaning we'll teach you sort of the basic architecture of how they design stuff in the 70s and 80s. But if you want to learn more state-of-the-art things, then the two opportunities are to come to the database research group meetings, which are on Mondays at 4.30 in Gates Hall on the 8th floor. And then we're also having a special seminar series this semester. We're inviting speakers from time series database companies. The time series databases are systems where you ingest, like, streams from IOT devices or other things, and you want to store them as sort of a series of events, and you want to do queries on top of them. So we have a bunch of speakers coming throughout the semester from the top time series database companies. We're asking the founders and actually the guys that are actually building the database systems to come talk to you about how these systems actually work. So again, both of these are entirely optional. I'll send an announcement on Canvas. These will also be posted on Panopto and YouTube. So if you want to sit in your dorm room by yourself and watch these, you can do that as well. So real quickly, before I get started now talking about the relational language and SQL, I just want to say that two things about my lecture style. So the first is that I will not answer any questions at the end of the lecture about anything about the materials I covered. So that means that if you have some question, you hold it into the very end and you run down and try to ask me, I won't answer it because I'd rather you raise your hand and stop me and ask the question during the lecture. Because if you have a question, then probably somebody else does, right? So I prefer if you did that. The second thing is that I've also been told that I speak fast. I get very excited about databases and I start speaking very quickly. So again, if I'm going too fast and you don't understand something, just raise your hand and tell me to shut up and slow down or explain something, right? Because if I'm speaking too fast for you, then I'm probably speaking too fast for somebody. I'm aware of this problem. I've gone to counseling and I'm trying to make an effort to not speak fast. But as I said, databases is the only thing I really care about in my life other than my wife. And so I get excited and I start speaking fast. That's not true of the dogs, number three, but that's all right. Okay, so in the previous lecture, I posted it yesterday, which I realized is a bit late. So if you haven't watched it yet, there's nothing that we're going to cover here in today's lecture that depends on you knowing exactly what I talked about in the last class. But I'll just say that what I showed in the second lecture was these sort of mathematical concepts of how you can ask questions on a database system or on a relational database. So we looked at relational algebra and we looked at very briefly, relational calculus. And so obviously you don't write programs in relational calculus or relational algebra. You need sort of like a programming language that it's easy for humans to write. And so this is essentially what SQL is, right? SQL is the language that you can use in your program to write queries on a relational database system. And the key thing to understand about SQL, and in fact all relational languages in general, is that they are declarative as opposed to imperative or procedural, as you normally would do in Python. So what I mean by that as a declarative programming language is you as the programmer only have to tell the database system what the answer is that you want and not how to compute it. So you remember from the second lecture when I showed these relational algebra operators, we just defined them as sort of low level primitives to read data and perform filters on our relations. We didn't actually specify anything about how we wanted the database system to do that. There was no for loops. There were no built in hash tables and other things that you would normally have in a programming language. We only had to tell the database system what we wanted. And in relational calculus, the tuple relational calculus was sort of an example of this. We just said exactly what we wanted. We didn't specify any steps. And the database system would have to figure out what the best way it is to execute it. So that's the main win you're going to get with SQL as opposed to using Python or C++ to program on a database. And so that means from our point of view from actually the internals of the database system, because that's what we care about in this course. We care about building the database system, not just using it, that means it's up for us as the database developers to implement a query optimizer inside of our system that can take a SQL query, convert it into the relational operators, and then generate the most efficient plan that we can find to how to execute that query efficiently. So every time you open up SQLite, or MySQL, or PostgreSQL, or any relational database system, and you give it a SQL query, it's going to run through this very complex piece of code called the optimizer that tries to figure out the optimal plan for this. And it's actually quite amazing what these things can do. They can run these things in microseconds or milliseconds. For more complex queries, they take a bit more time. But we have to do this because the user's not telling us exactly what algorithm to use. We have to figure out the right strategy on our own. So we'll cover the query optimizer in a few weeks. I'll just say that this is sort of the black art of database systems. This is kind of the piece of these systems that, at least in the commercial systems, that they keep very secretive because they've spent hours and hours and hours, millions of dollars, multiple years, building these things. And this is really what separates the commercial database systems, which are very expensive, from the open source guys. So the SQL server query optimizer is actually really, really good, and much better than MySQL, PostgreSQL, and SQLite. Because it had really smart people with PhDs and a lot of money spending a lot of time to improve it. So we'll cover that later when we talk about query optimization. So the SQL language itself, as I said before, defines, has basically two categories of operations you can do in it. So the first is the DML, the Data Manipulation Language. So these are your selects, insert, updates, elites. And then you can also have the DDL, or the Data Definition Language. And this is how you define your schema for your database. Create table, create index, create constraints, and things like that. There's some additional things to create views. And there's things you can do transactions. We'll cover those later. But this is everything you need to do to program on a database can be defined in these two categories. So now one very important thing that I need to bring up, and it'll come up throughout today's lecture, is that SQL is not based on sets the way relational algebra was. Instead, it's going to be based on what are called bags. It's an unordered collection where you're allowed to have duplicates. A list has an order, and you can have duplicates. A set has no order, and you can't have duplicates. And a bag is no order with duplicates. And we do this in SQL because it's actually expensive to remove duplicates. There'll be a keyword to actually do this, but in practice, most people don't actually need this. And so it's not worth the extra overhead to figure out to prune your duplicates when you run your queries. So again, this will come up when we go through a bunch of examples. Just be mindful of that SQL by default will allow duplicates, whereas relational algebra does not. So the history of SQL is actually kind of interesting. So again, if you hear me saying, I call it SQL, but the letters are SQL. Some people say MySQL, some people say MySQL. And the reason why I say SQL is not that I was allowed in the 1970s, but when the language first came out in the 1970s, it was written as S-E-Q-U-E-L. Instead of the structured English query language. So SQL was developed by IBM Research when they were building one of the first relational database systems in the 1970s. So again, Ted Kod wrote this amazing paper on a relational model, but he was a mathematician. He wasn't a programmer. And these people in San Jose saw the paper and said, hey, let's try to make one of these database systems. And they basically took his paper and actually made a software that was based on the relational model. And so one of the things that they had to develop was the query language. Because in the relational model paper from Ted Kod, he didn't specify any query language. It was all mathematical notation for relational algebra and relational calculus. He then later proposed a query language called alpha, which never actually got implemented. But these IBM guys that were building system R, they had to invent their own. And they came up with SQL. So system R is going to come up multiple times throughout the semester, as well as the other early relational database system, Ingress, which came out of Berkeley. If you know Postgres, Ingress is the predecessor. Postgres is Post-Ingress. It's the same guy that invented both of them. So the Ingress guys had this other language called quell that was sort of based on Ted Kod's language. And at the time of the 1970s, SQL and quell were sort of considered equivalent. They were sort of equal competitors. The other relational database that came out later in the 1970s was Oracle, which is one of the most famous database companies. And when they were building Oracle, they actually borrowed a lot of ideas. I don't want to say copied, but borrowed a lot of ideas from IBM, because they would literally call the researchers at IBM and say, hey, what does your thing do if you give it this? What happens with that? And they just reimplemented the same thing in Oracle. And so what happened was, when IBM never actually really released System R publicly, because they still had IMS, they were still making a lot of money with that. But then by the late 1970s, early 1980s, it was sort of obvious that the relational model was going to win. So they released a new database system called DB2, which is still out today in 1983. And it supported SQL, because IBM invented SQL. And since IBM was sort of the computer juggernaut at the time, because they supported SQL, that became the de facto standard. And Oracle was sort of the right place at the right time. And they supported SQL. They were compatible with IBM. And sort of SQL won. So that's why we use SQL today, not Quel or Alpha. So it was ratified as an American standard in 1986, and then became an international standard in 1987. And I think some guy in London out of the copyright or trademark for SQL written out as EQEL. So they had to shorten it to SQL to avoid a lawsuit. Even though the SQL is 40 years old, it is not a dead language, and there's updates to it all the time. So the latest version of the SQL standard actually came out last year in 2016. And as you can see from the timeline here, every so often there's a new update with new features and new functionality. Typically what happens is all the major database vendors get together, and they all have some proprietary SQL features that they added to their system. And they try to push the standard body to adopt it. Sometimes it works, sometimes it doesn't work. So the latest version, again, is 2016. This added support for JSON and polymorphic tables. And as you can see, as you go down, there's sort of all these new features that gets added to it. So if you're going to build a new relational database system and you want to claim that you support SQL, the bare minimum you need to have is what SQL92 is in that standard. And that's your select, insert, update, delete, basic transactions, your aggregation functions, everything like the bare bones you need to support SQL is defined in SQL92. And so there's a nice website here. It's a few years old now. But in this website here, the guy actually goes through the major database systems and shows how, where they follow the standard and where they're incompatible. So this will be come up multiple times. And I'll show examples of this in this lecture. But although there is a SQL standard, almost nobody actually truly follows it. Everyone has their own little proprietary nuances and differentiations. In my opinion, Oracle and PostgreSQL are probably the follow the standard the most. My SQL, historically, is the biggest offender. They've gotten better in recent years. But when everybody, all the other databases would implement a function one way or have it called one thing, my SQL would come up their own term. But like I said, in the newer versions, they're slowly fixing these things. So for today's agenda, as I said, I want to focus on more advanced features of SQL that you're going to need to understand and know in order to complete the first homework assignment. And again, in the textbook, you can read how to do basic selects, inserts updates, deletes. But I want to focus on things that sort of go beyond what they talk about. And in particular, the one that you're going to really need for the homework is CTEs, or common table expressions. And I'll give a little demo as we go along to show you how PostgreSQL SQLite and my SQL supports these different operations and where they deviate from each other. So for this, the example database I'm going to use in this lecture is a simple university application that has three tables. We have a student table with a student ID named login in GPA, a course table with a course ID and a name, and then we have a cross-reference table called enrolled that maps the student ID to the course. And then lists what their grade is. And again, I'm going to use this as the example for every slide. So the first thing we have to talk about are aggregations. So an aggregate is a function that's going to take one or more tuples, a bag of tuples, and then do some kind of computation on them, and then produce a single result. So again, say you take the first one, the one at the bottom, count. So the count function will count the number of rows that you have and produce a single scalar value that gives you that count. Min and max are sort of obvious. Some add together all the values of the column, and average takes the average of them. So these sort of five functions here, these are defined in the SQL 92 standard. But there's some additional ones, as you can expect, like mode, median, standard deviation in the newer versions and other database systems. So let's look at an example like this. So you put the aggregation function in the output clause of the select statement, and so this example here, we want to count the number of students that have a login that ends with at CS. And then this will just get outputted as a scalar value, as its own tuple. We can also rewrite this in another way. So in the first example I showed, I'm counting the number of logins. But this is actually kind of like unnecessary, because we don't actually care what the contents of the log in field, we're just counting the tuples that we find. So we can actually replace the actual field log in inside the count and just put a star. That's considered equivalent. Or in another way, we can just put one. For every tuple we find, we just count one and add it up. So this is a short example that the same SQL query can be written in a bunch of different ways. And this will come up, and we talk about more complex things. But this is sort of essentially what the query optimizer is going to try to figure out to do, is how to rewrite your query into a more simple form that's easier to calculate. Let's look at another example. We want to get the number of students and their average GPA that have the at CS log in. And this shows that you can add, you have multiple aggregation functions together in your output list. And it'll compute those sort of independently with each other. So this will, again, do our count of all our student IDs, and then it takes their GPA and it computes the average. And you get a single tuple result like this that has separate values for each of our aggregation functions. All right, so this is pretty simple. You do more complex things. You actually throw the distinct clause in there. So this is going to count the number of students, the number of unique students based on their login address. And this is essentially, again, just finding the duplicate logins, collapsing them to one, and only counting on them one time. And you're going to use a single scalar like this. All right, so the thing to be mindful of about aggregations is that you're often tempted to get more information about what the aggregate you're computing. So say in this query we want to get the average GPA of all the students that are enrolled in each course. And so we want to include maybe the course ID, because otherwise the average, it doesn't make any sense. So this actually won't work, because the course ID is undefined for this. So actually this is, most database systems will say you can't actually run this query. Because you think about doing, you're collapsing multiple rows and computing the aggregation on the GPA. And what are you supposed to do with the course ID in each row? Do you take the first one? Do you take the last one? Do you take a random one? MySQL used to give you a random one, and now they fix it and throw an error. And so this is actually also not the answer we were looking for either, right? Because we want to take each course and compute their average GPA per course. And so to do this we can use the group write function, the group write operator. So now this is the exact same query they have before, but now I'm adding group write at the end on the course ID. And then that will take all the joined tuples from enrolled in the student, and then basically take the course ID of the tuples and put them in buckets, where they have the same value. And then for each of these buckets we'll compute the GPA on them, right? So this again, so this is going to allow us to now get categories of aggregations across multiple tuples using the group write clause, right? So the key thing to point out is that you can't have anything in your select output if it's not included in the group write, and it's not part of an aggregation function, right? So here I have, for whatever reason I want to do, I want to get the student's name per course. This will actually throw an error because it's going to tell you that it doesn't know how to do anything with this tuple because it's not included in your group write clause. So if you just add it down here, then it can actually compute the correct answer. Is this clear? OK. So now you actually want to filter your aggregations after you've actually produced them. So here again, now we're going to compute the average GPA on students, and we only want to get the courses where the average GPA is greater than 3.9. So you feel like you want to put the three point, in your where clause, where average GPA is greater than 3.9. But this won't work because when it's actually running this query, as it's doing the filter inside the where clause, it actually hasn't computed the GPA yet. So it doesn't know how to run this predicate. So this will throw an error. And so what you really want to do is a having clause. So this basically says, I'm going to compute my aggregation first, then after I generate my intermediate results, I'll run through this having filter, and this will prune out anything that shouldn't be there. So you have to use having to reference anything you compute in your aggregation after you've computed it. Yes? Your question is, what happens if non-aggregated attributes? What do you mean by that? I just figured, after the select, could you please go to the previous slide? This one. A previous slide. And non-aggregated value in the select must be appeared. I mean, if they happen to be same. So what do you mean by the same? The same name? Yes. OK. Happen to have a sex, and all of them are male. Yes? And I just select average s.gba male as sex. So your question is, I think you're asking, if I do a group by on, say, the sex column. It must appear group by equals, right? Yes. But sex, I didn't group by sex. But they happen to be same, I just put it in the select. So what do you mean by this? Like they have the same value? Same value. Yeah, but the database system doesn't know that, right? You know that because maybe you wrote your application that way. So his question is, is there any optimization you can do where even though the values are, even though the column is not included in the group by, but they all have the same values, can then still put it in the where clause? And the answer is no. Because remember, we're doing this, like, when the query arrives at the database system before we actually run anything. So the system doesn't know that it may be the case that we could run the query you want and actually produce the correct answer. So it's doing this check beforehand. And this is the nice thing about a declarative language, because we can check this before we actually haven't run it. OK, so we fix this with having, and we get the result we're looking for. And as I said, the way the system is going to run this is compute the aggregation first, and then do the additional filtering afterwards. All right, so now we can talk about SQL operations. And this comes up in the first homework. So the SQL standard says that all strings in your database are var char fields, char fields, text fields. Any string has to be case sensitive. And the way you define them is using single quotes. And for the most part, most of the data systems actually follow this. The black sheep are MySQL and SQLite. So in SQLite, it's case sensitive as the way it should be. But then they allow you to have single and double quotes. MySQL, at least as a 5.7 lesson I checked, it's case insensitive, and you can have both the single quotes and the double quotes. So I would always try to, even using MySQL, always try to use single quotes, because it makes your database code more portable. I have to admit that when I first started using databases on MySQL 3, you always use double quotes. And now when I use other systems, I'm always have to go back and correct myself, because it's a force of habit to always put double quotes. And here it's just showing that, again, in MySQL, we can run this query, and we can have any case of Kanye's name, and it'll match, whereas in the SQL standard, you actually uppercase everything if you want to do an exact match like that. So we want to do pattern matching in SQL. We use the light keyword. There are extensions in SQL that support regular expressions, but most of the times this is the most simple thing you need to use. And so you have the light keyword, and then you give it a percent sign to mean one or any number of characters can match. I think of this as the dot star and regular expression. And then if you want to match a single character, this would be just an underscore. I don't know why they didn't choose star back then, but it is what it is. So now we can also have a bunch of string functions as well. So the SQL standard defines some basic string operations, like string length, substring, and other things like that, lower upper. And of course, all the database systems have their own proprietary string functions and extensions. The key thing I just want to point out, though, is that you can invoke a string function anywhere in the SQL statement you have a string, or varchar. So it can be in your where clause, or it can be in the select output. It doesn't matter. To concatenate strings, you want to use the double bar. So in the SQL standard, it says this is what you should use. In SQL Server, from Microsoft, they use the plus sign. And in MySQL, at least as a 5.7, they only support the concat function. I don't know why they don't support the more simple syntax, but if you need to concat strings, this is what you have to use. SQL Lite will use the double bars. What's that, sorry? OK. All right, so overall, strings are usually mostly compatible from across the different database systems. The date and time functions are pretty loosey-goosey in each system. They're all much, much different. And this sort of sucks, right? Because you think about it, SQL's supposed to be this sort of universal syntax, universal language, to communicate or write programs or write code on a relational database system. But there's so many different proprietary things that it makes it very hard to have your code be portable. So you have WordPress, for example. WordPress only runs on MySQL. If you want to use another database system, it's a lot of work to actually convert everything. So the date and time functions are probably the worst parts, right? And so date times, they have these different operators you can do in different, in either the output or the predicates, just like we did with string functions. But things are much, much different. So I want to go a quick demo of this. So this is three different database systems running on my laptop. Can everyone see that? Or should I make it bigger? Is that better? So we have Postgres at the top, MySQL in the middle, and SQLite at the bottom. So all of these should support the now function, right? And that'll give you the current date, except for SQLite. SQLite doesn't have it, right? So instead, they have something called current time stamp, right? Again, you see that it's sort of an alias for now, because the output said it was going to be now. But then MySQL has current time stamp as well, right, and get that. But before, I ran now as a function, right? Now I'm running current time stamp without a function, right, without the parentheses. But I can also run it as a function, right, and get the output. So let's go back to Postgres and let's try that. Not defined, all right? So you can use the current time stamp as a keyword, but you can't use it as a function in Postgres. Let's see what SQLite has. So who says it will have the keyword? Yes or no? Raise your hand if you say yes. Raise your hand if you say no. Those who know the back said no. All right, they have the keyword. They have no function. OK. So let's do an example now where we try to compute the number of days from today since the beginning of the year, right? And so we can use an extract function. I'll use Postgres because Postgres follows this, sorry, follows the standard pretty well. And we can invoke this as extract function and we can manipulate our date fields in SQL to extract certain things. So this is telling you take the string 2017, 0906, which is today, convert it to a date type, and then we invoke the extract function to extract the day, right, which is the day of the week. And so we can try this in my SQL. Oh, let's see what it does, right? It gets the same result. We'll try it in SQLite. SQLite doesn't have it. OK. So in Postgres, getting the number of days since the beginning of the year is actually really easy, right? We take two dates, right, today and beginning of the year, and we subtract them, and it gives us the number of days, 248. Who says this will work in my SQL? Yes or no? He says yes. Who says no? Most people, right? The answer is sort of, right? It ran, but what, it produces 805. I'm not even sure what that means, right? It's not like 248 times 2, right? It's not that, right? So it produced some number. So what we can do now, though, is we can convert the, ah, sorry to say sorry, but you can sort of see the query there. So what we'll do is we'll take the current, we'll take, convert the string for the dates, convert it to the Unix timestamp, which is the number of seconds since the Unix epoch in 1970, all right? And then we'll round them to be integers and we'll subtract them from each other. But then we'll divide them by 60 seconds times 60 minutes times 24 hours, right? And then it produces 248. And we get the answer we're looking for. So this is one way to do it, but my SQL actually has an even easier way. They have what's called a date def function, right? So this is essentially doing the same thing we did in Postgres where we subtracted two things, but they have it to do it as a function. So Postgres actually has a really interesting type system, which we'll talk about later in the semester or later in the course, where it allows you actually to find your own types. And then sort of like you can do in like C++, you can then override the operators, like plus, minus multiplication to do whatever it is you want to do. So that's sort of why they have a nice, you know, you take a data object and you can subtract from each other and you produce the answer you're looking for. So of course none of this is going to work in SQLite. I took a while, I'm not going to show you how to do it. I figured out how to convert the time stamp into the Julian calendar day, which is the number of days since the beginning of the year, and then you can subtract the two of them, right? And we can round that up and there you go, the right answer, 248. So again, a seemingly simple thing to do, count the number of days since the beginning of the year, has to be done in completely different ways in SQL for all these different programming languages. OK, so now we can also talk about output redirection. So in all the examples I just showed, when I opened up the terminal, we were running some SQL query and it would dump out the result to the terminal. But maybe what you really want to do is actually keep the result of a query still in the database system and then run additional queries on top of it. And we want to do this because we avoid the problem of maybe running some query, getting the result in our application, which might be running on another machine, in another data center, and then sending it back up to actually install it as a table. We can define a single query that takes the output and puts it right back into our database. So the SQL standard says you do this by adding the into keyword, into your select statement here. And of course, my SQL has to be different. And what they do is you can create a table and then inside of the parentheses where you would normally define your columns in your table, you just put a select statement. And in whatever the output of that select statement gets inserted into this table. So with the same column names, the same types, and same everything. But let's say that you already have a table that exists and so you can't use the select into because that will actually create the table. You can actually use inserts where you put in the parentheses, the actual query you want to put into it. And so the key thing about this is that the select statement has to generate results or tuples that have the same number attributes or columns as the table you're inserting to, and they have to have the same type. Because otherwise it'll throw in air. There's also other complications you have to deal with if you have constraints on your tables where you don't want to allow duplicates. And the SQL standard, I don't think actually specifies what should happen, and all different databases do different things. So let's say I have a primary key where there's one field that always has to be unique, and then I run an insert query like this where I'm selecting 10 tuples and want to put it into this other table. Let's say, though, that there is a duplicate key in the table that I want to insert into. So should the database system only allow you to insert 9 out of the 10, should it not let you insert any of them, should it throw in air? All those different things are left up to the implementation of the database vendor. And as far as I know, the database systems do different things with this. As I said, since SQL is based on unordered bag algebra, it's often the case, though, in your application you need to show things in assorted results. So we can use the order by clause to take any SQL result as any query, and then order by the columns that we specify in our output list. We actually do more complicated things, and you get a result like this. You can do more complicated things as well. They can have multiple columns referenced in your output clause. You can actually put any arbitrary expression you want in these. They can put 1 plus 1 and sort by that. They can put anything in these things. And it'll sort based on the first one, followed by the second one. And the other key thing to point out here is that, unlike in Group By, where any column that appeared in your output clause had to appear in your order by clause, you can actually not include anything. And you can have something in your order by clause that's not in your output. In Group By, you wouldn't be able to do this. So this is telling us to sort the results of this query based on the grade field, but in our output we don't actually show it. Because this is because we're going to do the sorting first with the column, and then we do a projection to prune out the things that we don't need. Yes? How does the order by a 1 plus 1 help you mention that you were talking about? So this question is, how does the order by 1 plus 1 work? So let's try here. Let's do Postgres. All right, so we have, say, the Enroll table. So we can just select star from Enrolled. Is that big enough for everyone? See that? Stacks are from Enrolled, where, actually, when I remember where our clause is, you order by student ID. And as expected, we get our things sorted in the column here. So I can do, in theory, 1 plus 1, and it took it. So how do they sort it? Undefined, because what's going to happen is the order by essentially is going to loop through every single tuple in your output. And it will look in whatever the order by clause is and say, well, what's the value of the attribute that you're referencing, the expression you're referencing in this first order by clause? And in this case, it's 1 plus 1, and the answer is 2. So that means it's sort of thinking of it as there would be a virtual column for every single tuple that represents the order by expression. And the value is always going to be 2. And therefore, 2 always equals 2, because every tuple is going to have 2. And therefore, it just gives you whatever order is there. Is this, yeah, so what do you mean by original table? So his question is, and actually, if you look at this, so if I do order 1 plus 1, I get this order, and I get rid of the order by, it doesn't change, right? It gets the exact same order, right? So his statement is, is this in the same order that was inserted? And the answer is yes, but no. Yes, because that's what it did. But no, because, again, relational database, relational model, is unsorted. So just because we inserted things in one way at one point in time, doesn't mean the database system can't come around and insert things in another way later, or reshuffle things. And it's still considered correct. If you care about the order, you add the order by clause, right? So let's try this. Let's try order by SID, and I'll append X, Y, Z to it. No, it's OK, right? Again, I had my expression, it took the student ID, appended X, Y, Z to it, but since it's going to do this for all of them, it ends up being the same. Does that answer your question? OK, cool. All right, you also sometimes want to limit the amount of output you have. And to do this, there's a limit clause. And all you have to do is either you put the number elements you want to produce as your output at the end. Yes. The question is, there's two columns in the order by clause, and that's what? Correct, yeah. So his statement is, I have two clauses, two columns referenced in my order by clause. What's the purpose of this? Yes, this would actually be used to break ties. So the way to think about this, I'll sort everything by grade first, and then produce buckets, and then for each of those buckets, I'll go and then sort them by the student ID. Yes. Yes. My ID somewhere was 62-3-4, all the rows in the table. Yes. But should them all have a specific total ID within a grade? So his statement is, I made a previous statement that the data system is allowed to reshuffle the order in memory or in disk of how tuples are stored. And then your statement is, aren't tuples going to have a record ID? Yes. And again, this is, for our purpose that we're discussing here, we're not talking actually how these things are actually implemented. So we're dealing with SQL at the logical level, not actually the physical level. So an internal row ID, which in a disk-based data system is usually the block ID in offset. Yes, in some ways, in most data systems, that will be static, because it's always been the same block reference. It's always going to be the same offset. But typically, the row ID would not be exposed to you as a SQL developer, a SQL programmer. It's just using that internally for bookkeeping. In terms of the output, the output is what we see in these results. We don't know anything about how things are actually stored on disk. Another way to think about this is, in some database systems, let's say I insert a bunch of entries, and then I delete an entry. Instead of just having a hole in my page where I can't use it anymore, the next tuple that gets inserted would actually reuse that slot. So now when I do my select and I get things in the order, I'm actually going to get that guy that was inserted last in the middle because I've used the slot. We can try with Postgres to force it to do that later. You see how that works. But again, you should not be writing your programs in a way where you assume the order of the output. If you care about the order, you have to add the order by clause. So limit the number of entries we have. We can also use an offset at the end to say how many elements we should skip over. So this is going to skip the first 10 elements and then grab the next 20. So this is actually how you would implement certain websites where they show you a listing of results and there's a button to click the next page and it shows you the next 10 items. They're essentially running this query like this here. So that's what everything I described right there. That's pretty much basic SQL. And I wanted to sort of focus on the aggregations a little bit because it'll come up when we talk about window functions and CTEs. So now at this point we're going to actually start talking about more complicated things. So nested queries are a query where you basically have one query embedded inside of another. And the internal query, the inner query as they're called, can essentially appear anywhere in your select statement here. So here I'm going to select the name from the student where the student ID is in and then I have my parentheses and this essentially is going to be a sort of almost like a function that's going to produce a result of tuples that I can then use in my predicate to do my evaluation here. So nested queries can be sometimes difficult to write but sometimes it's the only way you can write certain queries. Actually that's not true. There's other ways but it's often the most natural way for us as humans to write SQL queries. But I will say is that these things are difficult to optimize as we will cover when we talk about query optimization. In this example it's pretty simple because it's just two queries. But you can have sort of nested queries, 10 or 12 queries deep. And these things are really hard for the query optimizers to reason about. Because essentially what the query optimizer wants to do is it wants to try to generate the most efficient plan. And just because you wrote it in a way that's kind of stupid, it can possibly come up with a better way. So if we took this particular query and read it as face value, select an ID where the ID is in and then this query here, from a human standpoint you can think of this as like a for loop on the outer query where you're going to go through every single student tuple and then you're going to rerun this inner query, get all the student IDs and then check to see whether you have a match. This is actually what MySQL used to do up until a few years ago. And it's obviously a better way, one easy way to do this more efficiently is run this query once, store it in a temporary table, then do a join against it. And that's essentially what a query optimizer is going to try to do. But if you have really deep nesting in your queries, it can actually be pretty difficult. But for our purposes here, again we're just focusing this on what the syntax looks like. So let's look at a more complex example. Say we want to run a query, we want to get the names of the students that appear in 15, 445. And so the way to sort of construct our query, you want to think of sort of at a high level what's the output we want to generate and then we can go inside now into our nested queries and figure out what is the information we need to produce that answer. So we know we want to get the names of students and then our where clause, we want to essentially get what I've written in English here, we want to get the student's ID from the set of people that are enrolled in this course. And so we know that there's going to have to be some kind of select query here that has to get the student ID and does the where clause define the match that they're in this course. So the question we would figure out is what do we actually want to put around this to actually make this do what we want to do. And for this we can use the same in clause that I showed in the last slide where again I'm taking every student on the outer query, taking their student ID and see whether it matches on the internal set there. And again the database system is going to try to rewrite this to either be a join or a temporary query with a temporary table, but for our purposes here we can write it like this. And so the key thing to point out here though is that there's scoping involved in these nested queries. So in this inner query we're referencing the student ID and the outer query we're referencing student ID. But the database system knows that the scope of this first one here corresponds to the enrolled table because it's inside of our nested query, the inner query, and then the outer query has the student ID referenced to the student table here. So we'll show some examples in a second where you actually can reference the outer query's attributes inside of the inner query, but typically you can't go the other way around. So before I show you how you can use in those actually three other operators we can use for nested queries. You can have all which basically says that in our expression all the tuples in our nested query have to satisfy their predicate. Any means at least one of them has to and then exists as, we're not actually checking any predicate. We just have to check to see whether we even got a single tuple in our inner query. So in essentially is an alias for equals any. So I'll show some more examples here. So say we want to get the names of students in 15445. And so this is essentially the same query that I showed before, but instead of putting in I'm putting equals any. So now this is saying find every student record where the student ID exists in any tuple inside of my inner query there. And I can also rewrite it like this. So now I'm going to put my inner query inside of the select output. It's no longer in the from, it's no longer in the where clause. And this is basically does the exact same thing, but now we're sort of flip the order of the inner query and the outer query. So the outer query is now doing the look up on the enroll table, and it's going to find all of the records that students enrolled in 15445. But then in my select statement, my select output, I'll run an inner query that can then now do a join between the enroll table, find any matches on the student ID, and then produce that as the output that I wanted. So this is flipping around and reversing how we thought about the query before, but it's actually going to produce the exact same result. So let's try a more complicated example. Say we want to find the student record with the highest student ID that is enrolled in at least one class. So you think you sort of want to write it like this, right? With the join, we just want to get the max student ID and then just grab their name. Will this work or not work? Who says yes? Who says no? Why no? Doesn't notice, sorry, I was going to say what? Correct, yeah, so again, as I said before, we're computing an aggregate and we're trying to reference, we don't have a group by clause, we're trying to reference another attribute of our tuple, which is undefined because we're collapsing all of the entries, all of our tuples, and producing a single aggregate result. So this doesn't work. It will run in SQLite, which we can test. It used to run in MySQL, I used to give demos in the early versions of this course with MySQL 5.6 and it would do the wrong thing, but now as a 5.7 they'll throw an error and say you can't do this. So again, the way we want to tackle this is sort of think about how to construct the query by first starting what's the output we need in our outer query and then figure out what we need to put in our inner query. So we know we want the student ID and we know we want their name from the student table and so we need to figure out how to find the student ID that is greater than every other student ID that exists in our entire table. So we know we need a select statement to go grab all the student IDs from the enrolled because we have to look in the enrolled table because we're going to make sure they're actually enrolled in the course. And we know that the student ID from the outer query has to be greater than whatever student ID we find in the inner query. And so for this, we just now use the greater than equals to and then the all operator. So again, what's going to happen here, it's going to take this full predicate here, a student ID greater than equals to all and that'll get satisfied if for every student ID it finds in the inner query, the predicate is evaluates to true. So if you find one entry with a student that is less than you, then you can stop the inner query right there and know that the predicate returns false. And apparently Justin Bieber has the highest ID. All right, and again, as a sequel, we can rewrite this in a bunch of different ways. So here's basically the same result. And we can go back to use the n clause and now we just do grab the max student ID from the entire enroll table. And we don't need a group by here because we're only computing a single aggregate attribute across the entire table. And then if our student ID from the outer query is inside of our result set, which will only contain one tuple of the inner query, then we know our predicate would evaluate to true. So one thing I'll point out is in this case here, this only works so that you can only use the inner query when you only have one output column, one output attribute. If I had comma something else, or like a group by, I had a bunch of the results, then this would actually throw an error because it knows it's trying to take a scalar attribute, student ID, on the outer query and match it to a single attribute on the inner query. So if you have multiple attributes, it doesn't know which one you're actually trying to use. It'll throw an error. We can rewrite this in a third way, of course, because it's sequel. And this is actually another general optimization you can do if you want to get the max. So rather than computing the max aggregate, you actually sort everything on the student ID and just give it a limit one. And that'll just grab the first one that you find. These are essentially equivalent. Some database systems will run the max more efficiently than the order by, and sometimes it's the other way around. All right, one more example. Let's find a course that has no students enrolled in it. And again, the same thing is we start with the outer query. We know we want to get all the course information. And then we want to find, in our inner query, the courses that don't have any students in the enrolled table. And so for this, we can use not exist, because this says, as long as there does not exist a tuple that satisfies or that is output by our inner query, then we know the outer query tuple satisfies that predicate. So for this, we can rewrite this as select star from enrolled, where course ID equals the course ID in the course table equals the course ID in the enrolled table. So here's an example of where I'm referencing a attribute from the outer query inside of the inner query. But I can't do the reverse. So the scope of the outer query is cast down into the inner query, but you can't go outside of the outer query and reference the inner query. And of course, nobody's enrolled in 15.823, because I haven't taught the course in five years. All right, so any questions about nested queries? They're really, really powerful. You can do a lot of interesting things. Historically, people with DBAs or database administrators would tell you to avoid them, because they're often difficult for the data system to optimize. But sometimes, again, as a human, it's just easier to write queries that way. All right, so now we can talk about window functions. And again, this is another example of a newer feature in SQL that sort of goes beyond what most of the textbook talks about. So a window function is a way to form a calculation or an aggregation across a bunch of rows without having actually do a group by. So think of it as like I can have a bunch of tuples in my output, and I can compute some sort of count, a min, a max, but for a sort of window of tuples without having to collapse them into a single result. And so the syntax you would use is you define your function first and what fields or attributes you want to compute your function on top of, and then you have this over clause that specifies how you want to group your windows in your tuple results. I'll show some examples that hopefully make this more clear what's going on. I said that again, the over keyword will define how we're going to slice our data up. We can also do sorting in there as well. All right, so our window functions could be any of the aggregates that we talked about before, average, min, max, count, sum, but we also have some two additional operators or two additional functions that are specific to doing window operations. So we can have row number where we can specify what order that the tuple appears in your window or the number elements in it, and then rank what tell you what the order position is. So again, as I said in the beginning, SQL and the relational model, they have no notion of order of your tuples, and therefore, they should have no notion of what rank or position you are in your output. This is allowing you to impose ordering on your data in the query results to produce the answer you'd be looking for. So in this case here, say I want to do a select on the enroll table, and then I want to count, I want to know what position each element appears in my output list. So I use the row number window function, and I define that over, here I say over nothing, so I'm not going to group anything up. And I'll alize it to the row num name, and I get an output like this. So now you see in the column all the way or in the side, I'm getting one, two, three, four, five, that corresponds to the position of the row that appears in my output. So again, internally, the database system is not storing a row number for this table. It's only when you actually compute the query that it does actually add it. So the over key work is essentially the group by part, where you can define how you want to group together your windows. Instead of using group by, you use the term partition by. So in this case here, what I'm going to do is I'm going to compute the enroll table, I'm going to get the course ID and the student ID, and I want to know what position the student is per course. So I have three courses, 445, 721, and 826. And in 445 and 721, I have two students, and so they have the row number one and two. And then the last one, they get the row number one. Because I'm taking the course ID, again grouping them into buckets or windows, and then I'm having my running total of the row number that I assign to this output column as I go along. Is this clear? Yes. So your question is, because the data system is not storing row numbers, can it, sorry, what was the last part? His question is, because the data system is computing this row number on the fly, it's actually not storing this in the database, can it change if I invoke this query multiple times and the layout, the physical structure of the database changes, the table changes? For this particular query, the answer is yes. If I cared, I could add an order by clause inside the over part, and that'll sort them in a way that'll be the same every single time I invoke the query. For this particular example, yes. Depending on what tuple appears first in the data structure, it might choose one or the other. Yes. That's a good point. So I also want to say, I have this order by clause here. That's being done after I compute the window function, not before. So if I put, say, order by, actually, we can try this. If I put order by student ID, it might actually reverse the order. We can try that. So let's do this in Postgres, because Postgres, I don't know if my SQL supports it. I know Postgres does. So this is the query that I ran. And you have, again, we have the course number first. And then we have row number one, two, one, two. So his question was, my statement that I was making is that if I put the order by on the student ID here, this order by is being run after you compute the window function. So it's going to be sorting the student IDs afterwards. So even though in my first output here, I have one, two, one, two, one, it's going in the order that they appear when I compute the row number. In my output here, it actually ends up with two, one, two, one, like that. Because it's sorted them after it computed the window function. And so if you want to make sure that you always got the exact same value, no matter how things change, I think you can put order by in here. So now this will be deterministic. No matter how the physical layout of the database changes, you'll always get the exact same result. That's a good point. Well, there you go. Spoiler. All right, you can put order by in there along with partition by. And you can do exactly what I just said. So let's get this. All right, so let's look at an example here. We want to get the student with the highest grade for each course. And now you see why SQL can get kind of gnarly here. So we have an inner query where we're going to do our window function and compute the rank, again, giving the position in the course. And we're going to partition by the course ID, but we're going to rank, order them by the grade. So you compute the rank after you do the partitioning and after you do the order by. And then in our outer query, we're going to bind this inner query as a temporary table called ranking. And then our outer query will reference the field rank that's generated here inside of the inner query. So again, you can daisy chain these things together in the same way we did with relational operators to do additional things on after you compute the inner one. And again, it's up to the data center to decide actually how it wants to execute this. It could just execute this inner query once, right, to a temporary table. It's actually probably what it would do. And then it takes that temporary table and then applies the filter on it. If it was smart, maybe it could do something like, well, I know in my outer query, I have ranking.rank equals 1. So I want to even bother materializing on the inner query. Any table where rank doesn't equal 1, I'll just throw the table away because I know I'm never going to need it. How do rank and row numbers work? The question is, how does rank and row numbers work? So the rank is the order position of the current row. So if you order, rank is done after you order on the inside. Row numbers is usually what you want, though. OK, so any questions about window functions? All right, so now, yes, the back. The question is, if you use rank on the row number here, so that gets 1, 2, 1, 2, and then I put row numbers, the same thing, I get 1, 2, 1, 2. Yeah, let's try that. 1, 2, 1, 2, and let's do rank. 1, 1, 1. Let me get back to you about this. I'm forgetting why I was doing this. Yeah, yes. Yes, it's the rank in the ordering. Not when it appears in the partition. Thank you. Yes, that's it. Is that clear? So row number says, within the window that you define in the partition by clause, what the position is of the tuple. And then the rank says, what's your position in the sorting order? Undefined. Yes, undefined. OK, so the question is, if you don't specify the order by clause, what should the order be? And again, it's undefined. The rank will always be 1. All right, so to finish up, common table expressions. So these are actually very, very powerful. Because it allows you to do things that people don't think you can do in SQL. In SQL, you don't have for loops. You don't have conditionals or control flow clauses. But with common table expressions, you can sort of do these things. So the way to think about what a CTE is, or common table expression, is that it allows you to define, essentially, a temporary table that is only scoped to the query that you're running. So you have this with clause, and then your common table expression, and then this SQL query here would then can reference up into that common table expression. And you can do this as a sort of a global thing, meaning for the entire single tuple, or you can do it on a per-tuple basis on the bottom query. You can sort of think of this as a temporary table without having to define it. So in this example here, I'm defining a CTE called CTE name. And then in the as clause, I have my select statement. And this particular example, it selects 1. It returns a single tuple with a single attribute, and the value is 1. And then down here in my bottom query, I can then reference that table and do whatever it is, as I would, on a regular table, and produce whatever answer that I need to compute. So if you want to actually make it look more like a table, you can actually give names to these columns from your CTE so that you can reference them in the bottom query. So here in my CTE, I'm doing select 1, 2. This is going to produce a single tuple with two attributes and the value be 1 and the second value be 2. And by default, I imagine Postgres will just call the first column 1 and the second column 2. But if I want to use a more descriptive name, I can put before my as clause and parentheses the name of the columns. And then down below in my bottom query, I can then reference them by that name. Yes. Is that even for that? This question is, do I need types for this? No. I mean, again, it's declared of you can infer what the type is because you know what it's trying to do. I can look in the inside of the CTE. It says select 1. Therefore, it has to be an integer. If I reference, say, a table, then I know what the types are on the table that I'm referencing. So I know the type there. So it doesn't bind it at runtime. It actually binds it at the planning time because it knows exactly what you're trying to do. So let's look at an example. It's the same query I had before where I want to find the student with the highest student ID that is enrolled in least one course. And before I showed how you could use the select maximum enrolled as an inner query, now I'm showing you how to do the exact same thing as a CTE. So in my CTE, I'm going to get the max student ID and I'll bind it to an attribute called max ID. And then in the bottom query, I can do a join between the student table and the CTE with source table and reference the max ID that I generate up above. And it produces the same answer. Yes. This question is, how is it different from nesting queries? Right, good, excellent. Because you need recursion. So in this example here, I can have my CTE reference itself infinitely. You can't do that with a nested query. So in this particular example here, you have to add this recursive keyword that allow you to reference yourself. So what's going to happen is inside of my query here, I'm going to reference the counter field that I defined from CTE source, even though I'm defining what CTE source is. So this particular query here, it will produce the sequence of numbers from 1 to 10 because it's going to invoke itself multiple times and add 1 to the counter. The first time you invoke it, you select 1 and it returns 1. And then you do a union all to just union together all the tuples from itself. So this is why this is different than a from a nested query. And this allows you to do certain things that are typically difficult to do in SQL. So say you want to do graph traversal. You have a hierarchical table or a hierarchical schema, like a tree structure. You can use this to walk the graph or walk the tree. It'd be hard to do that in regular SQL because you'd have to define exactly the number of steps you think you're going to have. Whereas this, now you can have control flow to loop through multiple times and find the data that you're looking for. Yes? It's like SQL, do you like schema checking and check checking to make sure in the previous example you compare your ID to a mass value so worked out that you can pass it in. Is that a failure? So his question is, is SQL type safe? Is it going to check to see whether you're trying to do a comparison between two incompatible types? The answer is yes. A lot of times they'll try to be nice for you. We can give a demo. Like if you have the string 123 and you try to compare it with the number 123, it'll cast it for you. But in some cases, it says I can't do that. It'll throw an error. And then it can figure this out before you actually run the query. You don't have to do runtime checks. So I'll give a demo of this CTE in a second, but I want to show the example that he was asking. So he was asking, say I have the string 123 and I want to add the number one to it. Who says this will work? Yes or no? Yes? Yes. All right, let's try my SQL. Select 123 plus one. Who says yes, who says no? Yes. All right, SQLite. No? I did it. But so let's take that same example, right? Let's say I put ABC in this now. Who says yes, who says no? Raise your hand if you say yes. Raise your hand if you say no, right? Invalid. My SQL yes or no? Yep, gives you a warning. You have to go look to see what that was. All right, SQLite. Raise your hand yes. Raise your hand no. So the SQL standard says yes. You should check these things, but it doesn't always do that. OK? So I'm going to show an example of that. So now with CTEs, we have loops. We have recursion. And what's the problem with that? What's that? You can make recursive calls to yourself. What's the problem? Infinite loop, right? So what I'm doing here is I'm telling Postgres that to set a time out that if any query runs longer than 10 seconds, kill it, right? Because otherwise, it'll run forever. So here's an example of a recursive query that doesn't, that has an infinite loop, right? Because I don't have the clause in there that says if my counter that I'm computing goes above the number 10, kill itself, right? So after 10 seconds, it kills itself. And we know this is actually doing something, right? Because we can, this is running on my laptop. And we can run the same query. And we should see Postgres spiking at 100%, right? 100% CPU, right? Because it's just burning through the cycle to try to compute this. And so if we add now the conditional clause, then we need to make sure that we stop after we find all the results that we're looking for. It'll do that, one, two, three, to 10. So I don't think SQL Lite supports recursion. MySQL 5.7 does not support CTEs at all. The new version, MySQL 8, that's coming out in a year. I don't know why they skip six or seven, but it's MySQL 8. That will support CTEs. CTEs are very powerful, they're not very common, but allow you to do certain things. The way to think about this, what it's going to allow you to do, and this is sort of my concluding remarks, is that CTEs are going to allow you to be able to write complex queries that only need to run on the database system. So in your homework assignment, you could write some Python code that does a bunch of selects, brings some data down, crunches on it, and then maybe runs another select and computes another answer. But that's really inefficient because you're going back and forth over the network. I think it's SQL Lite, it's the same process, but you're going back and forth between your application code and the database system code to compute this answer. But with things like CTEs, nested queries, window functions, that allows you to write within a single SQL statement all the logic you need to compute the answer that you're looking for. And we'll see later on we're going to talk about user defined functions, but this is a lot of more powerful things is now they actually have true for loops and you have true programming language constructs that you can embed inside of your database system and invoke them within a SQL statement. We'll see those later. Yes? Are CTEs and nested queries equivalent at the logical level? His question is, are CTEs and nested queries equivalent at the logical level? No, because CTEs can do conversion and nested queries cannot. Nested queries are specialties of CTEs. Do query optimizers actually do common subquery alienation? His question is, do query optimizers actually do specialization or CTEs? They transform subquery into CTEs. I don't know. Yeah, actually, I don't know. We should find out. OK. Any other questions? Yes? It's not going to be like a loop to go as far as to say it's primitively cursive, but it looks like it's always going to have to, is there any answer? Nested queries. Yes. Yes. Correct. I mean, you can't write a nested query that references itself, right? So it's not equivalent. I'm trying to think of a way that you can do this views. Views, you can't do that either. Yes, I think CTEs are a special case because they're recursive. Yeah, I was just wondering. All right, any questions? All right, so homework one is due next Wednesday. We're releasing project number one on Monday. And I'll spend a little time talking about the layout of the code. So everything you're going to do for the programming projects will be based on SQLite. But rather than hacking the internals with SQLite, although as beautiful as it is, it is very complex, we are sort of providing you with a nice little shell environment that can use the SQLite front end, but all the actual internals of your storage manager will be written by you guys. All right, guys, thank you for coming. And we'll see you on Monday next week.