DJ2PL. Let me make sure I do a backup recording. All right, some quick administrative things from last class. It's only been one day and we already got emails. The first thing is that I made a mistake. Last class, I said Ted Codd got his PhD at Penn. He got his PhD at Michigan. That's where a bunch of the great database people got their PhDs. Mike Stonebraker got his PhD there. Ted Codd got his PhD there. David DeWitt, another famous database guy. So it was not Penn, it was Michigan. All right. The other thing is people complained about the audio. "Dear Andy, the audio of your class sucks, so I can't listen to it. Hire a sound engineer and do it right." "Andy, I love the course, but I can't listen to it because the audio is messed up. What happened? Why are you doing this? I've lost the will to live." So that was my fault last time. So we're double recording, and hopefully we won't have any issues with posting it this time. And then we actually got emails about you, which is surprising, right? "I saw DJ2PL last month performing at the Bridge, a 21-and-over show in Pittsburgh. Is that true?" All right. You're very lucky to get a DJ like that. He is expensive. CMU does not have a lot of money. Okay. "DJ2PL is ridiculous. Is he single? I'm asking for my friend." And then, "He's like Taylor Swift without any thumbs." I don't know what that means. Are you single? Okay. All right. So anyway, that's their problem. All right. Cool. So today's class we're going to talk about SQL. Last class we spent time talking about the relational model. We talked about how, in my opinion, it's a superior data model for every possible database you can think of. The relational data model can be used to represent pretty much all the different schemas that are out there. And then we showed how relational algebra was the building block for how we would define and execute queries over a relational database.
So today's class is really about SQL, which is a declarative, or non-procedural, language for interacting with a database system. And what we'll see is that SQL has evolved: in the beginning, in the 1970s, when it was first defined, it was very strict about what a relational database should look like, but over the last 40 years it has expanded to support things that don't look relational, like JSON, for example. So let's start at the beginning and talk about how SQL got started, and then we'll talk about the more interesting things you can do in modern versions of it. SQL goes back to the 1970s. Again, when Ted Codd wrote that first paper, he didn't define a programming language for operating on a relational database. It was all mathematical. People said the paper was so inscrutable that no one could understand it. If you actually read it, it's actually pretty easily understandable; I guess people just didn't like math back in the day. So some people at IBM saw this paper and started building experimental relational databases to see whether they could take his mathematical ideas and put them into practice. As far as I know, the very first relational database language was this thing called SQUARE that IBM invented in 1971. This was for an early project IBM was developing, one of the first relational database systems, probably the first one, which sounds like a weird experimental rock band: the Peterlee Relational Test Vehicle. That was the first thing they built, an early prototype to show that you could take Ted Codd's ideas and actually put them into a real system. The problem with SQUARE, though, is that you could never actually use it, because you had to write it in a weird vertical notation, which you can't really type. This is from one of the original papers.
To do a scan on a sales table by department, you would write it in this weird vertical way with characters that you wouldn't have on a keyboard even today. So no one actually ever used this. So IBM threw that away and started building a new query language called SEQUEL, spelled S-E-Q-U-E-L, for the System R project, which is a system we'll talk about throughout the semester. This was the second relational database system IBM started building to try to show that Ted Codd's work could actually be done. The Peterlee one was in the UK, with a small team. The System R project was at IBM Research in San Jose, and that was a major, major undertaking. So they defined SEQUEL back in 1972. This was Don Chamberlin and Ray Boyce working on this query language. The idea was that it's supposed to be the Structured English Query Language. But in the 1980s, when IBM put out a commercial relational database system, they got sued over the name SEQUEL, because there was some other piece of software that was using it. So they just shortened it back to SQL, just the letters: the Structured Query Language. There was another very famous project at Berkeley in the 1970s, at the same time System R was getting started, called Ingres. Who here has heard of Ingres? Nobody. Who here has heard of Postgres? You want to know why Postgres is called Postgres? Because it's post-Ingres. Stonebraker, the guy that built Ingres, commercialized it in the late 1970s, and then he went back to Berkeley in the 1980s and built a new system that was supposed to be post-Ingres. That's why it's called Postgres. So Ingres had this other query language called QUEL. And the pun in the name SEQUEL is that it's supposed to be the sequel to QUEL, because the IBM guys knew what the Berkeley people were doing, and they were trying to build a better query language.
Stonebraker would argue that QUEL is better, but of course, no one uses that today. In the 1970s, IBM was making a lot of money off of IMS, which is not a relational system and didn't support SQL. Then they realized that SQL was going to go somewhere, that relational databases were going to go somewhere. So they released a bunch of early products like System/38 and SQL/DS, but the big one that really took off was DB2, which is still around today. Again, IBM was a big juggernaut in the computing world, so whatever IBM said they were going to do sort of became the de facto standard. So when IBM came out with a relational database that supported SQL, even though there were competing languages like QUEL, everyone coalesced around SQL. SQL became an ANSI standard (the American standards body) in 1986, and then it became an international standard in 1987. And even though it's a roughly 50-year-old language now, it has evolved and expanded over time. The latest version of the SQL standard actually came out in March of this year, 2023. You can see a list here of the history of all the updates, the various features they've added over time. The main takeaway from this listing is that as applications have evolved, as the trends in software development have evolved, SQL has moved along with them and adopted new ideas and new capabilities. So in 2023, the two big features that came out are, first, that you can now do property graph queries directly in SQL. Somebody brought up Neo4j last class; that's a special-purpose graph data model database system. But now you don't need that anymore, because you can run graph queries directly in SQL, because the SQL standard supports it. They also added support for multi-dimensional arrays. I said before that a lot of machine learning stuff is based on arrays or matrices. Now you can operate on these directly in SQL.
Just because the standard has defined something doesn't mean every system is going to support it. I don't think any system really supports the multi-dimensional array stuff. Oracle supports the property graph stuff, Postgres will eventually get there, DuckDB has a prototype for it. But just because it's in the standard doesn't mean everyone is actually going to support it. In my opinion, the minimum support you need to say that your database supports SQL is what's defined in the SQL-92 standard. That's SELECT, INSERT, UPDATE, DELETE, CREATE TABLE: the basic functionality. So again, even though SQL is over 50 years old, it's not a dead language, and there are updates all the time. And of course, every five or ten years some new technology comes out and people say that SQL is dead and about to be replaced. Ten years ago it was NoSQL, and the hot thing now is ChatGPT or vector databases. You see a lot of these claims on Twitter or social media that SQL is going to die because ChatGPT or natural language is going to replace it. This is all a bunch of hype. It's interesting, but it's not going to replace SQL. SQL was here before you were born, and SQL will be here when you die. And I've made public statements basically saying this; there's an article in some magazine that quoted me last year where I basically said you need to know SQL if you want to do anything in computer science. So a relational language like SQL is going to have three parts. There's the DML, the data manipulation language: that's our SELECT, INSERT, UPDATE, and DELETE queries that interact with our database. There's the DDL, the data definition language: that's the CREATE TABLE statements, the CREATE VIEWs, for creating the entities, the objects, in our database.
And then there's the DCL, the data control language, for security and access control; we're not really going to cover this. Some systems let you specify which users are allowed to see which rows or which columns or which tables and so forth. The SQL standard specifies all of these things. Another big thing we'll see later in the semester is the definition of transactions: how do you define a bunch of SQL statements that you want to happen atomically, in an isolated way? Again, the SQL standard covers this. We'll see bits and pieces of this as we go throughout the semester, but for today's lecture we're really going to focus on the first one, the DML. And a reminder, picking up from where we left off last class: SQL is based on bags, meaning there can be duplicates, whereas relational algebra is based on sets. We'll see some cases where we have to add extra stuff to our SQL statements to deal with that. So today is just going to be a crash course on modern SQL. I'm assuming everyone, whether you realize it or not, knows enough from the SQL-92 standard: SELECT, INSERT, UPDATE, DELETE. I want to talk about the more sophisticated things you can do with them. But the overarching theme will also be that we'll open up the terminal, try a bunch of these queries in different database systems, and see that even though there is a SQL standard, an internationally recognized document that says here's what SQL should look like, nobody implements it exactly that way. Everyone has these weird nuances and quirks where they have different features or different nomenclature or syntax to do certain things, and in some cases different semantics for different operations. Even though there's a SQL standard, it's going to be different from one system to the next.
Who do I think is the biggest offender for the worst SQL implementation? Worst is not the right word; the one that deviates from the standard the most. Take a guess. Yes, in the back? MySQL. He got it right, yes. Out of all of these, MySQL does all sorts of weird things. More recently they added a flag where you can make it be more strict and closer to the SQL standard, but for the longest time they did a bunch of weird things. And my problem is that I first started using relational databases in high school in the 90s, and we were using MySQL 3. So I have all these bad habits that I picked up from MySQL, thinking, oh yeah, this is what SQL is. Then when you start playing with other systems, you realize: this is crazy, they're doing some weird stuff. But it's gotten better; MySQL 8 has certainly improved a lot. All right, so we'll go through aggregations, GROUP BYs, and the string and date/time operations; that last one is where we'll see all the problems. And then a bunch of other different ways to interact with SQL queries. Another theme we'll talk about is that the goal of writing a SQL statement is oftentimes to do all the computation on the database server itself, within one overarching SQL query. We don't want to have to do a SELECT, get some data back into a Python program or something, operate on it, and then push it back and do more queries. We want to do everything we can on the server side, inside the database system, because we want to push the query to the data, not pull the data to the query. This will make more sense as we go along. All right, so for today we're going to use a simple example database like this. It has three tables: student, enrolled, and course. It's basically trying to mimic a university. There are students.
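Since the demos keep coming back to this student/enrolled/course example, here is a minimal, hypothetical version of it built in SQLite through Python. The column names and sample rows are my own assumptions for illustration, not the lecture's actual slides:

```python
import sqlite3

# Sketch of the lecture's example university schema. Column names and
# sample rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student  (sid INTEGER PRIMARY KEY, name TEXT, login TEXT, gpa REAL);
CREATE TABLE course   (cid TEXT PRIMARY KEY, name TEXT);
CREATE TABLE enrolled (sid INTEGER, cid TEXT, grade TEXT);
INSERT INTO student  VALUES (1, 'Tupac', 'shakur@cs', 3.5),
                            (2, 'Biggie', 'smalls@cs', 3.9);
INSERT INTO course   VALUES ('15-445', 'Database Systems');
INSERT INTO enrolled VALUES (1, '15-445', 'A'), (2, '15-445', 'B');
""")
n_students = conn.execute("SELECT COUNT(*) FROM student").fetchone()[0]
```

Every demo that follows can be reproduced against a throwaway in-memory database like this one.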
They take classes, they're enrolled, and they get grades in the various courses. So we'll use this as the sample data as we go along. All right, first thing: aggregations. Aggregate functions are a way to compute some mathematical function over a sequence of data, a bag of tuples, and collapse it down into a single value. The classic ones are AVG, MIN, MAX, SUM, and COUNT. You're computing, say, the minimum value of a column across all tuples within a relation. So a simple example: say we want to count the number of students that have a login ending with '@cs', like you have an @cs email address. We just put the COUNT function here, and what's inside of it actually doesn't matter for a COUNT; we're just counting the logins. Then we have our WHERE clause specifying which tuples should qualify. In my example here, I'm putting login inside the COUNT, but you don't actually have to do that. You could put a star; that's equivalent, because again, it's just counting the number of entries. You could actually put a one; also equivalent. It doesn't matter. You can really put anything inside. You can put 1+1+1, and the database system should be smart enough to realize that in this last example it doesn't care what the expression inside the COUNT is, and it won't actually do that math, because all it cares about is the count of the tuples. You can have multiple aggregates in a single SELECT output. So here we compute the average GPA and the count of the students, again the ones that have the @cs login, and you get back a single record in the output with the two computations.
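To see that the expression inside COUNT doesn't matter, here's a small sketch (made-up table and rows) showing that COUNT(login), COUNT(*), COUNT(1), and COUNT(1+1+1) all return the same number of qualifying rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, login TEXT)")
conn.executemany("INSERT INTO student VALUES (?, ?)",
                 [("Tupac", "shakur@cs"), ("Biggie", "smalls@cs"),
                  ("Dre", "dre@ece")])
# Each COUNT form below counts the same qualifying rows; for non-NULL
# expressions the system never actually needs to evaluate them per row.
query = "SELECT COUNT({}) FROM student WHERE login LIKE '%@cs'"
counts = [conn.execute(query.format(expr)).fetchone()[0]
          for expr in ("login", "*", "1", "1+1+1")]
```

Here all four variants count the two '@cs' rows.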
The important point to understand with aggregation, since you're coalescing multiple tuples down to a single scalar value, is that you can't reference anything in the SELECT output that isn't part of the aggregate. So I can't do something like this: I can't select the average GPA after joining the student table and the enrolled table and then also spit out the course ID of the enrolled table. This is not defined; it doesn't make any sense. You're taking multiple rows and collapsing them down for the average; what is the course ID in this context? It's nothing. What you're really trying to do in this case is, for each course ID, get the average GPA. So what you need to do is use a GROUP BY clause, where you put tuples into buckets based on whatever the attributes in the GROUP BY clause are, and then compute the aggregate on each individual bucket. Think of it like this: if I first do the join between the enrolled table and the student table, I have all the possible combinations based on the join, and then I split them up based on the course ID, because that's what I have in my GROUP BY clause, and then I compute the average GPA for each of those buckets. Makes sense? It just matches up like this. So again, the main takeaway is that anything that's not part of an aggregation has to appear in the GROUP BY clause. In this case here, I don't have the student name; I can't put that there. I'd have to put it in the GROUP BY clause. We can open the terminal if you want. MySQL used to let you do this in some cases. Let's try it and see what happens. I hate typing on my Surface; I'm going to use this laptop here. I'll log into it. All right.
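The join-then-bucket idea can be sketched in SQLite (data invented): join enrolled to student, bucket the joined tuples by course ID, and average the GPA within each bucket.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student  (sid INTEGER, gpa REAL);
CREATE TABLE enrolled (sid INTEGER, cid TEXT);
INSERT INTO student  VALUES (1, 4.0), (2, 3.0), (3, 2.0);
INSERT INTO enrolled VALUES (1, '15-445'), (2, '15-445'), (3, '15-721');
""")
# Join, split into buckets by course ID, then aggregate each bucket.
rows = conn.execute("""
    SELECT e.cid, AVG(s.gpa)
      FROM enrolled AS e JOIN student AS s ON e.sid = s.sid
     GROUP BY e.cid
     ORDER BY e.cid
""").fetchall()
```

Each output row is one bucket: the course ID (which is in the GROUP BY, so it is legal to output) plus the aggregate computed over that bucket.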
So I have Postgres and a bunch of other database systems running. The query we were trying to do was: select the average GPA and the course ID from the join of the enrolled table and the student table. Postgres doesn't let you do this, because it says the course ID has to appear in the GROUP BY clause. That's good; that's what we expect. Let's go over to MySQL. MySQL doesn't let you do it either; let me see what mode it's in by default. All right, it doesn't let you do it, but only because of the SQL mode it's enforcing. If I switch the mode and run the same query, it lets me do it, and it spits out a course ID and 5445. Is that correct? No, right? What does that even mean? It's the average GPA across all courses, but it's spitting out just one of the course IDs. So that's bad. Now let's take the same query and go over to SQLite. Who thinks it's going to work? Raise your hand if you say yes. By "work" I mean it'll actually run the query; I'm not saying the result is correct. We have one yes, two yeses. Who says no? Most say no. It ran. And it also spits out 5445. Is that the same value as MySQL? Yeah, the GPA looks the same. All right, let's go to Oracle. Oracle doesn't like it. Let's go to DuckDB. DuckDB doesn't like it. So MySQL will do it if you make it behave more like MySQL 5 and 7, and SQLite will do it. So this is the first example, and we'll see it many times, of the SQL standard saying one thing but different systems doing different things. All right, next up is the HAVING clause. If you want to filter on the aggregate columns you're generating, you can add a HAVING clause to specify which tuples should match after you compute the aggregation.
So say I only want to see the entries where the average GPA is greater than 3.9. In this case, I'm computing the aggregation, SELECT AVG(gpa) AS avg_gpa, and I'm trying to reference that alias inside my WHERE clause. I can't do that, because at the point when the system is actually executing the query, it's computing the aggregation as it goes along; it doesn't know what the final result is yet. The easy fix is to use a HAVING clause, which basically tells the system: perform the aggregation, produce the output defined in the SELECT statement, and then apply this additional filter from the HAVING. Actually, referencing the alias in the HAVING is not correct either in some cases. I don't think the SQL standard lets you do this, because even though I have an alias up here for the average GPA, the database system could say: I don't know what this is. MySQL lets you do it; Postgres does not. So instead you basically have to write the aggregation expression again. And the database system should be smart enough to recognize that this AVG on the GPA is the same as that AVG up there, and therefore not perform the same computation twice. So essentially you do this: write the aggregation again in the HAVING, and do the additional filtering to throw out the things you don't want. Makes sense? All right. Strings, dates, and timestamps are where things get really weird; not weird, but really inconsistent.
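A portable version of that HAVING query, repeating the aggregate expression instead of referencing the alias (schema and numbers are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student  (sid INTEGER, gpa REAL);
CREATE TABLE enrolled (sid INTEGER, cid TEXT);
INSERT INTO student  VALUES (1, 4.0), (2, 3.0), (3, 3.95);
INSERT INTO enrolled VALUES (1, '15-445'), (2, '15-445'), (3, '15-721');
""")
# Repeat the aggregate in HAVING rather than using the avg_gpa alias;
# a decent optimizer computes the aggregate only once anyway.
rows = conn.execute("""
    SELECT e.cid, AVG(s.gpa) AS avg_gpa
      FROM enrolled AS e JOIN student AS s ON e.sid = s.sid
     GROUP BY e.cid
    HAVING AVG(s.gpa) > 3.9
""").fetchall()
```

Only the bucket whose average clears 3.9 survives the HAVING filter.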
For string data types, the SQL standard specifies that the case of the strings within the values (I don't mean the strings in the SELECT statements, I mean the actual data you're storing) should be case-sensitive, and that when you have constant strings in your SQL statement, you use single quotes. Postgres, SQL Server, and Oracle follow the standard. MySQL is case-insensitive by default, and both SQLite and MySQL support both single and double quotation marks to represent constant strings. So let's see what MySQL does and how weird this is. Let's go back here. So you can represent a constant like this: you can have a SELECT statement without a FROM clause in MySQL. It basically takes whatever the input is and spits it out. If I put a comma, I can do one, two, three, and it'll make columns for all the things in the output. For strings, I can use double quotes or single quotes. In the case of Postgres, it won't let me do double quotes, because it's trying to look for a column named Tupac; double quotes are the way you escape column names. But it supports single quotes. In SQLite, they support both: I can do "Tupac" like this, and I can go with single quotes like that. In Oracle, it's single quotes, but it doesn't like queries without a FROM clause. So Oracle has this weird thing called the DUAL table. It's a fake table that comes with Oracle to allow you to write these kinds of queries against tables that don't actually exist. So then I can get that. If you try to do SELECT * FROM DUAL, you just get an X. I think in the newer versions they got rid of DUAL; you don't need it anymore.
Well, this is Oracle 21, so it's a rather new version, and you still can't do this. But in Postgres or most other database systems, you can treat SQL like a calculator; you can put whatever you want in a clause like that. So let's go back to MySQL and look at some string functions. I can do SELECT * FROM student WHERE name = 'Tupac' with weird casing, and it matches the string 'Tupac', because internally MySQL is treating the VARCHAR as case-insensitive. If you want it treated like any other database system, where it actually looks at the case as a true VARCHAR, you can add this BINARY keyword in front of the column name, and that treats it as a binary string like any other system. In this case it now doesn't match, but it tells me I have a warning. So now I have to call SHOW WARNINGS; again, this is MySQL-specific. And it tells me that the BINARY expression is deprecated and will be removed, and at least it tells me how to write it correctly: I have to CAST the name AS BINARY, and then I can call it. So if I change the casing back, then I get Tupac. This burns a lot of people, because if you don't know that your VARCHAR is case-insensitive, you could store things multiple times thinking they're going to be different because the case is different, but then MySQL says they're the same. Again, this is only MySQL; I don't know of any other database system that actually does this. So that's a weird one. Yes? Your question is: why did it match when the name is capitalized differently? So the data is being stored with its case preserved.
The comparison operator, when it actually executes the WHERE clause, is ignoring case. It's not calling whatever string compare you'd have in libc; it's calling either their own version of it or a case-insensitive version of it. Because that was some decision somebody made in the 1990s that has carried over to today. Yes? Why did they make that decision? Ask me that question at the very end, after we get through all of this. Again, it's probably because somebody just did it one way; they decided how to do it. Or, in MySQL's case, the guy was in many cases trying to follow what Oracle did, but Oracle doesn't do this. I have no idea. We could email the guy; he's still alive. So yeah, there are a lot of times where people just did stuff because one person did it without really thinking through the implications, or because they were trying to copy some other system where they liked some particular feature or functionality. Any other questions? We'll see many examples of "why would anybody ever do it this way." All right. I think I showed a query like this before, but just to make sure you've seen it: there's the LIKE operator in SQL, and you use it for really primitive string and pattern matching. You write LIKE and then use a percent sign to represent a wildcard. If you're coming from the Unix world, a star in globbing or dot-star in regular expressions usually means match anything; in SQL it's the percent sign, and it matches any substring, including the empty string. If you just want to match exactly one character, you use the underscore. There is also support for regular expressions; I forget whether that's in the SQL standard, but everyone does it slightly differently. With those you can write more complex string matching patterns.
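Both behaviors just described can be demonstrated in SQLite (sample data invented): `=` on strings compares bytes there, so mismatched case finds nothing (unlike MySQL's default), while SQLite's LIKE happens to ignore ASCII case; and `%`/`_` act as the wildcard and single-character match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, login TEXT)")
conn.executemany("INSERT INTO student VALUES (?, ?)",
                 [("Tupac", "shakur@cs"), ("Biggie", "smalls@ece")])
# = compares bytes in SQLite: mismatched case does not match.
eq = conn.execute(
    "SELECT COUNT(*) FROM student WHERE name = 'TuPaC'").fetchone()[0]
# ...but SQLite's LIKE ignores ASCII case by default, so this matches.
like = conn.execute(
    "SELECT COUNT(*) FROM student WHERE name LIKE 'TuPaC'").fetchone()[0]
# % matches any substring (including empty); _ matches exactly one character.
pct = conn.execute(
    "SELECT COUNT(*) FROM student WHERE login LIKE '%@cs'").fetchone()[0]
one = conn.execute(
    "SELECT COUNT(*) FROM student WHERE login LIKE 'small_@ece'").fetchone()[0]
```

Note the irony: SQLite's `=` follows the standard while its LIKE does not, which is exactly the kind of per-system quirk the lecture keeps hitting.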
There are a bunch of string functions in the SQL standard, and they do the things you would expect. If you're familiar with Python, it's all the same kinds of functions: uppercase, lowercase, substrings, replacing strings. For the most part, these are pretty consistent across the various systems. Where things go wrong is with what you'd think would be the most simple operation: concatenating two strings. That's where everyone likes to do something slightly different. The SQL standard says the double bar (||) is how you concatenate strings. SQL Server uses the plus sign. And MySQL, at least under the default mode, doesn't have the double bar and doesn't support the plus sign; you have to use the CONCAT function. We can see that real quickly. Going back to MySQL: if I try the double bar, I get warnings. SHOW WARNINGS tells me it doesn't like my syntax: the double bar is a synonym for OR, and that usage is deprecated, and it didn't like the way I was passing along the '@cs' string. So we change the SQL mode in MySQL to follow the SQL standard, and now I can get the concatenation that I want. You'd think concatenation would be the one thing everyone does the same way, but in MySQL's case it's some legacy thing from the 90s that they're trying to slowly undo. All right. Date and time is probably the worst one. The SQL standard defines a bunch of date types, time types, timestamp types, and different calendar types, like the Julian calendar and the Gregorian calendar. But the syntax varies across systems in pretty annoying ways. So I want to give a demo now where we try to do what would seem like a simple computation.
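For reference, the standard double-bar concatenation works as-is in SQLite; the MySQL and SQL Server spellings shown in the comments are the alternatives mentioned above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite follows the standard here: || concatenates strings.
full = conn.execute("SELECT 'shakur' || '@cs'").fetchone()[0]
# Per the lecture, the same thing is spelled differently elsewhere:
#   MySQL (default mode): SELECT CONCAT('shakur', '@cs');
#   SQL Server:           SELECT 'shakur' + '@cs';
```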
We just want to count the number of days from the beginning of the year until today. It's like 230-something, 240-something, right? Just the total number of calendar days. We're going to do this first in Postgres, then in MySQL, then in SQLite. The first thing we need to do is figure out how to get the current date and time. In Postgres, there's a function called NOW() that gives you back a timestamp with the current date. In MySQL, you can do the same thing. In SQLite, you don't have a NOW() function. DuckDB follows Postgres for a lot of things because they use the same SQL grammar, so they have a NOW() function. We'll go to Oracle. Oracle does not have a NOW() function. All right, so there's another way to get the timestamp: in the SQL standard there's something called CURRENT_TIMESTAMP. Except it's not a function, it's a keyword. MySQL has both the function and the keyword. SQLite doesn't have the function but has the keyword. And Oracle gives us a weird error about that one; we'll come back to it in a second. They don't have the keyword, so they have the function, but we're getting this other weird error: date-time interval precision out of range. Okay, so what's that? Oh, because, right, it's Oracle: it doesn't like having a SELECT statement without a FROM clause. So let's add our fake table, DUAL. Then we get it, but with the keyword and not the function. All right, so now we can get the current timestamp for the current day. And now we can start casting strings, or VARCHARs, into date types, and there's this EXTRACT function in the SQL standard that allows us to extract some part of that date or timestamp.
So this says: extract the day from today's date as a string cast into a date type. And again, each of these systems has its own non-standard syntactic sugar. In Postgres, if I just give it the string, it throws an error, because it says the EXTRACT function needs to operate on a date and you're giving it a VARCHAR. But I can add two colons and then 'date' at the end, and that casts it to a DATE type. Can you see that, or no? Sorry, let me do this. I know it's wrong; let's try it again. All right. So here I give it a string, then put ::date, and that converts it to a date. But that's only in Postgres. I can't do this in any other system except DuckDB, because they follow the same grammar. If I go to MySQL and try the same thing, it doesn't like that. SQLite doesn't like that. Oracle's not going to like it even with FROM DUAL; it doesn't know what that date cast is. But DuckDB does it, because again, DuckDB follows the same grammar. Okay. So we can use this EXTRACT function to pull out parts of the current date, and now let's try to figure out the number of days from the beginning of the year until now. Let's start with Postgres. It turns out it's pretty simple in Postgres: we can just cast the string of today's date to a DATE type and subtract the beginning of the year from it. If we wanted to, we could go back here and use CURRENT_TIMESTAMP, or maybe the NOW() function; that should work too. So that takes today, casts it as a date, and subtracts the beginning of the year from the casted date. And we get 241, which I assume is correct. So let's try the same thing in MySQL. We'll do it with casting. And now we get a weird number: 729. What's that?
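SQLite is the odd one out for EXTRACT too: it has no EXTRACT at all, and the usual workaround is strftime(). A sketch with a fixed date (the lecture's date, so the result is deterministic):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Standard SQL would be: EXTRACT(DAY FROM DATE '2023-08-30').
# SQLite has no EXTRACT; strftime() with a format code is the workaround.
day = int(conn.execute("SELECT strftime('%d', '2023-08-30')").fetchone()[0])
year = int(conn.execute("SELECT strftime('%Y', '2023-08-30')").fetchone()[0])
```

Note that strftime() returns text, so you cast the result back to an integer yourself.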
And surprisingly, somebody in a YouTube comment of all places told me what it was. And this is weird. The first number is today's month minus January: 8 minus 1 is 7. Then — today's the 30th, right? — it's today's day minus January 1st. That's 29. So you get 729. So that's wrong. Can't do that. So what we can do instead is — sorry, wrong window. All right, there we go. Okay, sorry. So what we're doing here now is getting the date of today and the beginning of the year and converting them to UNIX timestamps. A UNIX timestamp is the number of seconds since the UNIX epoch, January 1st, 1970. So we convert today into the number of seconds since 1970, subtract the number of seconds for January 1st, and divide by 60 seconds times 60 minutes times 24 hours. And we get 241. So that was my original idea. And then it turns out there's a DATEDIFF function in MySQL that can do this. But Postgres doesn't have it. DuckDB doesn't have it. See if Oracle has it — from dual. All right, they don't have it. So that's the MySQL story. All right. So now let's try it in SQLite. SQLite doesn't have DATEDIFF, and we can't do the subtraction that we did in Postgres. The best solution I could come up with is to convert the timestamps for today and the beginning of the year to Julian day numbers — the count of days since an epoch way back in 4713 BC. You laugh, but a lot of the banks ran off of that up until the 80s. And then you get 241. But of course we're getting it as a floating-point number, so we cast it to an integer and then we get 241. All right. I forget how to do this in Oracle, so we're not going to do Oracle.
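The SQLite Julian-day trick described above can be reproduced from Python (the dates are the ones used in the lecture: August 30 and January 1, 2025):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Days between Jan 1 and Aug 30, 2025: subtract the Julian day numbers.
# julianday() yields a float, so cast the difference to an integer.
(days,) = conn.execute(
    "SELECT CAST(julianday('2025-08-30') - julianday('2025-01-01') AS INTEGER)"
).fetchone()

print(days)  # 241
```

Since both date literals default to midnight, the fractional parts cancel and the cast is exact.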
But the main point again: it seems like it'd be a simple thing, but all this timestamp stuff is woefully different across systems. All right. Any questions about this so far? Yes. Why would you want a lower() function? Good question. I mean, you might need it for data cleaning. It's in the standard, right? And it doesn't have to be in the WHERE clause; you can have it in the output as well. So if I go back to MySQL: SELECT star FROM student WHERE name equals Tupac — student singular, right? But maybe I want this in my output, get it lowercase like that, right? Yes, sorry. Yes. Yeah. So her question is, why do systems have all these weird idioms in their SQL when at a high level they all seem to be doing the same thing, but these one-off things are different? And that's related to his question: why are there all these different nuances across the different systems? Because somebody was writing it and thought it was cool, right? And then they showed their friends, like, yeah, that's cool. So the double colon in Postgres — I agree that's cool, that casting thing, but only they do it. The dual table in Oracle — I don't know, whatever. Let me give another example. There's a shortcut in SQL to do basically SELECT star. So SELECT star FROM student gives you all the tuples. But in Postgres, and I think this is also in the SQL standard, I can just write TABLE student and get that. In MySQL, I can do that. That's cool. SQLite — nope, doesn't like it. In DuckDB, they do it, but they also have another one: I think you can just write FROM student. Right? You can just do that. So they all have these weird idioms.
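The lower() usage from the Q&A above — in the WHERE clause for case-insensitive matching, and in the output to lowercase the result — looks like this in SQLite via Python. The student table and its rows are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (sid INTEGER, name TEXT)")
conn.executemany("INSERT INTO student VALUES (?, ?)",
                 [(1, 'TUPAC'), (2, 'RZA')])

# lower() in the WHERE clause normalizes case before comparing;
# in the SELECT list it lowercases the output itself.
rows = conn.execute(
    "SELECT lower(name) FROM student WHERE lower(name) = 'tupac'"
).fetchall()

print(rows)  # [('tupac',)]
```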
I mean, some of these things are based on customer feedback. The customer says, I need functions that operate on JSON, so somebody adds that. And a lot of times these features get added before they show up in the standard. The JSON and XML stuff is a good example of this. XML got into the SQL standard around 2006, but a lot of relational databases in the early 2000s already had some support for XML. And what happens is, the standards body is not a bunch of randos. It's people from the different companies. So on the SQL standards body there's somebody from Oracle, somebody from Sybase, somebody from IBM, and they show up at the standards committee and each tries to get whatever proprietary thing they have into the standard. The best example recently: Oracle got their version of property graph queries into the SQL standard. They based theirs on Cypher, which is Neo4j's language. That's now the SQL/PGQ stuff in the standard. They got their extensions for graph queries in because they were almost ahead of their time. So that's how these things show up in the SQL standard. And if everybody has competing ideas for how something should be done, you end up with the lowest common denominator. Somebody could try to support everything, but then nobody exactly supports the standard. I'm not saying it's a good thing, but we live in a different time now, where there are so many different database companies and there isn't one company that owns the market and can bend everyone to its will. Like I said before, in the 1980s IBM was the huge company — IBM was the computing company — so whatever IBM said was considered the de facto standard.
And so that's sort of how we ended up with the SQL we have today. But there isn't a company like that now. The closest thing would be Google: they put out their standard of SQL called ZetaSQL — internally it's called something else — and they open-sourced a parser, the grammar file, and the spec for their version of SQL. Nobody uses it outside Google, and Google's huge, right? The closest you're going to get today is Postgres. A lot of these database companies, when they start out, instead of building the grammar file from scratch, go take the Postgres one, hack it up, and inject it into their system. That's what we did. And then DuckDB took our code and put it in DuckDB. A bunch of systems are based on the Postgres grammar because it's open source and they can use it. That's the closest you're going to get to a universal standard today. But again, I just showed you that FROM-first shortcut in DuckDB that's not in Postgres; they've adapted it. Yes. The question is, what's the point of having a standard if no one's going to follow it? I mean, there's a speed limit and everyone drives over it, right? No, so — I showed you SELECT statements that were slightly different from one system to the next, but you understood what they were doing, basically. For the nuances of the different systems, yeah, you may have to go read the documentation or ask ChatGPT what to do, but at a high level the concepts are the same. Just the specifics of each system are going to be different. Snowflake is a good outlier, actually. Snowflake started from scratch in 2013. They didn't take Postgres; they came up with their own grammar. So there's now a Snowflake SQL grammar that has things other systems don't support. If I was building a new database system from scratch today, I would not do what Snowflake did. It was a different time.
I would start with Postgres and then expand upon it the way DuckDB did. Keep going, because there's still a lot to cover. In the interest of time, I'm going to skip output redirection because you're not really going to need that for the homework. Let's jump ahead to window functions. All right, so before, the aggregations we showed were computing a one-shot calculation across the entire input set — the relation being fed into the aggregate function from the FROM clause. But there are also times where you may want what is called a sliding calculation. Think of it like a rolling tally: as you go from one tuple to the next while scanning along, you update some kind of aggregate, so that every single tuple you output from your SELECT statement carries a snapshot of that aggregate as of when that tuple was processed. It's like an aggregate function where you're not collapsing everything down into a single final output; every single tuple gets its own computation for that aggregation. So the way this works, you have a function here — this can be any of your aggregate functions, MIN, MAX, COUNT, AVG, as we saw before, as well as some additional ones. And then you specify the scope or range over which to compute this calculation — basically how to slice up the data and sort it. So let's look at some examples. I can use all the aggregation functions I had before, MIN, MAX, COUNT, and so forth. But then I also have these additional ones like ROW_NUMBER, which tells me what row my tuple is in my output, as well as RANK if I'm sorting them.
So if I have an ORDER BY clause, like ordering students by GPA, I can tell you what your position is using the RANK function. You can't do that with a regular aggregation function, because things get collapsed down. So in this example here, I can do SELECT star plus ROW_NUMBER() OVER with empty parentheses, because I'm not partitioning anything. And that'll give me output like this: for all my output tuples, it tells me where each one appears in the list. In the OVER clause, you can specify how you want to group tuples together when computing the window function, and you use PARTITION BY like a GROUP BY to say how to group them up. So for this query here, we're selecting the course ID and the student ID from the enrolled table, and I want the row number of each student record, but partitioned by course ID. So I'd get output like this: for every single course, it tells me for every student ID what position they are in that group, clustered like this. Then if you have an ORDER BY clause, you can control how the tuples will be sorted within a partition or within the window. So in this case, I can order the enrolled records by course ID. Now let's look at a more complicated example. We're going to find the student with the second-highest grade for each course. For this one, we're going to have a nested query, which we'll discuss in a second, but basically I have a SELECT statement with a FROM clause, and inside that FROM clause I have another query. Inside this inner query I'm doing a lookup on the enrolled table, and then the outer query can do filtering based on the output of this nested query. We'll cover nested queries in a second.
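The PARTITION BY behavior described above can be sketched in SQLite (window functions need SQLite 3.25 or newer, which modern Python bundles). The enrolled table, course IDs, and grades here are stand-ins for the lecture's demo data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE enrolled (sid INTEGER, cid TEXT, grade TEXT)")
conn.executemany("INSERT INTO enrolled VALUES (?, ?, ?)", [
    (1, '15-445', 'B'), (2, '15-445', 'A'),
    (3, '15-721', 'A'), (4, '15-721', 'C'),
])

# One row-number sequence per course: the count restarts in each partition.
rows = conn.execute("""
    SELECT cid, sid,
           ROW_NUMBER() OVER (PARTITION BY cid ORDER BY grade) AS row_num
    FROM enrolled
    ORDER BY cid, row_num
""").fetchall()

for row in rows:
    print(row)
# ('15-445', 2, 1)   -- the 'A' sorts first within its course
# ('15-445', 1, 2)
# ('15-721', 3, 1)
# ('15-721', 4, 2)
```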
So the first thing the inner query does is partition the tuples by course ID, sort them by grade, and compute the rank — their position in the sorted list of grades. And then in my WHERE clause in the outer query, I can reference the window function's computed column. So, any questions about this? Yes? The question is, can I recreate a window function using GROUP BY? Let's try it and see what you're saying. All right, so we'll do some Postgres. So again, SELECT star FROM the enrolled table, and then we get the row number of where each student appears. And then the second example was: we get the course ID, the student ID, and the row number partitioned by course ID, and then we order the output by course ID. So in this case we see that for each course — 15-445, 15-721, and 15-826 — here are the students enrolled in them, and this is their position within each group. And then my last example was this one, where you were asking whether you can do it with a GROUP BY. The inner query gives me the rank position of every record, and the rank is just where you are in the sorted output. Actually, let me remove this part here first. Right, so here's the output of the inner query with the RANK. For every course, I get the grades, order them by grade, and the rank is just the position within the sorted grades. And the rank can have repeats. So let me insert another record here: INSERT INTO enrolled VALUES — let's have Tupac take 15-721, and let's say he got an A. He's dead, he won't mind. And that's ordered by rank... but actually, that screws up the partition. Let me fix that, sorry.
Right, so here, for every single course, we're getting the grade along with the sorted rank. We inserted this record for Tupac with an A, but there was also another student who got an A in the same class, and therefore they both have the same rank position of one. And the student that got the C has rank position three. So RANK can have duplicates; ROW_NUMBER will not. So yeah — you're proposing to do what? A GROUP BY where? In the inner query? Sorry. The question is whether it's possible to recreate the same query using GROUP BY. You wouldn't be able to get the rank, right? Because you wouldn't be able to get your sort position — there isn't a concept of that in SQL. ROW_NUMBER is interesting because — it's calling it rank here, but trust me, it's row number — again, it's bag algebra. There is no sort order in these relations. And that's a weird concept coming from programming, like, what do you mean there's no ordering? We're used to programming on something like x86, where there's an ordering to how memory operations occur. There isn't any of that here. Everything can be unordered. So without a window function, you can't get a row number, because there's no way to ask where a tuple sits in the position of the output. Oracle does have ROWNUM; they sort of hide it from you, but you can get it — that's just an Oracle thing. So window functions allow you, in addition to the averages and all the other aggregates, to get the order of things in a way you would not otherwise be able to do.
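The tie behavior from the Tupac demo above — RANK repeating on equal grades while ROW_NUMBER stays distinct — can be reproduced in SQLite. The data is a minimal stand-in for the lecture's table:

```python
import sqlite3  # window functions need SQLite 3.25+

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE enrolled (sid INTEGER, cid TEXT, grade TEXT)")
conn.executemany("INSERT INTO enrolled VALUES (?, ?, ?)", [
    (1, '15-721', 'A'), (2, '15-721', 'A'), (3, '15-721', 'C'),
])

# Two students tied with an 'A': RANK() emits 1 twice then skips to 3,
# while ROW_NUMBER() gives the tied rows distinct (arbitrary) positions.
rows = conn.execute("""
    SELECT sid,
           RANK()       OVER (ORDER BY grade) AS rnk,
           ROW_NUMBER() OVER (ORDER BY grade) AS row_num
    FROM enrolled
    ORDER BY row_num
""").fetchall()

print([(rnk, row_num) for _, rnk, row_num in rows])  # [(1, 1), (1, 2), (3, 3)]
```

Note that which of the two 'A' students gets row number 1 is unspecified: that's the bag-algebra point from the lecture.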
So I should have shown nested queries before, but let's go through them in more detail. Nested queries, sometimes called subqueries, are a really powerful concept that allows you to have a query inside of another query — you can have multiple queries inside an overarching calling query. You need this because it would be difficult to express certain computations without nested queries — without taking the data out of the database, doing some computation, and putting it back in. So it allows us to compose more complex logic than we would otherwise be able to. And these inner queries can appear almost anywhere inside a SELECT statement — actually, almost any query: in the SELECT output list, the FROM clause, the WHERE clause; you can put them in UPDATE queries and DELETE queries. And they can reference all the tables within the enclosing query. It's a very powerful construct. So the basic idea is something like this. Here we're selecting the name from the student table, and I want the names of students that are enrolled in at least one course. The SELECT statement at the top is called the outer query, and this inner part here we call the inner query. Now, nested queries are notoriously difficult for database systems to optimize. The stupidest way to execute this query would be: for every single tuple in my student table, rerun the inner query — get the list of all the student IDs, then compute the IN. The right way to execute this example is just a join. This one's easy because you're looking for this value to match something here, so you can convert it into an equality predicate.
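The IN-subquery pattern just described looks like this in SQLite. The student and enrolled tables and their rows are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (sid INTEGER, name TEXT)")
conn.execute("CREATE TABLE enrolled (sid INTEGER, cid TEXT)")
conn.executemany("INSERT INTO student VALUES (?, ?)",
                 [(1, 'Tupac'), (2, 'RZA'), (3, 'GZA')])
conn.executemany("INSERT INTO enrolled VALUES (?, ?)",
                 [(1, '15-445'), (2, '15-445')])

# Inner query produces the set of enrolled student ids; the outer
# query keeps only students whose id appears in that set.
rows = conn.execute("""
    SELECT name FROM student
    WHERE sid IN (SELECT sid FROM enrolled)
    ORDER BY name
""").fetchall()

print(rows)  # [('RZA',), ('Tupac',)] -- GZA has no enrollment, so he's dropped
```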
Things get more complicated when there's a non-trivial relationship between the inner query and the outer query. We'll come back to that later in the semester. But this is, again, one of the hardest parts of database systems. The only system that does nested queries correctly is a system called Umbra, which is an academic system out of Germany. DuckDB does it correctly now for two reasons: one, because they copied what Umbra did — it's in papers, it's not like they stole the ideas — and two, we also sent them patches last semester in 15-721. So we fixed it for them so they can do some of these nested queries correctly, at least with lateral joins. So DuckDB is probably the best implementation of this. A lot of times there's just a bunch of heuristics and hacks; again, we'll cover this later. MySQL was always the worst, though it's gotten much better. Right. So here's a query like this: we want to get the names of the students enrolled in 15-445. We have the outer query that says we want to get the name from the student table, and then in the WHERE clause we want the logic that gets us the student IDs of the set of people taking 15-445. So this is a way to think about how to construct these: start with the outer query, the overarching computation of the output you want, and then figure out what the inner part needs to be separately. In this case, we can convert that English description into a nested query like this. But now we need to reference it — do the check we want — in the WHERE clause of the outer query, using that IN clause we had before. So here we see that the first student ID, in the WHERE clause of the outer query, is referencing the student ID from the outer query's table.
But the second student ID, in the inner query, is referencing the student ID in the enrolled table. The parser in the database system is smart enough to recognize the context in which a column is referenced to know which table you mean. In cases where it can't disambiguate two things with the same name, it'll throw an error and make you qualify the column with the table it's coming from. So there's a bunch of operators you can use with nested queries in WHERE clauses. There's the ALL operator, where every row in the nested query has to satisfy some constraint. There's ANY, also spelled SOME, where at least one row must match the subquery. The IN clause is what I showed before; it's the same thing as = ANY. And then EXISTS just means I want to know there's at least one row being returned, but I don't actually care what's in it. So I can rewrite the example from before to use = ANY instead of IN, and it's equivalent. And we can show real quickly how Postgres picks different plans for this so you can see how it's actually being executed. Right, so here's our query. We have RZA and Tupac taking the class. In SQL you can put this EXPLAIN keyword in front of any query, and if the system supports it, it'll come back with the query plan and tell you what operations it would execute if it actually ran the thing. So when we run that, we get something like this. Think of it as a tree structure: these are the leaf nodes, and it builds up to the final output at the top.
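One wrinkle worth showing in code: SQLite rejects = ANY (as the demo later in this lecture discovers), but you can always express the same membership test with a correlated EXISTS. A sketch with made-up tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (sid INTEGER, name TEXT)")
conn.execute("CREATE TABLE enrolled (sid INTEGER, cid TEXT)")
conn.executemany("INSERT INTO student VALUES (?, ?)",
                 [(1, 'Tupac'), (2, 'RZA')])
conn.execute("INSERT INTO enrolled VALUES (1, '15-445')")

# Equivalent to "WHERE sid = ANY (SELECT sid FROM enrolled)":
# keep a student if at least one matching enrolled row exists.
rows = conn.execute("""
    SELECT name FROM student AS s
    WHERE EXISTS (SELECT 1 FROM enrolled AS e WHERE e.sid = s.sid)
""").fetchall()

print(rows)  # [('Tupac',)]
```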
So this is telling us that Postgres wants to do a sequential scan on the enrolled table and then hash it, because it's doing a hash join up there — we'll cover what a hash join is later on. Then it does a sequential scan on the student table and matches the student ID with the enrolled table's student ID. So Postgres was smart enough to convert this nested query into a join, which is always going to be the fastest way to execute something when you have these kinds of references. We can try the same thing in MySQL, but their EXPLAIN output is terrible. There's a way to get something a little bit better — you've got to append some extended option; I forget the syntax. In SQLite — I don't think you can do this. Oh, you can do this. Hmm, SQLite doesn't like the SELECT statement. Surprising, right? Why doesn't that work? Let's see if DuckDB does it. They give you the output. DuckDB gives you very pretty little trees. You guys are easily amused — this impresses you, Unicode output for EXPLAIN? Oh my gosh, all right. But yeah, it shows you what the physical plan is. And then we can try it in Oracle. Getting the plan out of Oracle and SQL Server is a huge pain; we'll do that another time. But I'm actually surprised that SQLite doesn't support that SELECT. I'm not going to debug this live. Yeah, I don't know why it doesn't like that. Let's try IN. Ah, there we go. It didn't like = ANY; it liked IN. All right, so in SQLite, if I run EXPLAIN, I get this. The way SQLite works, which is genius, is that it converts the query plan into its own DSL of opcodes, and it has its own VM that runs the opcodes. Think of the JVM: you take Java code, convert it into Java bytecode, and then the JVM executes or interprets it.
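SQLite's two EXPLAIN variants can both be poked at from Python: EXPLAIN QUERY PLAN gives the high-level plan, while bare EXPLAIN dumps the opcode program that its VM interprets. A quick sketch on a throwaway table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (sid INTEGER, name TEXT)")

# EXPLAIN QUERY PLAN returns the high-level plan tree...
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM student"
).fetchall()

# ...while bare EXPLAIN dumps the opcode program that SQLite's
# virtual machine actually interprets, one opcode per row.
opcodes = conn.execute("EXPLAIN SELECT * FROM student").fetchall()

print(plan)          # plan rows mention a scan of 'student'
print(len(opcodes))  # a handful of VM opcodes, even for this tiny query
```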
That's what SQLite does. All right. We'll discuss query compilation later in the semester. In Oracle you've got to put EXPLAIN PLAN — there's some syntax to get the real plan, but trust me, it's there. All right, so they all do something slightly different, and if the system is smart, it can try to convert the nested query into a join. All right, we'll skip this in the interest of time because I want to get through lateral joins and CTEs. So lateral joins are a newer concept, and not all systems support them. The basic idea is that a lateral join allows a nested query to reference data in another query that is adjacent to it. Normally, if you have two nested queries side by side, one can't reference what's inside the other, because it doesn't know about what's inside of it. But with a lateral join, you can do this. You can almost think of it like a for loop: for every single tuple produced by the outer part, you can run some query, do some computation here. So in this simple example, I have two nested queries. I have a SELECT 1 AS x, which returns a single tuple with one column whose value is one. And then my lateral join can reference the output of that first query and just do plus one on it. So I get one and two that way. Without LATERAL, you can't do this, because these would be treated as two completely separate queries. We can try this in Postgres real quickly: SELECT star FROM an inner query, SELECT 1 AS x, aliased as t1. I get back a single tuple with one in it. But if I try to put another nested query next to it — say SELECT 2 AS y, aliased as t2 — I get the Cartesian product, but I can't reference inside the first one. I can't write t1.x plus one, because the second query doesn't know about t1; those two queries run separately. If I add the LATERAL keyword, now my second nested query can reference whatever is in the first one.
And you can chain these things together as many times as you want. Let's go back quickly and see what the query plan for this one would be. In theory, it should convert this to a join. All right, it did a shortcut — basically it says, I know what the answer is, I don't have to run anything, and it just spits out the answer. Like SELECT 1 plus 1: it knows how to compute that without running a query. All right, let's look at a more complicated example. Say I want to count the number of students enrolled in each course, and I also want the average GPA of all the students in that course. And yes, you can write this without using a lateral join; I just want to show you how to do it with one. So there are two nested queries with the LATERAL keyword, plus the SELECT statement on the outer part. For every single tuple in the course table, I want to compute the number of enrolled students, and then, again for every course, I want to compute the average GPA of the enrolled students. In the first lateral query I compute the count — inside it, I can reference what's in the outer query, the adjacent query — and the second one down here is the same thing. Now, I'm not showing it in this example because it's contrived, but in the second lateral query I could also reference what was in the first one. These things can get chained together. And again, this is a different way to think about SQL, because SQL is unordered. We're not specifying the order in which the database system should execute anything.
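As the lecture notes, this per-course count-and-average query can be written without LATERAL. Since SQLite has no LATERAL keyword, here's a sketch of the equivalent using correlated scalar subqueries in the SELECT list; the tables and data are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE course (cid TEXT, name TEXT)")
conn.execute("CREATE TABLE student (sid INTEGER, gpa REAL)")
conn.execute("CREATE TABLE enrolled (sid INTEGER, cid TEXT)")
conn.executemany("INSERT INTO course VALUES (?, ?)",
                 [('15-445', 'Databases'), ('15-721', 'Advanced Databases')])
conn.executemany("INSERT INTO student VALUES (?, ?)",
                 [(1, 4.0), (2, 3.0), (3, 2.0)])
conn.executemany("INSERT INTO enrolled VALUES (?, ?)",
                 [(1, '15-445'), (2, '15-445'), (3, '15-721')])

# For every course tuple: the enrollment count and the average GPA of
# its students -- each correlated subquery references the outer c.cid,
# just like the lateral queries reference the outer course table.
rows = conn.execute("""
    SELECT c.cid,
           (SELECT COUNT(*) FROM enrolled e WHERE e.cid = c.cid) AS cnt,
           (SELECT AVG(s.gpa) FROM student s
             JOIN enrolled e ON s.sid = e.sid
            WHERE e.cid = c.cid) AS avg_gpa
    FROM course c
    ORDER BY c.cid
""").fetchall()

print(rows)  # [('15-445', 2, 3.5), ('15-721', 1, 2.0)]
```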
We're just telling it the computation we want performed to get the answer we want. So the database system can decide: do I want to rewrite this as a bunch of joins and execute them all concurrently, or do them one after another? We could test what Postgres does real quickly, but I don't have it copied here — sorry, you can try it online later. All right. The last thing I want to show you is common table expressions. CTEs were added 10-ish, 20-ish years ago, and they're sort of similar to nested queries, or to writing data to a temp table. It's a way for us to specify a query that gets — maybe "materialized" isn't the word I want — a query that can be stored, in quotation marks, as some virtual table, and then another query can reference whatever's inside of it. So in my really simple example here, I have this WITH clause. I give my CTE a name, then I have the AS keyword, and whatever SELECT query is inside the parentheses gets bound to that name. And anything that comes below, after the WITH statement, can reference it as if it were a table. Essentially the AS clause is binding a name to whatever's inside my WITH statement. So here I have a table-less query, SELECT 1, 2, which produces one tuple with a column of value one and a column of value two. But in my WITH statement up here, I can give names to those columns, which can then be referenced down below in the query. You can do weird things too. Postgres will actually let you name two columns the same thing — but then when you try to reference one of them below, it'll throw an error.
So again, this is an example where the syntax is roughly the same but the semantics can differ across systems. So here's how we actually want to use it. For this one, we're going to find the student record with the highest ID that's enrolled in at least one course. We could show how to do this with nested queries or with joins, but now we'll do it with a CTE. Inside the CTE, the first thing we do is compute the max student ID from the enrolled table. Then in my SELECT statement down below, I can reference my CTE, get that max ID, and do my join on that. Again, the database system should be smart enough to realize that it only has to run this CTE once, materialize it, and then reference it as if it were a temp table in any query below that uses it. Any questions about CTEs? Okay, just to finish up. All right, so hopefully the main takeaway from all this is that SQL is not a dead language. There are a lot of cool things you can do with it. You want to do as much computation as you can within a single statement — that can be nested queries and all sorts of other weird stuff inside it. We want to avoid the round trips back and forth between the client and the server, because the database system should, in theory, be smart enough to know the best way to execute the query you give it. As soon as you take data out of the database and run some Python code on it, that code is obviously outside the purview of the database system, so it can't optimize it. If you keep everything inside the database system, it can make a good-faith effort to optimize it further. The other main takeaway is that there is a SQL standard, but nobody follows it exactly. Every single database system is going to be slightly different.
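The max-student-ID CTE example above works the same way in SQLite. A sketch with made-up tables and data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (sid INTEGER, name TEXT)")
conn.execute("CREATE TABLE enrolled (sid INTEGER, cid TEXT)")
conn.executemany("INSERT INTO student VALUES (?, ?)",
                 [(1, 'Tupac'), (2, 'RZA')])
conn.executemany("INSERT INTO enrolled VALUES (?, ?)",
                 [(1, '15-445'), (2, '15-721')])

# The CTE computes the max enrolled student id once; the query below
# references it by name as if it were a table.
rows = conn.execute("""
    WITH max_enrolled (max_sid) AS (
        SELECT MAX(sid) FROM enrolled
    )
    SELECT s.sid, s.name
    FROM student s JOIN max_enrolled m ON s.sid = m.max_sid
""").fetchall()

print(rows)  # [(2, 'RZA')]
```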
People claim that it's great if you support SQL because then your application is portable: if I've written my application on MySQL, I can easily port it to Postgres. That is not the case. Oftentimes, whatever database system you pick at the beginning is what you're going to be stuck with for a long time, because it's non-trivial to move over. Last thing: homework one will be out today. It's going to be writing SQL queries to do basic data analysis. This year, we're going to require you to do it on both SQLite and DuckDB. The reason is that you'll write the same query — the syntax will be slightly different, but it won't be too bad — and you'll run it in SQLite, and you'll run it in DuckDB, and you'll see which one's faster. Then you'll have this epiphany: oh, one of them is much faster than the other one. Everybody want to guess which one's going to be faster, and why? What's that? It's more efficient? That's just restating the question. Part of the reason, maybe. It's not because it's based on Postgres. DuckDB does not interpret — no, that's not the answer either. Yes, in the back? One last shot. That's not the reason. So you'll run these queries, DuckDB should be faster, and you'll be like, okay, why? That's the rest of the semester. This will be it for SQL. Next class, we'll actually start talking about how you build a system. Hit it. Black leather, black suede, Timberlands, my all black, dirty haters, send you to the Pernigate. You get gizama trying to skate, and that's your first mistake. I ain't lying, for that cake, your fam, I see your weight. My granddad's had me wait and ran through every stake. When he asked me how I'm living, I tell him I'm living great.