 I think on the course website the schedule is an advanced sequel but I've toned it down a little so it's just right in the middle where we wanted to be. A few just administrative things before we get started. So the late policy, what was on the course website didn't match what I said in the first lecture so this is the new official late policy and the course website should be updated and like this. You have a total of four late days for projects only. So not for homeworks, homeworks get no late days, you have four late days for projects only where you have no penalty extension. So for example this is just again reiterating things I said in the previous lecture but you could for example turn in one project four days late or you could turn in four projects each one day late and you won't be penalized on them. Again late days are rounded up to the nearest integer so if you're four hours late then it's going to count as one full day late. And again if you hand in any homeworks late or you've used up all of your extension days with projects you'll lose 25% of the assignment grid per day and then after four days it'll be a zero. You can allocate the late days so I think the instruction should say that you just have to specify that you're using a late day on the project submission. For office hours, again last time I told you we didn't have full clarity yet. We've added the TA office hours to the website. Some of them will be in person and some of them will be remote on Zoom. So whatever you feel most comfortable with feel free to choose between those two options as you see fit. For the in-person office hours we're still trying to figure out the exact physical locations where those will be. So we're working the schedule those but we're going to finalize all office hours. Everything will be finalized on the website by the end of this week. And again the plan is to have most or all of the days that we covered with more availability on the few days leading up to a big project deadline. The one last thing I want to mention, some of you, I think that's about slides being available before the lecture. So you can annotate them or take notes directly on them or whatever. I will try to starting next week, I haven't done it for today, but starting next week Monday I'll start posting them ahead of the lecture so you can gamma them and reference them right on them. So that's all the administrative stuff. Are there any questions about anything there? So now let's move on to the fun stuff content. So relational languages, last time we talked about the relational model and relational algebra. And the key idea was that we wanted to avoid having to tell the database management system, DBMS, exactly how to execute a query. If you want to say okay, you have to follow these steps to execute a query. Rather what we wanted to do was provide a high level declarative specification for the query so that the DBMS would be able to figure out how to do it for us. So essentially saying what it is that you want, the answer is that you want rather than how to get the answer. So just as an example, imagine you wanted some output that was sorted. You don't necessarily want to have to tell the DBMS okay, go do merge sort or go do a quick sort or whatever kind of algorithm. You just want to say hey, get me some of the data sorted and let it figure out how best to do that for you. So the main advantage here is again, you don't have to get bogged down in a little details, but also the DBMS can perform different operations or use different algorithms depending on your data data. So if your data changes, you don't have to rewrite your application. For example, in the Python code that we saw, CSB example, if you wanted to use a different algorithm, you don't have to go and modify that in the application level code. So you don't need to do that anymore. The DBMS can kind of take care of all that behind the scenes for you. And the way that the DBMS is going to sort of figure this out is through a pretty sophisticated process called query optimization. So there's a piece, a specific component of the DBMS software stack called the query optimizer that's going to take a query that you write and then figure out essentially the best way to execute it. And it's a really detailed process. There are a lot of algorithms and it's basically a search process where you're searching for whole different equivalent ways of X being length of the line, the most efficient one. And I think we're going to talk about that in a few weeks, or actually in a few weeks after the midterm probably. So again, this is one of the most difficult and time-consuming parts of building DBMS. This is where a lot of the engineering effort goes in the development process. So today is going to be about SQL. This is just a little bit of history to give a little background. SQL isn't a new language. It goes all the way back to the 1970s. Remember, Ted Cod released a paper about the relational model and he was a mathematician and an earlier programmer. So he was going to implement the system. It was just kind of a high-level theoretical way of thinking about the problem of database management. So the IBM group in San Jose decided that they were going to try to build a prototype DBMS based on Cod's ideas. And the system that they ended up with, it's one of the first relational database management systems was called System R. So specifically for the query language piece, which is how you interact with the database or query the data, these two guys Chamberlain and Boyce came up with a query language for it. And the first one was called Square, which is specifying queries as relational expressions. I don't know if it was very catchy since it didn't go anywhere. But their second attempt at that was called Structured English Query Language. It was spelled out SQL and that's kind of where we get the SQL pronunciation frame. So that was developed in 1974 for IBM's System R prototype. So again, some people say SQL, some people spell it SQL. Kind of the reason that came about was because the IBM team got sued. There's already some other language or something out there spelled out SQL, so they kind of abbreviated it to just SQL Structured Query Language. So again, either pronunciation is fine. I generally say SQL, but like I said, I'll know what you're talking about if you say it the other way. So System R was never released. It was just a research prototype. They wrote a bunch of papers about it. They got published, but there was never like a publicly available release. And what ended up happening was IBM released a series of databases, commercial database management systems over the next few years. And probably the longest lasting and most impactful one was IBM DB Tube. So the kind of SQL ANSI standard was released in 1986 and ISO in 1987. And again, it's abbreviated or abbreviation for an Admin for Structured Query Language. And this is kind of where it started and has gone through the specification process today. Like I said, there are sort of these re-releases updates of the SQL standard. It's not a dead language. It's been progressing and seeing these major releases kind of every few years, even though it's 40 years old, over 40 years old by now. And the people who are in charge of kind of releasing the standard, you can kind of see as certain technology trends when in and out of favor, they got added to SQL standard. And I don't know, it's unbelievably huge now to go through, compared to where it started in the 80s. But all of these things, like most recently JSON, which is popular now in web programming, web development, that got added recently to the standard in 2016. But the core structure of language has not changed. It's been pretty much the same since the initial specification in the 80s. But there is one catch, which is that none of major database management systems, the commercial ones, the open source ones, none of them really follow the standard. So it's called the standard, but people diverge pretty heavily. So the core piece that you need, the minimum language syntax that you need to be able to support, to say that your DBMS support SQL is a SQL 92 standard. But beyond that, there are a lot of kind of small details or ways that different developers have implemented things differently. And we'll see some examples in lecture today and get kind of confusing. I mean, you think that SQL is SQL, but it turns out there are a lot of tricks or caveats to watch out for in practice. Okay, so at a high level, SQL is broken down into three main parts. The data manipulation language, the data definition language, and the data control language. So the data manipulation language is responsible for retrieving and modifying data. So if you want to query a database or update some values in the database, you're working with DML commits. The data definition language is how you specify objects in the database. So for example, if you recall the music store example, you have the artist table, the albums table. If you want to specify those tables in the database, you're going to use DML commands to create the table. And it goes beyond just table definitions and create all sorts of things, indexes, triggers. And there are all any database components, any database object you can create is specified with DML. Finally, the data control language is used for security access control, specifying which users can view or update which parts of the database. So it's a way of limiting access to certain database objects. Okay, so there's also all these other features you can define views, you can materialize results. Integrity and referential constraints are something that is important. We'll talk about later in the course in particularly transactions. So if you have multiple concurrent programs or applications that are accessing a database at the same time, as I said, you want the DBMS, as I said last class, you want the DBMS to kind of manage those concurrent operations to make sure that you don't have lost rights or lost updates or everything sort of stays in the system. So transactions are an important topic we're going to talk about later in the course. And I think I alluded to this last time, but there's an important difference between SQL and kind of the relational model that we talked about. And that is that SQL is based on bag algebra or multisets. So that means you can have duplicates in your data set, whereas a set obviously means there are no duplicates. So if you think about kind of the free sort of data structure options, you have lists which allow you to have duplicates, but there's a defined order to the list. So you can, you know, end things at the end where you can find the element in a list. Sets mean you can't have duplicates, but there's no order, they're completely unordered. So there's no relationship between the relative position of objects in the set. And bags can have duplicates, but they're like sets, they're unordered. So importantly, just keep in the back of your mind that SQL is based on this bag algebra or multiset algebra rather than sets. So today's kind of agenda for covering the different parts of SQL. We're going to walk through each of these different things. So aggregations and group buys, certain operations for manipulating strings, dates and other time-based operations, output control or redirecting where you want query results to go, nested queries, which means you can have arbitrarily nested queries, common table expressions, which I think are going to be important for the some homework questions, and window functions which are also homework. Oh, sorry, the other thing I forgot to mention in the class. The first homework assignment was released. It should be on the website now. It's a SQL assignment. Take a look. Based on today's lecture, you should be able to answer all of the questions in the assignment. Okay, so the example database that we're going to work with for today's, all the examples in today's lecture is sort of this student course database. Okay, so the student table, which has a student ID, just a random number. The name of the student, the login that they use to get into the computer network, AH and GPA. And the courses are just different courses with the ID and the name of the course. They're all data related because that's all we care about. And then the enrolled table is going to map students to particular enrollments in those courses. So you'll see that in the enrolled table there's a student ID, the course ID for the course that they're enrolled in, and the grade that the student received for that course. So before we move into the more advanced stuff, I just want to go over the basic syntax. And today I am going to show a live demo of this, just a free life advice. The first rule of public speaking is never do a live demo, especially if you're in charge of like a multi-billion dollar company. But I think this is just in terms of databases, so it should be fine. So let me switch over to the demo. Okay, looks good. And I'm going to type on this computer-resist services and variables type on. So I have three dbms instances running on the machine. One is a Postgres or PostgresQL. Another is MySQL and the third is SQLite, which I mentioned last time. SQLite is the one that's probably the most deployed dbms in the world. It's on phones, web browsers, operating systems all over the place. So just to show you that this works, I hope it works, but we should be able to see. So this query is just very simple. It's just going to get all of the records, tuples from the student table. You can see it there. We've got them all back. Let's try the course one. Okay, there are all the courses. And finally, the enroll table. Great. Okay, everything's working. So we can switch back to this now. So the basic syntax, as we saw kind of a preview of last class, the basic syntax is to have these three pieces of the SQL query. So the select statement is where you list the columns that you're interested in seeing the columns or attributes. The from statement is where you specify the tables that you want records to come from. And the where statement is where you specify the selection or filtering predicates that you want to apply. So just as a, again, a really simple example, if you want to get the names and GPAs for all the students who are older than 25 years old, we would write it out as select name comma GPA. So that gets you those two columns from the student table where age is greater than 25. So that's your selection bracket. And again, if you think back to the relational algebra that we saw, the first statement, the select is equivalent to projection. So you're projecting on name GPA and bottom statement where is equivalent to the selection operator where your age is greater than 25. And again, I didn't come up with the names, but the select statement in SQL maps to the projection operator relational algebra and the where statement in SQL maps to the selection operator in relational algebra. And if we, there's a way I think if I swap out of this, it's going to throw me back to my slides, but we'll figure it out. Okay, so again, this query just seen and executed by Postgres. We paste it in there. Selects name GPA for the student where age is greater than 25. We'll get an answer. Everyone is greater than 25. Okay, so, okay, so, so I think in my opinion, the most conceptually difficult thing and it seems, it seems simple to a lot of people who work in this life or have studied them before. But the most conceptually difficult thing to me to understand as a new student is joints. So if you remember the relational algebra operator, you have the join from last class, and we can use that to combine two poles from two tables or relations that match on some particular key. So just as an example, here, if we want to get all of the students that got an A in this 15, 7, 21 worse. So just looking at the query, we have the same select statement. And what I've done in the from statements is I've listed both the rolled table and the student table. And I just gave a little nickname or it's called an alias, just to make it typing easier. So the rolled table is alias as E and the student table is alias as S. And again, the, the, the where clause is, is similar. So we want all of the students where their E dot grade is an A and the course ID is 15, 7, 21. And in this, the new piece I've added to do the join, the E dot S ID, so the student ID part of the enrollment record matches the student S dot S ID. So what this is going to do is going to join the student table to the enrollment table, filter by only the students that got an A and only the students in that class, and then return the names of those students. So does this, does anyone have any questions about joins? Since I know it's, it's kind of a difficult kind. You've seen it before, maybe it, it makes sense, but I remember when, when I was learning this stuff, I think this was where 99% of the confusion came from. So there are, there are many more examples. Yeah. Yeah. So the way that the query is parsed by the system is it starts with the from clause piece. So it'll look at the tables that you are, want to access. And since it starts with that piece, it applies the alias upfront. So then when it looks at the other parts of the query, the where clause and the select clause, it knows that what, what the alias is. Yes. What happens if you don't have the and E dot S ID equals S dot S ID? That's a good question. Do you mind if giving it to them? Sure. If, if you do not include the and E dot S ID equals S dot S ID, what will happen? So, well, we can actually try it. I have the demo set up here. So let's hop over. So it's like S dot name from enroll. It's a case sensitive as a student as S where, where E dot grade equals a. And E dot C ID equals 15, 721. And, oh, sorry. I don't know what happens without that. Okay. So what's going to happen is we're going to get all of the students back. So why is that? It's because it's going to go and look for any time that a student ever received an A. And it's going to return the name of the student. So there's, if there's no filtering for the enrollments, it's not going to be able to check to see if the student received an A in that course. So in this case, if there's ever a student that ever received an A in this case, it's going to return the name of the student. It makes sense. All right. So let's switch back here. Great. Okay. So any other questions about joints? The question was without the, the joint condition that matches the two IDs, does it produce a Cartesian front? Yes. So if, if there's no joint condition, it's going to get, it's going to perform the filter. The two filters here just on the enrollment table. And then it's going to match every student with, with the, the filtering enrollment. Yes. In that case, if there were multiple, with, with, each of them a multiple line? Or with, like, both languages? Okay. So the question is, if there were multiple students who got A's in the course, would it list each of the student's names? Multiple times. And the answer is yes. So if, if we go back and look at the, the definition, I think there was only the, the data definition, the insert statements. There was only one student who got a name of course. So if you join all of the student names with that one record from the enrollment table and A in it, you get three, three. But yes, if there were multiple, it would repeat them multiple times. So aggregates. Aggregates are functions that return a, that they're going to return a single value from a bag of tools. So what you can think about is if, if you have a relation or a filtered relation, you can aggregate across several tools. So there are these different predefined operators. Average returns the average value of a particular column. Min returns the min. Max is the max. Some returns, some you add up all the values in the column. And count obviously returns the number of tuples that are in the relation. Some, there are some systems that let you define custom aggregates. You can kind of make a drawing. There are others that aren't listed here that some system supports and done. But these, these are kind of the main important ones that probably get used most. So the aggregate function can almost only be used in the select output list. And by almost, I mean there are a few exceptions, but pretty much it's always going to show up in the select statement. Of your query. So just as an example query here, let's say that you want to get the number of students with the, with the, this, a login as the substring at CS. So I'll explain what the, what the string magic is in a second. But basically, if you recall what the input data looked like everyone had at CS user name, but imagine there were other students from other departments or something that had different subjects for their username. So the count function is going to go in the select pause. And we're going to say we want to count the number of logins and we can alias it as a CNT output there from the student table where the particular filter selection condition holds. So what we're doing here is applying where the login, the login name string is like similar to and we'll cover this in more detail on these slides. Basically it's just performing a regular expression pattern match on the end of the string. So the percent symbol and we're used to, you know, other, other wild card operators like a star or something, the percent symbol and SQL, let's use search for pattern. So you specify where the login is like percent at CS. This is going to get solved the, the count of students with an at CS login. Now this is one way to write it. There are a few others. For example, you could just say count star. So star, as I showed in the beginning, is like a, gives you all of the columns for a particular table. So we don't really care about necessarily the count of specifically the login column. We're just interested in the count of the number of student tuples or records that match the criteria that we need to be specified in the where plus. So you're free to just say, you know, count star, just to be count star of the result. And there's another way you just say, you know, count one. So count one, just putting kind of the value one in there. We're saying, you know, count up the number of records that this query returns. So for every time that we find a student that matches this filter condition, we're just going to create some one and then we're just going to count up the number of ones that we see. So the question might be, does the DBMS care what you do? The answer is no. The DBMS is going to be smart enough to figure out just during the optimization phase that all of these things pretty much need the same thing. You don't really need the login or the star operator. You get all the columns in your output. All you care about is the count. So all it needs to do is go through the relation and have the number of tuples that your report produces. So the question is, if you're doing the pattern matching and you have the at symbol that you're looking for. If you didn't have the at symbol in the pattern, do you still need the percent? You do still need it. So if there were no percent symbol, and you just had, for example, at CS, it's going to look for an exact match of the string of characters at C and S. If you want to find any string that ends with at CS, then the percent symbol is just like a little card operator that lets you search for an integer pattern. Okay, so we can also apply multiple aggregates in this like clause. We're going to just have to do one. So this query, for example, is going to get us the average GPA for all of the students, as well as the count of the students, again, where the login is like percent at CS. So if we run this, we're going to get back is a result that has the average GPA computing 3.8 in the count. The number of students that need this selection criteria. We can also throw this distinct keyword in there. So remember, I said that there's a sequel that is a baggy algebra, so you can end up with duplicates. So if you just want to know, for example, the number of distinct elements in any of these aggregations, so for example, count, get the number of unique students that have an at CS in their logins. So you might get the count of distinct logins. Now, this query maybe is a little nonsensical because I hope that everyone has a distinct login. If you have duplicates, there's probably an issue somewhere. But you could imagine there was a different, maybe it's the first name of the student. They're duplicating it into an account of all of the distinct first names. So this distinct keyword would like to do that. We have this query here. It's essentially going to get us the average GPA of students enrolled in each course. So if we run this query, what's maybe it seems like it makes sense to say select average of the student GPA. We provide the enrollment course ID because we want to know for each course, give us the average GPA. And then we do the join between the enrolled table and the student table. This seems like it's going to work, but this actually isn't. So if we have this value here, the reason is what does this mean? So we saw that the aggregates are going to return a single aggregated value, in this case the average GPA. But it's going to return us this single value of 3.86. What does that mean? And it's going to... The course ID is undefined because we're trying to take all of these students from all these different courses, smoosh them together into one GPA and then which course ID we can pick from the list that we have. So if I show you what's going to happen here in an example, and then postgres, and we're going to get this weird value. So it gives us 3.86. The average of... Oh, sorry. This one here. So it's going to compute the average for us in the previous one, but as soon as I add this e.cid to the query, it's going to throw an error here. It's going to say that the e.cid, it doesn't know what to do with that because it doesn't know how to form the grouping for the aggregate that it's computing. And I think if we do this, I can switch between... This one was trained in MySQL. Let's see if MySQL likes it. Nope, they get the same error, so they say that there's a value or a column specified in select clause. It doesn't show up. We have a case we're trying to do, and then SQLite is the last one. Let's see if this works. Okay, so SQLite doesn't have a problem. It gives us 3.86, which is what we saw as we didn't include the other piece, but it shows us 15.445. So what does that mean? I don't really know. So it's just picking a random course ID to fill in there because it doesn't know what the semantics are of the grouping for the aggregation. Yes. So the question is, is the value 15.445 from the first line potentially in the Cartesian product? Right, yes. So is the value 15.445 potentially from the filtered set that produces the Cartesian product? And the answer is yes, it could be. The value is undefined. So clearly their implementation allows you to do this, but the way that they provide that second value is undefined. So it could be coming from that. It could be maybe going to hash table to do lookups or something. It could just be the first value in the hash table. So the question is, will it always return the same course ID apparently? I don't know, probably, but yes. So what I would guess is happening here is that there's a deterministic way that it's executing it. So maybe it has the values of the students and the course is sorted or something, and it just goes through them and it returns always the same value. So there's no, if maybe I loaded the data in a different order or if I added different values, there's no guarantee that it's the same value because it's undefined. It's not a well-formed query. Okay, so if we want to fix this, what we're going to do is we're going to have to introduce a new operator called the group by operator. And what the group by operator is going to do is it's going to project tuples into subsets. So we have the big, you know, multi-set, the relation that we have, and we're going to project tuples into individual subsets in order to calculate aggregates against each of the subsets. So like the example where we wanted to get the average student GPA per course, what we're going to do is again include the course ID in the select part of the query and then add it to the new group by clause that we have at the bottom. And kind of a visual example of this is here we have the relation that's produced from the join. And what we're going to do is perform the grouping. So each of these shaded groups there is going to be grouped by the course ID. So all the 15, 7, 21, 15, 8, 26, and 15, 4, 4, 8, 5. They're going to be grouped separately before we compute the average. And you see now we get a correct query result where the averages, the computing averages are grouped correctly by a force. So the rule is, that you have to remember for this, is that non-aggregated values in the select clause must always appear in the group by clause. So we ran into that problem where the course ID didn't appear in the group by clause. I've got another one. This query again doesn't make any sense because each student can only take each course once. There's only going to be one enrollment. But just to make the point, if you have that column appearing in select clause, it must also appear in the group by clause. So you can fix it by also putting the s.name down there. So are there any questions about group by before we move on to the next one? Okay, the having clause. The having clause is going to filter results based on some kind of aggregation and computation. So you can think about it like sort of a where clause for a group by. So in this query that I have up here, again we're getting the average GPA per course. And what we want to do is, if you look down here, filter it by the average GPA greater than 3.9. So we want to get all of the courses that have an average student GPA greater than 3.9. Unfortunately, this is not going to work because at the time that the where clause is evaluated, the predicate is evaluated, we're not going to know yet what the average GPA is. So if you think about in the select clause as we're trying to compute the average, each time we need to decide if a tuple belongs in the output set. How do we know what the average GPA is until we're done kind of, you know, we've done all the aggregations. We can't know during a filtering time what the average GPA is going to be. So we need to wait until we're a computed all the averages and then do a final filtering pass. And that's what the having clause allows us to do. And like I said, it's basically a where clause for a group by. So in this case, we have this having clause down at the bottom. And it says, you know, compute this query result, but then filter out only the tuples that have an average GPA greater than 3.9. Unfortunately, I think this syntax is not standard. So what you actually need to do, some DBMSes might lay, run it. What you actually need to do is duplicate that average statement. The select clause you need to duplicate again down in the having clause. I don't know why, because as I said, it's evaluating kind of as the last step of the query. So if you think about as it's parsing it, it should know what the alias as average GPA is. But for whatever reason, that's how this data is. So kind of this having clause lets you perform a filter on aggravated values. So just as an example here, if you perform the first part of the query where you get the group buys computed, you get all the averages and then you filter it out. You only return the one tuple that has the correct area. So string operations. Strings are usually simple, but unfortunately in SQL they are not. All of these different systems implement strings slightly differently. So the SQL92 standard says that strings are case sensitive and that when you are specifying strings you use single quotes. You look at kind of a list here, the one that stands out and it's probably the most annoying. And I'm not sure if this is true in the newest versions, but at least for a while, my SQL was case insensitive. So if you spelt Kanye like that, it's a weird way to do it. It still matched to all lowercase or all uppercase case insensitive matching. Some systems allow you to use single quotes, some allow double quotes, but the standard and this should probably work in all of them except for the case insensitive matching. My SQL is that the string matching is case sensitive and that you are using only a single quote. So I mentioned like operators which allow this to do pattern matching earlier and there are two string matching operators that we care about. So the percent sign that I showed will match any substring including empty substrings. So it will give you zero or more character match. The underscore character will match any one character. So in these two examples here, if you want to get all of the course IDs that have the 15-prefix that's going to do it. If you want to get the student logins that are like percent, so any substring at C underscore will get you a single character match there. So it could be CS, CA, CBD, whatever. The second piece to string operations in the standard are these different string functions. So the problem here is that a lot of DBMSs have their own dialects which makes switching between them or working with different ones kind of tricky. But they can be used in either the uploader or predicate string in your creator. The question is, is an empty string the same as null? I cannot speak about every database management system's implementation of it. But in the general sense, and I think according to the standard, an empty string and null are different. So an empty string is just a string of length zero and the percent sign will match to it. A null string is, or sorry, a string column that has a null value specified as null. The value is just undefined. We don't know what the value is. So it doesn't have a length. We don't know what it is. So string concatenation. Again, seems like it should be pretty easy. I have three different syntaxes up here for three different databases or database management systems. The SQL standards is used as kind of double bars, operating different concatenate to more strings, but pretty much every system you look at is their own special way of doing it. I don't know how many support the SQL standards effects. I can show you a few examples here. So in SQL 92, we should have these double bars. So if your system supports SQL 92, then it should support the double bars. Let's try these three that I have and see if they do. So we're on Postgres here. I'm going to paste in this query. So that's just going to concatenate the two strings data and base together. I'm going to call it a startup. Let's see if it works. So Postgres has the double bar operator that is SQL and we two could find. So if we switch to my SQL, let's see if they do this. They do. It doesn't show me an error, but the answer is zero. I don't exactly know what that means, so let's say where they can concatenate the string data with the string base. It's a zero. That's a little weird. I guess another option we saw in the list was the plus sign. Let's do it again. I guess I don't really know what it's doing. I don't know what this means, but it thinks the result is zero. So let's try my SQL as this concat operator. So concat works with my SQL. So my SQL's special unique syntax for string concatenation is this concat operator. And let's try the last one here, SQL white. So SQL white supports this double bar. If we look back on this, we can see there that if we had Microsoft SQL server, they actually have this plus sign. And my SQL only knows what concat means. So again, this is one of those things that's kind of tricky to be careful about which specific dialect we're using. So daytime operations is another whole dumpster fire of unique dialects. So daytime operations are obviously used to figure out the current date, differences between dates, how many days it has, that sort of thing. You can use again, you can use them both in the output and the predicates. And the syntax varies widely. So just as a quick demo here, I'm going to show how you get the number of days since the end of this year. Again, it seems pretty simple, but as we will see in a second, it is not. So we're back at Postgres here. Let's try out if we can get now. So select now gives us the current day on our networks. Let's try now on my SQL that works, that is today. And let's try now on SQL white. OK, SQL light doesn't know what the function that now means. So you will try a different one. Select current timestamp. Doesn't know what select current timestamp means. Maybe our last option is just select timestamp, not a function, just a system value. OK, it knows what the current timestamp is. So kind of Postgres and my SQL support analysis impacts using SQL light, which I think you are for the homework assignment to get the current timestamp. You need to use this current timestamp now. So back to the original question, which is how do we get the number of days since the beginning of the year? Well, let's figure out what is today. So that gives us one. We extract the day from the month. That's going to use the value one. And if we want to figure out the difference here between two dates in Postgres, we can do it like this, is going to give us the difference between the date, today's date, and the year. So that works fine here. If we switch over to my SQL and take a look, it's going to tell us 800 days. I don't know how we get 800 days a year. I think there is some kind of weird differencing that's going on in my SQL. Someone answered this online somewhere. I think the first digit is the difference in the month. So September 9 minus January 1 gives 8. And then the difference between the two days, the first day of the month, 01 minus 01 gives 0. So I think that's how they're getting their 800. I think we may be tested out on a different day tomorrow. So if whatever that guy online said was true, then we should get 801 there, right? I guess, well, okay, so I guess you can believe some things you read online. Okay, so this clearly doesn't work to get us the number of days. So another way around this that I figured out here is that we can get the UNIX timestamps of the number of seconds or since the epic. January 1st, 1970 or whatever it is, get that for today's date, that for the date of the year, round it, and then multiply by 60 seconds times 60 minutes times 24 hours a day, and hopefully that's going to be this great 243 days as we expect. MySQL actually also has this date-diff function which makes it a little easier and that also gives you a 40-day. So SQLite, which is, again, I think what's being used for the homework. Let's try this out and see if it works. Okay, it thinks that there's zero days between today and the year, which is obviously not true. So for SQLite, what you're going to have to do is use this Julian day function. So you convert the current timestamp, remember, we don't have now, I have to convert the current timestamp to the day in the Julian calendar and then subtract it from the date before the year. That's going to give us 243 days, and hopefully if we round it, we will get 243 days. Nice, okay, great. So kind of, you would think something as simple and as frequent, I think in a lot of applications, date manipulation is something that comes up a lot. You would think that this should be pretty standardized, but as hopefully this has demonstrated, all of these systems are all over the place in their implementations. Okay, so any questions about the date and time steps? Since I think there are a few homework questions about that. All right, so I think we're running worldwide here, so I'll try and speak up a little bit. I'll stick around after if people have questions. I need things clarified. So output redirection basically allows you to store query results in another table. So essentially what you're doing is you're using this into keyword, which says select all of the distinct CIDs from enrolled, and we're going to put them in the course IDs table. So it's going to perform the query, select distinct CID for my enrolled, and stick them all in a new table with defined course IDs. MySQL has this kind of weird syntax which you create table course IDs based on this query. And I think there are some other different syntaxes that are out there that you can support. But the first one is what the standard specifies. So there's also the ability to read and create a table. You can just bulk insert basically a bunch of tuples from query into another table. And the inner select clause has to generate the same column, so the schema of the query needs to match the schema of the table you're trying to assert. And different DBMSs have different ways of handling different integrity violations. Imagine, for example, you specified that you wanted your student ID or student login to be unique. And you tried to insert a bunch of tuples of duplicate values into the table, then different DBMSs would handle that differently. So some will throw an error on the first duplicate they find. They'll roll back and remove all the tuples that the query was inserted. Some will throw an error, but keep all the work that they've done up to that point. So you can stop that way through. And some are just going to ignore the error and keep going. They'll give you a warning or something in the end that says, hey, there was this number of violations, but we inserted all the references that we could. Sort of a difference. This is output redirections. It doesn't have to go to the terminal. It can go to another object in the database. Output control allows you to essentially format or reorganize the output. So this order by statement allows you to perform sorting. Remember, we talked about SQL is an unordered multiset or bag. So there's no order or relationship between the tuples in relation, but you can impose an order in a query based on some call. So in this case, we're selecting these student IDs and their grades, and we're sorting it or ordering it by grade. So by default, I think it's going to produce it in an ascending order. So these are like some graphically sorted. So all the A's first and then other grades. So you make this a little similar. You can also say order by one. And what that's going to do is going to order by the first column. So they'll sort it by the student ID. If you instead wanted to, for example, order by grades descending. So in reverse order, you specify this descending key word. So order by grade descending. And you can again have a comma separated list like the other operators. We're now we're sorting by first by the grade descending and then by the student ID ascending. So that'll end up here with first we'll do the grades, figure out that sort of order. And then if there are ambiguities based on the grade, it'll sort by the student ID. And again, you can mix and match these things here. You can sort by the grade descending. You can use one to specify the positional offset of the column. Okay, so here's the order by which gives you a sort order for your output. The limit clause is going to restrict the output to the number that you specify. So for example, in this case, if we say women 10, it's only going to give us 10 results from this query. So this is like to see a web page where it has new results. This is one way of doing it. You can specify I only want the first 10. And if you want to get maybe the next 10, you can specify, okay, give me limit 20, but starting from offset 10. So in this way, you can kind of control both the number of two poles that you're getting back as well as the starting position in the result. And kind of the tricky thing you have to remember here is that because there's no order, this isn't necessarily well defined. There's no guarantee that you'll get a disjoint subset for these two queries. What you need is if you had an order by clause, then you could guarantee disjoint outputs. So the next piece, this little tricky, is nested queries. They're often really difficult to remember. I talked about this query optimizer. They're often really difficult with the query optimizer to optimize. Usually it tries to rewrite it to something that it's a little better at. And inner queries can appear almost anywhere inside a query. So think of this like a function that returns the result to the outer query and then the outer query works fine. I'll show you an example here. So imagine we have this inner query. It's here in the parentheses at the bottom. Select SID from the roll. That's going to get us all of the student IDs. So think about it like the result. All of the student IDs from the roll table. The outer query is going to then get us all of the names from student where it finds an SID in. So it appears somewhere in that inner query result that we computed. So the way this executes is first. We compute the result of the inner query, the nested query, and then the outer query can reference the results that are produced there. So again, let's get the names of the students in 15, 445. So if we want to do this in a nested query format, what we want to do is figure out, okay, select the student IDs from the roll where the course ID is 15, 445. And the way that we're going to figure that out is by saying, okay, where the student ID is in this result set that is produced by the nested query. So these two SIDs, SID here in the nested query is going to show up from the SID from the roll table, whereas SID in the outer query because of the query scope is going to refer to the student table. So kind of, we saw that example with joining. This is essentially performing a join between the student and the roll to just re-written in a little bit different way. So the different types of operators that we have for the nested queries are these four. So all means that you must satisfy the expression for all rows in the subquery. So the subquery produces a result. However, your outer query is that result has to satisfy all of the rows in the subquery. Any operator means you have to satisfy at least one row in the subquery. So the at least one row that shows up, the in is equivalent to any and exists is similar as at least one row is returned. So there has to be at least one row that exists. So just as a quick example, if we want to get all of the students in 15.445, we issue this query that says, get me all the student names where SID equals any and then from this subset that we've generated of all of the student ID is primed and rolled to table. We only have a few minutes left here. So I'm going to get to the next pieces to make sure that you can do the homework. And again, I will stick around a little after to answer any questions. So this not exists. I mentioned the exists query. So we're there. You want to select star from the course where at least one tuple exists in the output set that gets generated by using the not operator to invert it. So you're saying where something does not exist. So this find all courses that have no students enrolled in it. So you're saying essentially we don't we don't want there to be any tuples. There should be no tuples that exists in the subquery that we're providing here. So for example, if we do select star from rolled where course ID equals the rolled ID is going to produce nothing. And then we're starting to going to produce the advanced topics database course and they're going to be no students that exist or a role in the course. So this course ID here refers to the course ID. I want to get to these last two topics because they're important. So window functions. Window functions essentially perform a sliding calculation across a set of related tuples. So you might think, okay, it sounds kind of like a group by aggregation. But they're in this case, they're not being grouped into a single aggregate value. So this is useful, for example, if you want to do something like you're analyzing time series or something, maybe the moving average or sliding average or sliding window over some data like that. So the syntax for this is that we're going to apply function name, select function name over some partition from the table. So you can think of the over piece like a group by and I'll explain what this means in a second. So the aggregation functions appear there and the over functions is how to slice up the data for your partitioning. So the aggregation functions, what can this be anything we discussed earlier? So, you know, average, min, max, count, all that stuff. The window functions, the special specific window functions that we have are row number. So that's going to give us the number position of the current row and the rank. So that's going to give us the order position of the current row. So row number is the total overall number of the results. And the rank is the position within a partition. So if you recall, like I said, the SQL semantics have no sort order on them. So there's really no way to enforce this without kind of these window function operations. So these enforce some kind of order on the tuples in your result. So just as an example here, this is going to give us the row number over no partitions. So it's empty, so you just want to give us the row number over the whole relation. So if you look at the results there, they're numerator in order of the row number there. So that's what that row number function reduces. So like I said, the over keyword specifies how to group the tuples together, and the partition by keyword gives you the group that you were calculating. So in this example, we want to get the row number over, and now we specify the partitions from a partition by the CID or the course ID. So what it's going to give us is the students, the row number of each student partitioned by course ID. So if we look at it, you know, we get these three different course IDs, and we can see kind of the student row numbers are only, the scope is only per course ID. So the first partition is for 15, 445, row numbers 1 and 2, second partition, and the third partition that counts the row numbers reset for each one. You can also specify an order by, and like I said, this gives us the ordering, so then we'd be able to get a distinct order. So in this case, we're going to order by CID for the partitioning. So just as a quick example here, we're going to find the student with the second highest grade for each course. So we have to combine a couple of things. We want to get the rank, which is the local partition, local position of the partition for each record. We're going to partition by the course ID. We want to order by the grade of ascending. So what this is going to do is going to sort that each partition by the student grade ascending, and we're going to take the second one. So it's ranking that rank equals 2. I know I'm covering this really fast, so I apologize, but I really need to get to the last part here for the homework. But again, if anyone has any questions, please stop me. I'll try to answer them as quickly as I can. Okay, this is the last piece, and also these are really cool. So comment table expression is essentially a right-of-way to write auxiliary statements for use in a larger query. So imagine it's going to create a temporary table that it can reference later in a query. So it's sort of like an alternative or a different way of specifying nested queries and views. So the syntax here is like this. So with CTE name, specify is just an alias that you would call it, and then you specify a query that you want to define as the comment table expression. So in this case it's just a specifying slate. So it looks kind of like a nested query, but what this is doing is it's essentially, like basically you can think about it as computing a result set up front that you can reference later in a query. So you have a handle to the CTE name that you can then reference like star from CTE name. So I'll try to show some concrete examples here. You see the name in this reference gets used later in the query. So you combine individual columns using this notation here where you want to reference CTE name column one. You can rename them however you like and use them in later columns. So later in the query. So it's example like in the bottom query there you end up adding column one and two together and you get three based on the row's return from the CTE. So just as a concrete example of this, you want to find a stoop record with the highest ID that's enrolled and at least one course. So with the CTE source as defined as the name of the CTE, you have the max ID and that query is just all it's doing is selecting the max student ID front end role. Then in your later query, what you're going to do is select the name from the student and CTE source where, so this is performing essentially a join, the student SID and the CTE source max ID that you calculated in the common table expression you indicated. So again this CTE source is referring to the value that you specified out there. So I think this is going to be the last thing you're just recursion. Really cool. Andy said that I have to show you a demo so I have to show you a demo. So CTEs allow you to essentially implement recursion in single words. So the flexibility it gives you, this allows you to kind of compute an arbitrary recursive function or a loop if you think of this. So for example imagine you wanted to write just a four loop in Python to go through and count all the numbers from Python 1 to 10, or print out the numbers from 1 to 10. This is how we look in the, using the CTEs SQL syntax. So you have this recursive source called counter and all it's doing essentially is saying select one union with one or sorry the counter plus one. So if you think about how recursive calculation work, you know you keep calling down the stack until you get to the base case in which case the counter will want to return when the counter is more than 10. So it's going to keep calling the CTE and uniting the results until we get to 10 and then it'll stop. We give this the answer back. So we run it, we see kind of what we can implement. Kind of an arbitrary recursive or four loop function using SQL, which is, I will give it to Andy. It's pretty cool, but cool. So just to conclude really quickly, SQL is not a dead language. It's constantly evolving. The standard is still evolving. New things are being added all the time. Every system has its own weird quirks or dialects. So you have to be careful switching between them. Most always try to compute your answer as a single SQL statement. So when you write a SQL query, it's always best to try to get all of the computation that you have done in one query. Because otherwise you're going to end up going back and forth to the DBMS multiple times and can't optimize across all of them. So I will answer the question once I can just add the homework. Homework was released. You just have to write some SQL queries It's posted on the website. And it is due on Sunday, September 12th. And remember, there are no late days of homework. So you start by writing some videos. Thank you for watching. See you next time.