 back from lunch. Can I speak to the room host? Is he here? All right, good afternoon everyone. My name is Bruce Mamjan and I'm raring to go. I am one of the Postgres core team members and have been working with Postgres since 1996. I work for EnterpriseDB, one of the Postgres support slash product companies. And I'm excited to be presenting to you a tutorial about common table expressions for the first half and window functions for the second half. This is a three hour tutorial. It goes to 4.30. We will take a break in the middle, probably between the two presentations, pretty much 20 minutes or so, and kind of give you a little chance of decompressing as we move from one topic to another. The reason I'm excited about this talk is it takes probably the two most powerful advanced SQL features and presents them in a single talk. This is an excellent talk for people who are doing a lot of development work using databases, application development, data warehousing, analytics, that type of thing, and gives you a good launching pad to understand how you can make the most best use of your SQL database. Now, all the examples I'll be talking about actually are related to Postgres. We've had these features for several years. In fact, probably five or six years. I have been told this morning that MySQL implements a lot of these features as well. In fact, Dave, can you address that at all? Okay, so I'll just repeat what he said. He said MySQL has set CT since when? I'm sorry, 5.8? CTEs are version 8 or window functions? Window functions and CTs both in MySQL version 8. Okay, great. So, as I understand, you use Postgres' features as an example. So, what I'm showing you should work in both databases. In fact, these are SQL standard features. So, if you're using any SQL standard database that implements the SQL standards for these particular options, you should be able to use them very capably. Yes, sir. Thank you. Thank you, Dave. So, again, this is really two talks. Again, both of them have a similar theme of making the most use of SQL and the current sort of standards that have been adopted in recent years. The slides that you're going to see actually are on my website. So, this website right here is actually a place you can download the presentations if you have a desire to do so, and I'm going to go down for a minute. Let me see if I can focus this a little better. It's a manual focus, I guess it is. Oh, that's just the zooming. Anyway, that's probably as good as we're going to get. But, again, feel free to download these to this presentation as well as the other one. There's really 20 or 30 presentations here on that website related to Postgres, and also there's some security talks and many of them have videos. So, if you find something interesting, you may find there's a video that goes with it, and you can hear me droning on and on about these particular features. Again, we will take a break between the two presentations, and again, this is ideal for developers. Now, there's two ways I can present these talks, and I'm going to take a little hold of the audience to decide. One way of presenting this talk is for me to be speaking and you to do Q&A. I answer questions about how certain things are handled or how would I implement certain things. In that particular example, we would basically talk about some slides, again, take questions as we go, talk about some more slides, take questions as we go. We wouldn't take any breaks for you to actually practice some of these features. Okay, so that's the first one. When we do mostly Q&A, we don't take breaks for you to actually practice. The second option is where we will take breaks after each section, maybe 5, 10 minutes to give you a chance to actually practice the features that you've already seen. Now, of course, because these slides are already online, you can almost cut and paste the examples into PSQL or the PG Admin for you to kind of practice it. So the question I have for the audience is I'll look for a raise of hands. The first option is mostly Q&A. It's all Q&A with no breaks for actual hands on. The second one is for breaks for hands on. So how many of you would like no breaks for hands on and do all Q&A? Okay, that's about 10. Okay, how many of you would like to take breaks so you can practice what you're doing? Okay, I think the first group one, by a slight margin, sorry for that, what we will do is that 20 minute break that we take in the middle, that will be your chance to practice the previous section. And, of course, I'll be here at the end if you want to take another 20 minutes after to practice, I'll be glad to stick around to do that. So that's how we're going to kind of work it. Again, some groups, as I've done this twice, one group shows the one almost universally and the other universally. But, again, it has to do with type of audience you have. Again, this is a tutorial so any questions you want to ask, feel free, yell out, raise your hand, whatever you want to do. Does anyone have any questions before we get started? Yes, sir. You can't find the presentation. Okay, so if you go here, this is actually under SQL. There's an SQL sub-tab under presentations. That's exactly what it says up here. Right? Okay. And, again, the reason for the confusion he's right is that when I took the two presentations and combined them in the tutorial, I couldn't say programming, yes, common table expressions and window, posters, window function magic. It would just be too long. In fact, if you go look at the screen, the title doesn't fit on the screen of the scale. Then it just goes to dots because it's too long. So that's why it kind of got compressed. Other questions. That's a great question. Okay. I do think you're going to get a lot out of this. I really enjoy these talks. These two topics are really kind of close to my heart in terms of really empowering developers because a lot of developers have a tendency to look at databases as a key store or as just a data store for them. And when they approach a database, they're just basically thinking, okay, I'm going to put the data in. I'm going to get it out. And everything that I want to do, I'm going to do in my application. That is the way we did it in the 80s and 90s. That is no longer the most efficient way today, but that might be the way you did it in the past or that might be the way you were taught in college. Or you might have worked for an organization that did it that way. And frankly, if you're using a relational database as just a data store, you're really not getting the full benefit out of the database. And the reason is because these databases now, we're using the current SEO standard, are really giving you a whole tool set of things that make your life easier as application developers. Okay. It doesn't make sense for you to reimplement a whole bunch of things in an application and perhaps implement it over and over again in different languages in different applications, when you can actually structure what you want as an SEO query and give your problem to the database. Right? Give your problem to the database. Let the database solve your problem and give you the results that you want. And that's I think the major theme I'm going to be talking about, not only with common table expressions but with window functions, the idea that you can now give your problems to the database, press the SQL in a way that it understands and let it give you the answer that you want. That makes you more efficient, that makes your application less bulk prone, that makes your application smaller, that allows you to get more work done. And I think these two topics really work well in that kind of thing. So being prepared as we go forward to say, oh, I didn't know I could do that, or oh, I've already done that in my application, now I don't need to do that anymore, I can rip out all that code and actually express it in SQL, give it to the database engine, let it give me the results that I want, make my life easier. Right? The reason relational databases have been around so long is not because everyone likes to spell out the word between or select, or what some of the other really long ones, some of the key words are really long. Nobody likes that, but it is a very expressive language and it's a declarative language, which I'll talk about in a minute, which allows application developers to be very efficient. And that's really where you that's where you distinguish yourself as an application developer from someone who's just a grunt who's going to throw the database, into the database and pull it out and then go through all sorts of gymnastics in your application and somebody who really uses that database for what it's good for and gets things done well. And I think that is a real area that we're lacking and I think these two talks are going to help with that. So let's get started. So the first thing we're going to talk about is the concept of imperative versus declarative. That is sort of a general concept that I think is interesting. Then we'll talk about the syntax for common table expressions. Then we're going to kind of go sort of up the ramp, get harder, and we go to recursive common table expressions. This is where the SQL starts to be programmatic or it allows you to sort of do things, push a lot of your work from the client application into the server. And then section four is where we kind of hit the cruise control and we go through a whole bunch of examples using common table expressions showing you how you can do a whole bunch of problems again using this feature. And then finally, section five writable common table expressions. This part is probably specific to those graphs but it allows you again as an application programmer to bundle a whole bunch of queries together and issue them as one statement. And I'll talk about why that's valuable. And again, I think you'll be impressed. Any questions? Great. Okay, first imperative versus declarative. And we have a gentleman with a baton kind of instructing the quartet here how to play the music. Imperative programming languages. These are languages you're probably familiar with. See Java, Perl, Python, JavaScript. These are languages where you tell the computer what to do. You are instructing the computer, conducting the computer, go do this loop, you know, assign this value, here's the fifth statement, do this other thing and you're telling the computer what to do. And this is the way most applications are written in imperative languages. And again, that's the methodology that you're normally thinking of as an application programmer. I need to do X, Y, Z. How do I get that done? However, SQL is not an imperative language. It's a declarative language. What is a declarative language? The declarative language is one where you tell the computer what you want and the computer decides how to give you the answer in the most efficient way. Doesn't that sound great? All of a sudden you're not having to go through every little if statement and every little conditional and you know, call your functions and all that other stuff. You're actually saying, I want this, you figure it out. And there's a whole bunch of reasons that's great, particularly because the system knows the data layout, the system knows how common certain constants are that you use in your query. I'm not going to get into that. I have OptimizerTalk that explains that. But the server, the database server knows a whole lot more about the data than your application does. And it is able in many cases to give you your results much faster than if you took a baton out and tried to instruct what you want the server to do. And that is why SQL is so powerful. That's why it's been around for so many years. And that's why it probably is going to be around for another 30 years. Or 40 years. 50 years at this point? 40 years. So this is typical imperative language. The first example is in basic. I'm showing my age here. That's what we used to do. If you ever want a program that just says hello over and over again, that's what that does. There's an example in C. There's an example in Perl. You're actually in telling the computer what to do in this case, say hello over and over again. This is a declarative language where you're just saying give me hello and then give me another hello and so forth. It's not easy to do an infinite loop like I could in simple SQL become table expressions make that possible. I'll be showing you that in a minute. Just be aware, imperative languages typically applications are written in imperative languages. Client applications typically use imperative languages. Server-side programming typically done in imperative languages. And the one place that you can do imperative languages in a database is using comment table expressions. The reason comment table expressions are a new way of doing SQL is because it takes a declarative language, which is SQL, and it gives you an imperative way of controlling it. So imperative is not bad. It's just that a lot of cases declarative is easier for you as an application developer. There are a bunch of cases where you need to comparatively control that database. And that's what comment table expressions give you. It's a way of almost taking SQL and programming it similar to how you program in a language. And I will show you a whole bunch of examples where it is actually tremendously powerful to be able to comparatively control SQL. Which by nature, normally by default is a declarative language. Any questions? Let's look at the syntax. This is the comment table expression syntax. Looks very simple. Effectively the word that always starts a comment table expression is the word with. So the word with basically indicates I'm about to start a comment table expression. Then you have an optional keyword called recursive. I'll talk about what that means in a minute. And then we have an optional query name. And we have a query name. We have the supply. Then we have an optional column name over here right here. Then we have the word as. And then we have effectively a select inside this comment table expression. I realize it doesn't make sense now. Give me a minute. I'll show you some really good examples. And then once you're done all that with stuff there's a final select down here which says here is my main query. So every comment table expression you're ever going to see always going to start with the keyword with. It's going to have multiple one or more sub-queeries or sub-selects or sub-queeries. I don't want to use the word sub-select. With selects inside. And then there's always going to be one at the end. We're going to call it the main one. So you're going to have some slums at the top and then you're going to have a main one at the bottom. That's always the structure you're going to have. Any questions? This slide. These queries are very complicated. So keep your eye on the red parts. If you see a query and you're like where should I be? Where should I be looking? Look for the red part. That's what that slide's for. We got red part right there. So now all of a sudden you see it and you're like oh! I want to look. I want to look at that red part. So here is our first SQL comment table expression. Starts with the word with. Remember I told you that. It has a query inside the with. And then it has a main query here at the end. Follows our syntax. Or keyword with. It has one or more of these selects inside here and then a main query at the end. What can we learn from this? Well first off we've named this as source. In fact this comment table expression right here is called source. And in fact I'm referencing the same name down here in my main select query. And this is a pattern you're going to see over and over again. You have a comment table expression. You give it a name. And then in your main query you reference that name. Now this is a completely contrived example you would never do this. But it is helpful to see what it does. You have select one is my sub query. So a row with one. It's called source. And my main query what is it doing? It's saying select source. And what do I get out? One. Because that's indeed what my queer sub query said. If I had a two here a two would come out there. Okay. Again we're going to start really slow. And then we're going to work up. So don't think this is a baby class. Because by the time you get done you're going to be like your brain is going to explode. But right now we've got to start slow so we can get up to the complicated parts. And the window functions the same way. First 30 slides you're going to be like wow we can get everything useful. And then all of a sudden it starts to be like whoa where did that come from? And then to the end you're like wow I've heard that many times. In fact a lot of people said that we thought we knew these features until we saw this talk. And we're like okay now now we know it. We thought we knew it before. Another thing I should point out is that this talk and the window function talk were both written first as SQL. So what I did is I wrote the SQL I wanted to show. I ran it and then I copied it into the presentation. So this URL right here is actually the SQL from this presentation. So if you download that SQL and just run it in PSQL or in pjab and you'll see the whole presentation fly by on your screen. Just hit the pause button. But that's the entire presentation right there. And again if you go to a website and you download these slides you just need to click on that. It's everything in pink. You click on it it opens a browser to that URL. So you don't even need to copy it on it in the PDF. Any questions? Okay here's a new one. So same structure starts with with has the outer query select uses the same name. But what I did here was I labeled the column. Do you remember I'm going to go back to the previous slide. You see how this has this weird print seek unknown column thing. It's like hey you didn't give a label so I'm just going to give it this question mark. But over here I actually cleaned the label. So now when I do it I actually get a label here and I can actually reference source.col1 and in fact gives me a thing. I can't do that here. I can't reference the column name because I didn't give it one here. So I just get this amorphous question mark or something. But again this is normally the way you would do it. You label your columns so you can reference them in that main query. Here's a new one. Instead of using the label here I put it up here. You might remember that from the syntax. So instead of putting the label in the select I put the label in the comment table expression. Either way doesn't matter. I get the same result. Right? Make sense? Third one. This one we're going to basically push every feature or label feature we can. Label it as call1 in the query inside the comment table expression. We're going to make the comment table expression call and call2 and then in the outer query we're going to call and call3. Notice I had to call and call2 here because I had to reference this one. And then it actually comes out as call3. So I'm going from call1, call2, call3. Same data. I'm just labeling it three different times. Alright? This is illustrative. Again, we start really slow. But we're going to get some really crazy stuff. If you don't see crazy stuff by the end I want you to tell me. We can do more than one column. We can do select1, and now we get two columns. We're not limited to one column. We can do two columns just fine. Any questions so far? Okay, I know. We'll get there. I want to give you a little refresher on Union because Union ends up being used quite a bit with comment table expressions. You'll see why in a couple of minutes. But Union is kind of our glue to kind of bind some of this stuff together. Union's been around for a long time. Again, it just was very convenient to be used with comment table expressions. And what I want to remind you is the handling of Union. And I think this is sort of an unfortunate but probably logical behavior. But when you use Union on its own it will automatically remove any duplicates. And I'm sure the FGO Standard Committee has wonderful reasons for this. But it is kind of annoying. Here I'm selecting one comma one and I only get one out. If you use Union all you don't get the duplicates. So just a reminder, if you ever use Union the first thing, if you type the word Union, this is the way I do it. Every time I type the word Union I have to think, do I want duplicates or not? And the big complaint I've heard is normally in SQL you add keywords to add features. Here you have to add keywords to remove functionality. You have to add the keyword all to remove duplicate removal. Which is not super logical. I understand why they did it that way. But practically it is an issue. Just be aware of that. Union all. In most cases when I'm writing SQL I want Union all. And when I was an application developer I remember one or two times when I'd be called in and they'd say this report is not showing the right number. And I'd be like okay I'll look at it. And I'll come back like two hours later and I'm like I'm sorry customers they had the same amount. And the dupes got removed and then what I do to solve it just came out wrong. Right? Imagine because you have two counts the same you know. Anyway be aware Union all is something you often want to do. Oh so here's a new one. Now I mentioned that you can have more than one table expression and this is our first example of that. So we now have a select that returns one two. And we have a second select that returns three four. The first one is called source just like the previous one. This one is called source two. Okay? So we have select star from source Union all. I didn't really need the all here but I just did it because I didn't want duplicate removal more of a note to myself. And I say select star from source two and I have one two and three four. Okay? So I've now taken two common table expressions pasted them together with Union all basically. Alright? I'm going to take two common table expressions with actual tables because I have a constant set to this point. So here I'm saying select language name language role from PG language and I'm doing a join here. Okay? And then I say select star from source and I get internal. Basically this is completely useless. I'm taking a query and I'm giving it a name and then I'm just selecting the data. Right? I'm not doing anything but hey it's a good illustration. This is the first one that actually came in with common table expressions. Praise God. We have a query here. We call it source. We're doing our join. We're getting the language names. I'm saying select star from source Union all. I need Union all here. The all part. Select the main language name null from source. So the first four rows are these. Okay? Ordered by language name and then I have the an additional row that I didn't have before. Alright? I don't think you'd ever want to do this but if you did this would make sense. Right? You can kind of see I got my query your query could be huge. This is just a baby query. I could have a huge query and now I can do like different analytics on it and return different rows of different types of data in it. And I'm only running this query once. That's one of the key things is the way Postgres implements this. That top query is fully run before we do the bottom part. Okay? So if you've ever had a case where you have to take a big query and run it one way to get one type of answer, another way to get a different type of answer, this is your solution put that big query into common table expression and then you can do all sorts of manipulation with it on your own. Okay? Here's another one. Here is a select from PG class and now I am actually calling a class and now I'm going to join this result to PG attribute and then return five results whatever. Okay? But here's the first case where I'm actually joining to a common table expression as though that common table expression was a real table even though it isn't. Okay? Here's another refresher. Case statement. I'm not sure how many of you have used case statements before but a case statement is basically a conditional that you use in SQL. So for example this case statement right here is going to print positive, zero or negative depending on the value of call. Now this is a stupid example but again it's like an if statement for you. Yes sir? Yeah. So this is an excellent question by this gentleman he's saying we have a limit five down here but we don't have any limits up here. So how would you actually have to execute this entire query instantiated in memory or still with this and then do the join down here? The answer is yes. The common table expression is what we call an optimization barrier that is actually good in some cases and dead in other cases. There's some reasons why we have to make an optimization barrier for some of the later features we have. We've talked about removing the optimization barrier which would allow the query to move around. We haven't implemented that yet but I think one day we will. We've debated a couple of times and I think we're getting closer to doing that but the idea of an optimization barrier in this case is what the problem is and in fact the operational problem is you do not know, well I know they're all going to join. Actually I don't know they're all going to join. Yeah, I do know they're all going to join. So I know they're all going to join. So effectively I could probably move the limit up there as well but if I didn't know they were all going to join if I put limit five up there and only four of the rows join they only return four. So yeah, this is a problem. Great problem. Another thing that actually people have asked about a lot is is there any overhead to these common table expressions? Not really. If you realize they're an optimization barrier there's really not a whole lot of downside to them. Same thing with window functions. There's really not a lot of performance impact or a reason you would avoid using them except again because of the optimization barrier issue and just be aware of that. Okay, other questions? Yes sir. So what is the difference between having a view versus a common table expression? Great question and this is actually we can illustrate it right here. If class was a view we have a, well let me back up. So queries in Postgres go through four stages. Parser, re-writer, optimizer, executor. In the re-writer second stage we take views and expand them into the places they're referenced. So what would effectively happen here is this query would get basically expanded like a macro right at that spot and we would be able to optimize and remove and move around where clauses and stuff like that in this particular case. So I would definitely argue that view is going to be much, much better in terms of optimization. Great question than trying to do a common table expression here. Yeah. In fact people often use common table expressions when they want the optimization barrier. They may want operationally that one query to run before the rest of the query but if it's a case of just I want to encapsulate this thing, you're better off using a view. Yeah because in a view we don't have the optimization barrier and we can move around stuff and we get much better query plans. Great question. Other questions? So again back to this conditional here, the case statement. It's basically like an if statement and you're going to see some examples of this. Okay. Any questions? I actually just added this slide. If anyone wants to explain why this slide is recursion during the break or is that anyway? There's actually a reason why this is actually a perfect recursion example. Yes. So the question is would we conclude that views are better only because we haven't yet used any complicated things? And the answer is no. In every case of view is going to give the optimizer more opportunities to move things around. The reason where you're going to use and again what it is is you haven't seen yet why this would be useful. Right? And you're going to see that in these coming ones. Right? But effectively the reason the common table expression is useful is because it gives us that imperative control. And we're going to see from here on a whole bunch of cases where that imperative control is crucial to doing what we need to get done. Up until this point you're absolutely right. There's been absolutely no reason to use common table expressions in any example that I've given you. Thank you. Very good. Other questions? Great. Okay. So recursive. Again before we start going up the hill as I said in the beginning. A recursive common table expression is one where you are potentially executing the SQL multiple times. You don't have to but you're potentially executing the SQL multiple times. Now in this case you don't. I've added a keyword recursive here. You might remember it from the SQL syntax I showed you earlier. Remember that? Okay. So we have that keyword recursive but effectively it's the same query. We have select recursive source as one and we turn one. But it's not very powerful. It's kind of where we start to go way up. This is where we actually see a common table expression that effectively goes into an infinite loop. Now I don't recommend you writing common table expressions that go into infinite loops. There's probably a little value to that. Okay. But you are going to see a whole bunch of cases where the ability to do looping in SQL in an imperative way becomes very powerful. Is the mic working? Is everyone hearing back there? You're good? Okay. The perfect mic is the one that I can't hear, right? I'm just making sure everyone can hear clearly. Okay. So what I actually did here was I said set statement time that equals one second. So it canceled itself and I used with and I used the word recursive. And this is the structure we're going to see over and over again with recursion. There's this sense that you have to prime prime the recursion. Okay. In this case I'm selecting one. That's my prime to prime it. And then I have a union all. Again I talked about union all before to paste the second query that references my own common table expression name. This is where the recursion part comes in. So we have one become source. The second query returns one. Another one goes up. But all of a sudden we're infinite loops. This is effectively what's happening. You select one it goes into source. You select from source. You have another one it goes up, down, up, down. I can't get out. Okay. And this is the pattern we're going to see over and over again. Here is another example. I am now displaying the word hello over and over again. Remember the example I showed you earlier with imperative where you just said hello, hello, hello, hello. Effectively I have hello. I select source. Source is hello again. Hello, hello, hello. It doesn't stop. Not very useful but that structure we're going to see over and over again. Okay. This is interesting. This is interesting. It's exactly the same as this query except for what? All. Thank you. See they all disappear right here. And there it is. And it's gone. And hey I labeled in red so I don't probably realize that's what was going on. And what's really interesting is the hello goes into source. The hello comes down and when it realizes there's already a hello in there nothing gets added so it returns just one hello. Okay. An example of almost every case you're going to want to use union all in these recursive queries. Yes sir. So in the different layers that I've talked about where does that optimization happen? It happens very quickly. So you're absolutely right. Does it generate an infinite number of hello's before it terminates? It's really your question. No the parser does not catch it. What actually happens is the executor executes the hello. It then goes into source. It loops around again. It then takes the hello and it tries to insert the hello into the existing batch it's holding. Because remember it normally thinks it's going to run and create like an in memory or just spilling copy of source. Right. And because nothing gets added it knows I don't need to come back here again and it just exits. Great question. Other questions. I have to download the slides. Yes you're absolutely right. This gentleman is right on it. There is no where clause here in this second part. Okay. And because there's no where clause effectively it tries to go forever and frankly how would you get out of it and the way you get out of it is a where clause. Okay. And here it is right in red even. This is a query that counts from one to ten. If you think of imperative languages when you learn imperative languages print the numbers from one to ten do a factorial. I mean all this academic business when you're learning a language we're doing this now in SQL. In fact I have a factorial example coming. So I'm going to go through CS 101 here. So we have select one which is what we call our seed value. Starts for counter primes it. Okay. One goes up to the counter. Now I have select one and I check to see if it's one less than ten. Yes I return one plus one is two. And then I go back up here two, three, four, five, six. When nine comes down nine's less than ten I return ten. Ten goes up. Ten less than ten no. So I don't return anything here and I exit that comment table expression and pop down here to return the result. Okay. And that is the result set. Now we do have an easier way of doing this in Postgres Generate Series makes it very easy to go from one to ten but you get the idea. And this is how you would do it in Perl. Again for those of you who know Perl it's just an example. We have somebody who knows a lot about it here and maybe Randolph would know. But anyway we have a little example here for people so you can have a framework of what we're doing. Okay. In fact this actually example is not that accurate. This one is accurate. And I'm not sure he caught that or whatever. But in this case you're going from one to ten and you're not really queuing the results you're printing them as you go. Okay. In this example I hope it's correct. Effectively what you're doing is you're creating a function f and then you're priming it with one. Right. And the one is going up here. It's going into arg. It's printing the one and then it's calling the function again with two as long as it's less than ten and then it goes what around. Exactly that one still is not right. This one is right. Because even here I'm doing recursion this one I'm not doing any recursion. But this one I'm doing recursion but I'm printing the results as I get them. Right. And that's not the way common table expressions work. They don't feed the result down to the main query until they're finished. So this one is right. I basically create a function I start it with one into an array. Then I call it again. I do it to ten and then I output the array. So this is really the right example because I am queuing up the values in memory into an array and at the end in my outer query I'm printing the array. Does everyone get that distinction? Right. I went from a simple for loop which is really not what we're doing in SQL to this. But again this is not putting them as I generate them which is also not accurate. This is the more correct example. Yes sir. Well you know I thought about making one test in here but it was kind of simpler to do it this way. So this can construct in Modern Pearl. Basically it says if I haven't passed anything in, which I haven't, use one. And I would argue that that actually is more accurate. Okay. You know we could probably spend quite a bit of time understanding which get pearl is exactly more like common table expressions but I think this one got really close. Any questions? In fact just inside somebody I was at a conference and somebody said you know he said I looked at your presentations and it says though you write them to be studied. You know he said I guess because of the red and stuff and I'll just go through it slide by slide and I'm like there's a whole story inside a single slide sometimes and I think that's an example of this. So I used to be a high school professor and did some college work so I guess I'm used to that but I also love it. And I know we have a lot of teachers here as well. Trainers. Other questions? Okay. Examples. Again I've shown you a lot of academic stuff. A lot of stuff that doesn't make a lot of sense. Now we're going to actually do some cool output stuff with this data. Okay. So factorial. I promised you every time you learn a language you got to learn how to do a factorial. And we're going to do that in SQL. I don't recommend you do it but again great example of showing how to make an imperative problem. And you force SQL to do it. Now Postgres has a factorial function. There's no reason to do this but it's very illustrative of seeing how to do something. And we see all of the common features. All of the common structures of common table expressions here. Starts with with. Which we know. Uses the word recursive. Which we know. Then has a source and then it labels the two columns. Counter and product. We start with one comma one as a prime and the first column goes up by one each time. The second column takes the counter adds one to it and multiplies it by whatever the product was. And we do it up to ten. And then at the bottom we just output what we have. And if we look at the output that's exactly what we think. There's ten factorial right there. And we also have the factorial of all the numbers up to ten. But now you may not want that. You may only want to see ten. So what I effectively did here was I added a where clause in the outer to say let's trim off all the other stuff that came from the common table expression. Just give me the ten that I want. And that's a very clean output right there. So yes you're correct. If I try to do any filtering in the CTE it's going to the effect of that filtering is to stop it from continuing. Now I did do filtering here. Now what I cannot do, and this might be where you're going. I can't have this where clause and get rid of that. Because the problem would be I would never finish the common table expression to get to that where clause. And because there's an optimization barrier I can't take the where and pop it up into the common table expression. In fact because the counter wouldn't equal ten because the star is one it would just stop. That's a perfect example. And again keep asking questions. This is a tutorial. This is what it's for. You choose not to have those breaks. Let's do the Q and A. There's a wealth of stuff here. When we do window functions it's going to be a wealth of stuff. I guarantee you like whoa what in the world was that? I'm used to that. So the question is without recursion is there value to doing CTs instead of views? The answer is there are performance values to using CTs instead of views. There actually are. The reason is because there are some cases where you as an application developer know more about your data or you have a particular query that uses a very expensive function. And you want that function to be run under your control not over the way the optimizer happens to reorder your function poll. Or certain joins you want done in a certain order and then outside you want to do something else. So there are a number of developers who will take a long query and actually break it up into parts because they want to control the order. Now we should probably add an option to turn that off but my guess is we'll probably keep the optimization better there by default particularly for that reason. Another question. Can you tune the optimizer so the CTs are not an optimization better? I think we need to add that feature it's on the to-do list but we don't have not done that. So you cannot tune the common table expressions not to act as an optimization barrier. I would like to see that happen but it has not been done yet. So effectively on this and this is actually the pearl example of what we've talked about. Again here's the priming here's the recursion right here and we're pushing the order array and then we're outputting. Now you might have seen me do a lot of numbers. You might think well what can I do with strings? The answer is you can do strings too. So here's the example. I'm priming it with an A and then I'm concatenating an A, so I primed A into here and then I say as long as the string length is less than 10 give me another A. I select the whole thing and I get all these As like that. Not really exciting but it works. So you might say well I want to have a B next to an A and then a C. Well if you kind of parse this big thing here it's effectively making the last letter and then getting the ASCII value and adding 1, 2 and then converting back to a character again and then you can and you get this. A, B, C, D, E, F, whatever. And there's some example you can imagine some use cases you might want to do this for populating a table or something like code table or something like that. That's kind of cool. This one always gets people kind of laughing. This is actually always the demo people use for common table expressions. It actually gives you an ASCII art. I start with minus 10 and I go to positive 10 and then I print a certain number of spaces and an X and a certain number of spaces and an X and I get that. So I'm drawing a big X. There you go, exactly. Big X. Here's another one. This should show you a little better what happens. We go from 10 minus 10 to 10 and we print zero spaces for 10, one space for 9, two spaces and then again this size gets smaller. And here we print I think eight spaces or nine spaces and then two letters next to each other right here at zero and then all the way down. You can do other things. I can make like a diamond by changing the way I'm computing this. So now it's a diamond. I can make it more rounded if I want. Use a little power there. I don't know what that is. It actually looks like a piece of bread. Everyone seems like Italian bread like that. Hatchaporty if anybody knows Georgian food. That's what I used in Russia. Everyone laughed at that in Russia. But there are some breads that look that shape. And then I can do kind of this. This is more like a hatchaporty to me. It's not really a wider, not really a great circle but it's the best I could do. And I've seen ones where you do Mandelbrot, Christmas trees. I mean there's all bunch of stuff you can do. I don't think it's very useful but it's very illustrative. I think of that imperative aspect. Here's another one again usually a CS101 example prime factors. So every integer has a certain number of prime factors. For example for 100 it's 2,2 and 5,5 66, 2,3 and 11. And you can actually do this prime factorization in SQL. Using case statements, as I told you about before. Here's the prime factor, the 56. And I basically run through and I go from 1, this is my counter. This is my remainder. And this is whether it's a factor or not. And again I do it until the factor is 1 and basically for 56 I go down and all of these that are true are factors. So 2,2,2 and 7 are the prime factors of 56. In fact if I actually change my output and I say only show me the rows of the prime factors this comes up a little better now. 2,2,2 and 7 all my prime factors right here. The question is does it execute this module 3 times? Indeed it does. Now here's the prime factors. I just picked the number 3,2,2,4,3,4 right? And if I run it these are the prime factors of that number. Very imperative. This is actually the prime factors of 66. And again what you're noticing here is it's trying every number. Like why is it doing 9 if it already did 3? Like kind of weird. Why is it doing 8? Why is it bothering to do numbers it already did? So we can actually optimize this. Of course make it faster. Here we only do the odd numbers after 2 and now that trims down quite a bit of the examples we have. It has an exit case there. And again if you only want to see the prime of 66 there's the answer. This is again the example in pro for that. Questions? So now we're going to get into an actual application that kind of you might do as a real person. It's always a good idea. Here basically what we have created is a part table and what we're doing is we're inserting a whole bunch of parts. Now the interesting thing about this part table is that it has parts and then it has sub parts. So think of an airplane. What does an airplane have? A fuselage, wings, wheels, tail sections. But then inside the wing there's a whole bunch of parts inside the fuselage there's a bunch of parts. The wheel has a bunch of parts. The landing gear, the tail thing. So effectively for each we have a number of parts like part one but then we have sub parts for part one and then part two and sub parts for part two. And then part 11 has some sub parts and part 13 has some sub parts. 22 has some sub parts and 23 has a sub part. So think of this as a tree. And the tree uses a sort of tree traversal in FQL. It's very common for common table expressions. Without common table expressions if you need to traverse a tree you have to effectively call you have to issue a query. Give me the top level parts. And then for each top level part you then have to send a query again. Give me all the sub parts of the fuselage. And then for all those sub parts give me all the sub sub parts of the fuselage right? And you have to keep going down down down. You're issuing queries over and over again from the client over and over again. What common table expressions do is they give you an ability to do all of that tree walking in FQL. Again that imperative aspect where you're telling it walk the tree of the parts, walk the tree of the plane and do certain behaviors. So this is how we prime it up. Here is a recursive common table expression. We're saying for part two which is this set right here. Give me all of the sub parts that are parts. So this is the fuselage. Part two is the fuselage. Twenty one and twenty two are the sub parts. And then these are sub sub parts of the fuselage. Now you might not like that but this is a very simple example. All we're doing here is we're priming it with two. We're saying give me join the table part number and give me all the sub parts and then return that two up here and then keep going. So that's how your leap is going here. Like that. Walking down. So that is a great point. This is one of the first cases where the sub query here is returning more than one row. Because up until this point we've always had it returning one row and one to ten or whatever. Now our query or this query here could return multiple rows and in fact that means that this three rows get added there and then three rows get instantiated. Now it's a three row table instead of being one row table and it's going to roll through. Now because of the way this structure is set up I don't have any leaps. So I'm never going to loop around and end up back where I started from. Because that would be bad. You can't save the wheel as part of the fuselage which is part of the wheel. Like it doesn't work that way. Nobody would set up a plane like that because it would never fly. But one interesting use case and I normally don't cover this but I've been sort of harping on the fact that we want to use union all in almost every case. Now if you're worried that your data has loops in it this is an excellent and maybe I'll add a little mention down here now that you picked that up. Nobody picked that up before. Maybe if you're worried that your thing has loops in it and you're willing to have the overhead of a duplicate removal you may want to remove the all here. And what that will do is prevent it from going into a loop just like back on slide twenty something when we did the hello and we didn't use union all it didn't loop. Do you remember that? Because it knew I already had one I'm not going to add another one. So if you're worried your data has loops you might want to remove that union all there. So all it's going to do is just remove the rows that were duplicate any other unique rows are still going to be set in. It's not going to stop on one it's going to say oh I have three rows one of them is duplicate now I'm only going to pass two rows up and loop on those two additional rows. Yes ma'am. Great. Thank you. Yes sir. So you're asking if you can add a limit one here out here? I'm sorry. Here? If you add a limit one here you're only going to get the first part of the sub parts. So if you may have an example this would be a great example. So if you added an order by here and you said I want to see the closest friend of this guy and then I want to see the closest friend of that guy, of his friend you would eliminate all of the secondary friends and only get the primary friend if you could have a ranking of friendship for the guy or the woman. I don't know. I guess there might be some cases where you would not want to fully go through the whole hierarchy of sub objects and just pick one. But you'd have to have an order by in there. And what I don't remember is the order by is executed for each iteration or only at the end. I think it's each iteration but I have to look at that. Nobody's ever asked me that question. But normally you're going to want to use probably not limit unless you're sure and you're probably always going to want to use union all again unless you have loops in there and then you want to eliminate the loops and that would be a good way of doing it. Yes sir in the back. Can you be a little more specific? You used some acronyms I didn't know. Yeah. So that's I'm going to have to go here to kind of talk about that. So if I was returning the parent of these objects then the union all would get me into an infinite loop. So one of the reasons that I'm just returning the sub part that goes with the part is if I return the parent the parent would keep repeating and I'd never get out of the loop. So I think anytime you're going to get duplicates unless you get rid of that all you're never going to get out of that query. But there might be cases where you'd want to stop at a duplicate and that's where you would use the get rid of the all basically. So if you have a child node that has no parent multiple parents. Oh oh oh oh oh oh oh oh. So if you have a child with multiple parents you're going so that means you have multiple ways of going in and in that case you only would remove the all because then you would get rid of yeah that's another example I hadn't thought about but you're right you would have your child may appear multiple times. And if you use union if you got rid of the all it would only appear once. That's right. Now there's another trick to this and the trick is that I'm only returning the part number I might want to return the parent to go with it. Now I'm going to show you some examples in a minute which really go a little bit onto the steroids kind of range because right now I'm just returning a number but wait till I show you a little bit more here. So this I'm just returning the part number and again it doesn't handle the case where a child might have the same part. You might have the same part in the fuselage and the wing. The same bolt might be used in both places. So that's a good example. So in this case I'm actually returning the level with the part. So here now I'm actually putting a level and I'm having a different number of dashes depending on what level it's at. So now I'm displaying graphically how far down the sub part is. Alright? Using this part in red here. And I can also order it so now instead of having the part and the sub parts notice actually these two parts are parent of this one and this one's fair but they're all kind of bunched together because of the way I've ordered it. So now I'm going to order it by the tree and the tree is effectively a version of all of the part numbers. So now I'm getting not just the part in gentlemen you should take a look at this. Now I'm not returning just the part, I'm returning the hierarchy of the part. And that's where I start to get rid of my, I don't have duplicates because it's the same bolt appears in two places with different parents. Different tree hierarchy. So effectively what I'm doing is I'm now giving the tree hierarchy to come down. But this is kind of weird because I'm actually doing it as text which is kind of awkward so in this case I'm actually creating a postgres array and I'm appending the parts to the array and now I'm getting a nice array that I can order by. So instead of doing the ordering by ASCII which is really terrible, don't do that. Right? Because again it's not going to work the way you want it to. 9 is going to appear after 11 because 11 is ASCII less than 9 right? So by doing it this way I get everything in an array and then I get a graphic. So in a lot of cases you wouldn't want just the part, you'd want to see the whole hierarchy like that. And now if we have a part that appears in two places it's going to appear two different times even though we use union all. And we're going to get a clear this part if you're here and this part also appears in this hierarchy. Alright? And the full output is basically this. You can see the different levels, the different part numbers and then the tree and then the array that gives me the structure of all that. So again this is kind of taking that part building more and more and more until we're starting to do some really useful stuff. Again not having to repeatedly send queries over and over to the server just sending one query. The server is doing all the work much more efficiently than we could do. And then returning the result back to us. I think very very powerful. Any questions? This has a dependency table inside built in. So I'm just going to show you an example of displaying the Postgres dependency table. This is an internal table that's only used by Postgres. You'd never use it yourself probably. But here if I create a table called Test I can actually look at the Postgres dependency table and I can actually output the dependencies that come from that new table I created. And if I add a if I don't want to see depth these are the dependencies without seeing the major one. Now if I add a primary key to the table I now have two new dependencies. Right? From my table. And if I add a serial column for new dependencies. Okay? And again you can see how internally Postgres has a dependency table. One table has a couple of dependencies. And again I can walk that hierarchy within this complicated SQL query. This is very Postgres specific. Again you would never use this probably. With the idea of creating a dependency graph in SQL obviously very possible and very easy to access using common table expressions. And effectively this is the tree right here with arrays of object ideas. You can see some of them are primary, some of them are sub-objects, some of them are sub-sub-objects. And again we can go down as far as we want. We're getting to the sort of end. Well I promise that you would be confused or at least surprised. And what we're going to do is we're going to conclude with writable common table expressions. Again this is Postgres specific as far as I know. But it does take common table expressions to a whole new level in terms of what they can do. Okay? And once we're done this section, once we've done this slide, we'll take a 20 minute break, go out, have something to drink, or just stand up and walk around. And of course when we come back we'll do some functions and hopefully your brain will have relaxed by them. Because the winter function talk is again very complicated. So you will need that break to sort of let your brain unwind a little bit. At least I always do. Any questions before we get started? Okay. So writable common table expressions. The writable common table expressions. So these two bullets are not the same thing. I will try and highlight the difference. Writtable common table expressions allow data modification commands in the with clause. So you've normally seen only selects with common table expressions. What we're going to show you now is cases where you do insert, update, delete in the with query. Select at the bottom. So the main query is a select but inside that common table expression is an insert, update, delete. With a returning clause which I'll tell you about in a minute. Okay. That's bullet one. Bullet two allow with clauses to be attached to insert, update, delete statements. In this case the query inside the common table expression is a select and the main query at the bottom is an insert, update, delete. Okay. So two examples one has the insert, update, delete in the with query the second example has the insert, update, delete in the outer query, the bottom query. Does that make sense everyone? That distinction? Okay. First example we're going to create a table called writ demo I don't know why I call it that, return demo I guess. And we're going to insert three random numbers. But we're also going to use a special postgres syntax called returning. This is not sql standard I think David can you confirm that's not a I think it's a postgres baby there Yeah. It kind of takes this stuff to the next level So for example if you're inserting three random numbers you don't know what those random numbers are. But if you do a returning clause at the end you can actually get the results back from the insert. Alright. That's kind of cool. Here's our first example Inside the common table expression I'm doing an insert of three random values and I am using the returning clause What does the returning clause behave as? It really behaves like a select. Doesn't that look like a select I'll put to you right? So it's an insert with a select kind of at the end which only returns the rows that I inserted That's kind of cool. So here I can say insert three random values return as a select and then give me the average of those three inserted values. That's kind of cool So three random values give me the average there we go Here is another example delete all the rows in the common table expression. Give me all the rows in the table but return a select of all the deleted rows and give me the maximum one. What is also very interesting about this is this happens as a single snapshot as a single transaction So this would be very hard to do in another example because you would not be sure you got all the rows. Somebody could have inserted a row between your delete and then you're like doing some weird stuff with me So the delete happens in the same snapshot I'm returning all the rows that results coming right back and I'm doing the calculation. There's no way to get in the middle. No way for somebody to get in the middle there There's literally no way to do this unless you just lock the table and then in a transaction block delete it select it delete it and then give them that kind of awkwardness where this is like BAM one query. So is this hoping atomic if one of these was to die or fail? It's completely atomic it is one block. Now if you're doing this inside a transaction block then again it's a single query and again it's controlled by it could abort the whole query you could have a save point in there whatever. This is one single query. If you're not using a transaction block then effectively it's a transaction Could you use that to move from one table to another? Absolutely you can use it to move from one table to another. In an atomic way that's right. And what you could do is you could yes you could delete it this is exactly where you're going you delete it and then insert it. I have an example right I'm getting to it. I will show you an example So this is the second bullet This is the case where we do a select inside the common table expression and then we do an insert update delete outside. So here I'm going to do another table called vet demo 2 and I'm going to insert 3 random numbers I just seem to love random numbers. And here I say with source give me the average of my rows in vet demo 2 and then delete everything less than the average. Very very hard to do in normal SQL unless you lock the table and then do your operations. This is one query So we're basically saying give me the value and then I actually now I can look and see how many rows are left in fact because I delete all the lowest values I only have one row left Yes sir. Where am I? What's the transaction? Where am I committed to the transaction? Postgres is in auto commit mode so by default every query is its own transaction that's committed when it finishes It is not like Oracle is not in auto commit mode you have to commit after every command unless you turn that off In postgres the default is to be in auto commit mode you can turn that off but by default it's in auto commit mode. So the question is insert to data manipulation language. Postgres effectively that entire query is one transaction whether you do insert or delete I want to show you an example in a minute that does an insert and then an delete and then another insert that is one query that is one transaction block if you're in auto commit mode you don't even have you can even put DDL in a transaction block in postgres whereas Oracle again we're getting on to steroids here but we've had this for 20 years we've always allowed DDL in transaction blocks currently DML multiple DML in fact DML pasted together with common table expressions which is what we're going to show you in a minute I haven't even gotten to the most complicated one yet yes sir so yes it does so this is auto committed this is auto committed this is auto committed and that is auto committed thank you thank you very good point so here's another cool one basically I'm saying take part two generate all the sub parts and then delete them all a common table expression that does a recursive select and then delete everything it returns one thing this does not do and I need to go back to this gentleman if you had a sub part that was used in two places I'm afraid it would delete it which is not good if you had a sub part that was used in two places and you didn't delete one of them the other one would delete the whole thing would go away that's a problem so maybe that's something we should work on but again this is a tutorial you started talking about having a part in two places and then I started to think that would be a problem great example walk a tree delete a tree return a tree now a walk a tree delete a tree really really very convenient here's where we can start to go so here I have a red demo 3 again as I told you I had the SQL if you want to run it you can run the whole SQL command fly it off your screen red demo 3 put three random numbers in okay give me the average of the three numbers in the table second common table expression delete all the rows that are less than the average and then return both ways to return what's left if I wanted to but I didn't just want to delete it so we're saying give me the average delete everything less than the average show me what I just deleted one slide so here we start to go to the next table frankly here we have an orders table which has an items table you have orders and each order has a bunch of items in it so classic case now I can say insert into the orders table but you might notice that this is a serial column so therefore it's auto numbered and I'm using it as a fall and then give me the default order number assign the default order number during this insert normally you then have to find the order number with next val or left val per val and then use it that way but what you can actually do is you can say return the serial column I just assigned and then insert my order using the same order I just used so what I've done is I've inserted into the orders table and the items table in one statement instead of having to do the query and then say give me curve val and insert curve val I'm doing it together okay here's another one delete from the orders table return the order ID and then delete all the items for those orders so again just like that delete the orders and then as the same thing delete the items for the orders so you start to combine stuff together becomes very very powerful here with another one delete from the orders table then delete from the items table and then take the order number and insert it into another table like as a backup so I got an old like I have an old orders table and I want to just say this is what I believe and probably I'd probably add like a curve val here like current time stamp so I know when I deleted it but again this is just an example right but that's how you would defect if we do it if you wanted to move data which the gentleman was asking about right you would delete and then you would insert into the other table and in one snapshot in one visibility right you'd move the rows from the one you'd be deleting them they'd be returned and then you'd be inserting them somewhere else and then commit it together you can't lose any data there's no visibility you don't even have to lock the thing that's the beauty of it normally to do this you have to lock the table select it delete it insert it and delete it and while you're doing that the stuff's locked here you're locking each row as you're deleting it but you're not deleting you're not you're allowing other inserts to happen to the table at the same time right so it's kind of like a moving stuff around but at the same time people can be adding new rows that aren't part of my snapshot I don't see them I don't delete them nothing happens right tremendously tremendously powerful this is not only making your life simpler in terms of having one query that does a bunch of things but it's allowing you to do something concurrently that's very hard to do any other way and while you've commentatable expressions almost done allows a third of processing of sql merges mobiles to your queries and they're connected application logic together unifies them improves performance by issuing fewer queries less transmission overhead less overhead for optimizing for example uses the same snapshot for the entire query very hard to do this outside of commentatable expressions have multiple queries use the same snapshot you can use repeatable read mode if that kind of gets you around some of that but then you still have to lock the table which you probably don't want to do but it does add an optimization barrier and that is all I wanted to cover today for commentatable expressions so what we're going to do is it is exactly three o'clock is that right yes and what we will do is we will break until three twenty and we will come back and we will talk about window functions so twenty minutes time just to remind we will start promptly at three twenty so we'll get started in a minute we'll get started in a minute okay thank you welcome back everyone I won't need to go through introductions again because we did that same story presentations are on my website Resmomgen work enterprise DB one thing I will tell you about this talk is that we implemented winter function six or seven years ago and at that time I thought hey this would be a great talk let me write a talk about window functions and I started writing a talk and I got stuck and I kept the sql around again as I said before I write the sql and then I write this talk and this summer I got a chance to go back to this topic and for some reason it gelled in my head and I was able to write the talk so this is some parts of this is six years old and the reason I got stuck is because there are some aspects of winter functions it is really hard to get your head around so the first part of this talk will be getting your head around some concepts and some terminology so as we get into some actual use cases it will start to make sense so again this should be about an hour ten take us to four thirty finish up maybe around the course to answer any questions or help anyone after that I might go to four thirty five might be next to five minutes we shall see any questions what are we going to talk about first we are going to introduce window functions what they are conceptually then we will talk about the syntax of window functions very similar layout that we had with common table expressions then we will look at the window syntax with generic aggregates that may not make sense to you but these are things like sum and count average then we will talk about window specific functions things like lag lead rank and then in section five it really gets interesting where we take a real world example and do analytics on a particular table with a certain number of employees and finally we will finish with discussing considerations of window functions okay so introduction to window functions what are window functions window functions are part of a larger set of features that Postgres supplies for data analytics things like data partitioning materialize views common table expressions right there that's something you can use for data analytics bring index grouping set roll up optimize aggregates table spaces all things you would use for data analytics window functions is merely an additional feature that people use for data analytics and again follows the same pattern we did before the concept that using window functions makes your life easier as application programmers having window functions and not having to reinvent the wheel and giving the database server the problem and letting it solve it is often the best and most efficient approach when you need something done and that's again the policy we come through this entire thing so one of our window functions is a section from the Postgres documentation slightly modified by me and committed to probably six or nine months ago so this is what the Postgres 10 box looks like effectively just to highlight some words here normally an aggregate will reduce or aggregate a number of rows into a single row however window functions do not work well as rows to become grouped into a single output row like non-window aggregate functions would instead the rows retain separate identities behind the scenes window functions can access more than one row so when you think of running an aggregate you think of taking a whole bunch of rows and getting a number of rows into a single output in window functions the rows retain their distinct characteristics you do not eliminate rows when you're using window functions and that is perhaps the most fundamental difference between normal aggregate and a window function aggregate I know it doesn't make a lot of sense right now there's a whole bunch of stuff I'm going to talk about in the next 10 minutes or 15 minutes because I'll be talking about concepts that you think would have nothing to do with window functions but in fact are fundamental to understanding them as I said earlier a lot of people have seen this talk have used window functions before say oh I thought I understood them now I really do understand them in addition there's a lot of people who when they need to do a window function just do a web search find a window function that kind of does what they need to do down and they hope it works that works some of the times there are some pitfalls to that so I believe if you're going to use window functions sort of put it on autopilot you should understand what you're doing with these words you're typing and that's really the goal of this talk any questions again keep your eye on the red because we're going to use the same pattern of looking for red text here is a query I'm going to use repeatedly in the first part of the section the function called generate series which effectively just degenerates the series the numbers from 1 to 10 in addition just like the common table expression talk we have a window.sql file which is this entire talk as an sql file that you can run in your browser or in psql okay yes this here so when you call a function inside of a from clause in postgres you need to label the name of the virtual table and the name of the column in that table you could kind of guess that the function would be called generate the table would be called generate series but it's not really clear what to call the column so we kind of require you and the sql standard kind of requires you to actually label what that function's name is for the virtual table and the name of the columns you could have more than one column. Other questions here's our first window function wow we got the window function on slide 8 right why do I know it's a window function because of the word over anytime you use the word over you're now taking out and making it act as a window function just what it is that's what over does there also is an optional frame clause and other clauses you can put in parentheses in this case we don't have any so we're just going to use it as a fault and when you actually look at the result it's really contounding like this is the best we could do and indeed but that's for the fault behavior the fault behavior for a sum with no print nothing in the parentheses is basically to sum the entire set and place on every single row one of the aspects that you need to realize here is we have not reduced the number of rows generate series generated 10 rows we have 10 rows what effectively has happened and again it was in that description before from each row we have reached up and out to all of the other rows in some cases so for each individual row and you can't see it here you'll see it later for each individual row we're reaching around to the other rows and putting a value on that row that is representative of values not only on that row but potentially on any other row in the set I'll say that again the number for a window function represents not just that row but potentially can be affected by other rows in the set so does that add them up only once or does it reference practically I think it's doing it every time I don't know if we optimize this case or not I've never looked at the code we could have in this case but in other cases we couldn't have so whether we do that or not I don't remember and in fact I don't even think explain would tell you you have to look at the C code to see whether when we realize that the frame clause is going to affect all the rows the same it would be pretty easy if you look at the code because you see a parameter that said all the same or something like that but I've never looked at the code second example we're doing two window functions we're doing a count and we're doing a sum and each row can see 10 rows and each row sees a total of 55 and I can do this I can actually add a window clause here I'm adding a window clause called W which is the default and now I'm referencing W in both of the window clauses this is the shorthand if you're using the same window specification multiple times you can give it a name and then use that name reference the name any place you want in the query I mentioned there's stuff in parentheses literally this is the default meaning when you don't put anything in the parentheses the default meaning is range between unbounded proceeding in current row your first tar pit is the current row does not mean the current row literally this is probably what got me off the wagon when I started okay so I'm going to talk about that in a minute but just red flag that current row is not always the current row any questions? so let's look at the window syntax so here we have the full syntax function you have an optional partition by clause you have an optional order by clause then inside you have what's called the frame clause which can say the word range or rows and then you have a frame start or a between frame start and frame end and the frame start and frame end can be any of these things again here's the syntax if you want to look it up in Postgres alright quite convoluted I'll be talking about all of these things in the next between unbounded proceeding current row this is the default you get the default has no partition by has no order by and has the word range right here and again this is just something I'm going to talk about in a minute but range means current row default to representing all rows so current row when you see current row either as a default or you're literally typing current row in your query it can mean three different things it can mean what you and I would think it would normally mean which is the current row okay it can mean the first or last row with the same order by value technically called peers okay or it can mean the first or last row with a partition or the set and depending on what other words you use in the window clause current row can mean any of these three I don't think this is a great design from the SRL standard committee but this is the hand we've been dealt with and this is the hand that we have to play current row current row will mean literal current row only in something called range mode alright if you're in rows mode you're literally only current row if you're in range mode with an order by current row means the first or last row with the same order by value I'll be showing examples of this Dave looks like he's getting into gestion I don't blame you I remember going to my wife and saying I think I finally got you know like Einstein or Alexander Gambell I think I finally got it because I just kept digging and digging and digging to find out all this behavior which is documented in Postgres and documented as to a standard but because the names don't mean what they think you mean it's a very hard process to digest it's like if I say that this digit means six and if I type you know this and a two well it's actually sixty two not like the number it's like a number that doesn't look like a number anymore it's a different number it's kind of the same the current rows isn't really current row it could mean something else depending on what day it is right I mean almost that's where you start to get confusing and again without if you're in range mode and you don't have an order by current row means the first or last row of the partition in digestion so let's graphically look at this this is the literal current row right here okay I'm just doing some example with some duplicates so you can kind of see what's going on this is the literal current row literal current row in rows is the current row in range mode with an order by x it's two values that have the same order by value yes man so the question is I have a 30 second explanation for what a partition is I have a slide I think back in 50 something that actually shows you partitions up until we get to that slide just assume the partition is the whole result set okay so by default if you don't use the word partition partition is the whole result set so when I say here first and last row of the partition it's really the first row of the result set that's why we got 55 actually and I'm going to show you exactly why I'm going to walk through some examples okay great question other questions the window frame in rows mode between unbounded proceeding and the current row this is unbounded proceeding to the current row in range mode with me alright I added an extra one here because I went from range rows mode to range mode right and this is unbounded the entire partition or the entire set exactly what you're saying here right citation in this there are there's some cases in the docs where we talk about partitions but it's the whole set because we haven't defined one but then there's other cases where when you define a partition then it blocks off the set into separate partitions and I'll show you some examples probably in the 50 60s here's a better example this is actually the sql you would use for each case okay so just right now rows mode current row to current row bingo makes sense right start at the current row end at the current row I'm in rows mode order by x range current row to current row the current row from beginning end okay this is rows unbounded proceeding to current row in rows mode unbounded proceeding is this and this is rows between unbounded proceeding and unbounded following any questions no one's ill good now we're going to start to get through the weeds a little bit this is not a pleasant part of the talk it's like I don't know how to explain it it's sort of like kind of you get in the ocean there's all this like sand and like jellyfisher around get out to an area where it's nice and clear and we're kind of getting out to that clear area but there's a lot of jellyfish and like shells on the beach and they're going to hit you in the foot and just like yucky stuff you don't want to know what that is on the sand like I don't know what is that from dead animal whatever that's kind of where we are right so we're going to show you specific examples now walking through what I just talked about and I think this is going to be very clear keep your eye on the red text because that's where the differences are in the queries so this is the query we had before exact same query and remember I said that the fall is range between unbounded proceeding current row the reason this looks this way is because we're in range mode we have no order by so the current row is the last row in the set or the last row in the partition I'm going to back up again a little bit I talked about this current row can mean first or last row in the partition range mode without order by right yes sir if you say rows between current row and current row you get just your current row that's right if you say range between current row and current row then you get all of the rows that have the same order by value if you have an order by and if you don't you get the entire set right I think I have an example here you're definitely going to be able with that one but I do have an example okay so this is range between unbounded proceeding current row and again unbounded proceeding that would make sense to us the current row because we have no order by and because we're in range mode we are getting all the way to the end if I change range mode to rows mode this is the first time if we've actually seen specific changes on individual rows instead of a fifth the same number repeating over and over again yes sir and I challenge you to say it three times faster but anyway that is exactly it and you had to select every word as you said that to be accurate yes this is why I gave up six years ago yes and why I finally cracked it now this one is interesting because effectively what it's doing is it's going from the first row up until the literal current row so one is one three is two plus one six is three plus two plus one ten is four plus six we're all the way down and the fifty five is here because we know ten is one is fifty five right nine is forty five nine to one is forty five right so it's actually behaving in a predictable way here yeah that would be bad yeah when I get to the section four there's a whole bunch of manipulation you can do where you can reach back two rows or forward two rows or really cool stuff yeah I'm just starting like you know let's get to the water first right but we're going to get to a point where we can do all sorts of cool manipulation right now again just like common table expressions probably no one's ever going to do this actual query right there's no reason to but illustrative wise I think it works really well okay in fact if you look at the count it's literally telling you not the number in the first column it's telling you how many rows it sees that's what we're asking for the second row is count so how many rows I see two rows here we see three rows all the way down it's not copying this row it just so happens to be the same number that's it so in the example could we call current row the current frame no no because a frame is effectively a beginning and end of a set so it's really not the current frame it's a row the frame is literally has to have a start and end point so you can't say the current row is a frame because it doesn't make any you yes you can say yeah you say current row in that case is the frame end because if you look at the syntax that's exactly what the label says yeah but I'm not sure it gets you anything somebody had a question here yes sir so the question is is it actually going through and actually calculating this for each one the answer is yes the good news is that nobody ever really complains about performance for window frames because effectively what you do is you run the query and then you run this frame macro over it and effectively all the data is in memory it's not that big a deal I've never heard anyone complain about the performance of this and it's certainly always going to be better than trying to do it in the client application so I really would not be concerned about that whether we optimize certain cases I can't remember whether we do it yes please yeah it's pretty minor yeah I understand yeah and then you're doing the question is if you do something iterative and it's long strings and stuff like that yeah I don't know I've never seen people do that kind of thing sounds interesting yes sir okay so if you don't say between then the frame end defaults to current row except we don't know current row means it defaults to the text current row whatever that means yes and I'll show you some examples in future slides which actually show me using the default so I haven't gotten there yet but I have only a bunch of slides that show you one query with the defaults and one with nothing there and it'll default to current row always right because that again you can specify the beginning and then the end is just the default right other questions so yeah actually this is an example here's an example right here default end frame you must be looking at the slides right there we are right so effectively if you don't I didn't even remember that was the next slide so if you don't specify an end frame it defaults to current row and effectively you're getting the same result here and I'm saying I'm bound to proceeding and what is not specified but the fault is and current row between and current row okay so that's you use the word between and then current row there but again the same result what if I say this rows when rows mode because that means current row literally means current row okay so rows between current row and current row and what do you know every row sees one row and the row is actually itself because it's summing itself one time right actually makes sense and I can do this if I say rows current row what's the fault for this end current row right so it's current row to current row it's the same result current row to current row same as point win one for the team the whole thing is flipped over now I'm in rows mode which means current row literally means current row and instead of saying unbounded proceeding to current row I'm going to say current row to unbounded following and I just flipped the whole thing over I start with 55 which is one down to 10 and then two is 54 down and then the last row is just itself if you did rows between current row and current row would you effectively nullify the purpose of doing a window well the problem is I'm only showing you like numeric aggregates but there's a whole bunch of custom aggregates and I think you probably would you probably would never want to do that because the whole value of window functions is to be able to reach around outside your current row so if you did current row to current row and you're in rows mode there's probably no reason to do that I can see a reason for doing current row to current row in range mode or if you had a partition yeah that would be another example if you had a partition or not I don't think here's our first interesting one where we start to do something different I'm now saying rows mode between one proceeding and current row so I'm saying I'm in rows mode so little current row literally means current row and I'm doing one proceeding so now I find the count so it has no proceeding row so it only sees one all the other rows see two right if I do the count with an X instead of a star you'll notice these are the same the difference between count star and count X this one will count null this one will not the point is that these are not these missing rows are not null they're actually just not there and they're not counted what do you star the sum is the row plus the row before it if there is one so 3, 2 plus 1, 5, 3 plus 2, 7 4 plus 3, 9, 5 plus 4 all the way down 19, 10 plus 9 okay, kind of cool one following same trick other way all rows see two except for the last one okay 3 is 1 plus 2 5, 2 plus 3 all the way down 10 is 10 because there's no way to answer it 3 proceeding now I'm going 3 back so you'll notice that up until row 4 you don't see 4 rows because there isn't 4 rows 3 rows before you do these rows so the first row only sees itself second row sees 2 before third row sees 3 before fourth row finally can see 3 before and all the way down and you'll notice the numbers 1, 2, 3 they add up 10 is 4, 3, 2, 1, 6, 3, 2, 1 14, 5, 4, 3 and 2 and so forth all the way down any questions about that it's pretty straightforward I think and again these are we have a whole section which actually walks through these examples and does an analysis of employees so just hold on for that point we have to finish this section and then we've got another section to go before we can actually get to the examples when we get to the examples all of this kind of comes together some actual analytics on some actual data which returns some real results but we aren't there yet I have to get us there so let's look at order by, remember what I said when an order by happens all of a sudden current row in range mode means all of the values have the same order by value and I'll show you some examples of that so order by x I've left everything default and now you will see that the current range between current row actually this current row is literally almost the current row here current row peers are rows that equal value for the order by columns or all partitions if the order by is not specified here we do have it and we're getting a different result because we've added the order by, yeah what's this right here exactly the same, right so when I don't specify anything here I'm really getting this and when we now see the text it makes sense because we're range between unbounded proceeding current row I have an order by so the current row means all of the values that have the same order by value, we have unique values here so this order by for this data order by ends up being just the current row but just for this data we have duplicates and then it's going to start to make sense so practically for this data it is but I would almost tell you never to use the default just spell it out because it feels like the defaults are so error prone that it's really just not worth trying to use them I realize it's a lot of text you can use a window clause but there's just really no value of not using these defaults because the defaults will trip you up and I'll show you some examples in our real world data that will trip you up so here's another one this is order by x range, current row to current row and because they're all unique you basically get this output and each row only sees itself because there are no peers and each row has no peers you read the Postgres docs or really any docs on this you're going to see terms like peers and partitions and frames and a lot of times when I see here the word partition or frame or set I think they're the same thing no, in Windows the peers they're like completely different, they have very technical meanings one of the problems, I have improved the Postgres docs a little bit to Postgres 10 but even then when you start to look at the actual text they use you can't gloss over a term and say oh that's the same as some other term, it's not the same because they have specific meanings and that's why I'm trying to bring that out here so let's go to a better example that's going to illustrate the case for peers and instead of going 1 to 10 I'm going to just go to 1, 1, 2, 2, 3, 4, 4, 5, 5 so I'm going to interject duplicates here so if I now do my run just like before with the duplicate clause, the parentheses I get each row seeing all 10 rows and instead of totaling to 55 the first this set totals to 30 these are 55 now, all the numbers 1, 1, 2, 2, 3, you're all the way to 5, that totals to 30 so this is a query order by x now I haven't remembered the defaults so this is really this order by x range between about a same current row but current row because we're in range mode and order by means the first or last peer value so I'm going to back up why do I get these numbers why do these jump by 2 like that that's because each row can see all of the rows where it is peers of so the first row can see the second row the third row can see the fourth row the ninth row can see the tenth row and when you look at the totals 1 plus 1, 2 you get 2 for both of 6, 6, 2 plus 2 plus 1 plus 1 for both of them okay, again 12, 20, and here's our 30 right here, but for both rows because unbounded proceeding to current row, current row is the last peer value it's not my current row, it's the last row which has the same order by value as I do so it goes all the way for this row it sees this as the end of the current row whoa, what happened? this was range mode same query, this is rows mode what's happened? all of a sudden I don't see the jumping by 2 anymore I see literally once you do rows it says on 10 and my numbers are different because in rows mode current row means the literal current row, yes sir so the question is can you say order by x rows and not put anything to the start and I'm going to have to go back to the syntax I have to do, it looks like you cannot do that it looks like you have to say rows unbounded proceeding or un-rowed current row or something, you can't just say rows and stop I don't know why, but again that looks like what the SKL standard I grabbed this right from the Postgres doc so I would be surprised if this was optional we'd have a bracket here it's got to be one or the other I don't know why it was that way so you can kind of see in rows mode how that's working as soon as we switch into rows mode current row was not my peers anymore and I start to stop my frame at the current row I can do this, order by x range current row now this is interesting, I'm in range mode here here's the current row between current row and current row and each row sees two rows that sees two, one plus one, this sees four, two plus two, six, three plus three four plus four, five plus three it actually has some logic here if I switch into rows mode, here's range mode each row just sees itself and you're just seeing itself but when I was in range mode I saw only my peers but as soon as I go into rows mode there's nothing there any questions? partitions, we talked about partitions a little bit, again by default the partition is its own, it's a whole set but you can partition your data in this case I'm partitioning by x and what effectively I'm doing is I'm making each value its own partition so this is one partition and this is three partitions and you can see each row can only see its own peers because I've created a partition, I've got five partitions here and it basically ranges, it's basically this same specification so what I'm going to do is I'm going to break it up, I'm going to say one partition is everything less than three and the other partition is everything greater than or equal to three so here's everything less than three, here's everything greater than or equal to three okay? so less than three is red, greater than or equal to three now remember I haven't specified anything after this, so what is my default range between unbacked proceeding and current row in range mode there's no order by, in range mode current row is the end of a partition so now to answer the ladies question effectively when I do a sum here I am summing all of the values in the same marked partition four rows and I've got four values here, I've got six rows and I can see all six of them and in fact the total of these is 24, right here the total of these is 6 if I add an order by, well as soon as I add an order by again my default frame is range between unbacked proceeding and current row I have an order by therefore the current row the value of my peer group so effectively the first two rows can see, the first row can see two rows, the second row can see two rows, the third row can see four rows, the fourth row can see four rows and the same thing down here, starts out from here okay, yes so the question is in one case here I'm passing a value, I'm just ordering by a column value and the other case I'm creating a Boolean, effectively what I'm really ordering by is 0 and 1 again this is postgres, we have a Boolean type so effectively when I order by an expression, I'm really ordering I'm sorry when I'm partitioning by an expression I'm effectively just partitioning by 0 or 1, that's correct that would be just too much, now I could do modulus 10 and then I got like 10, 9 partition, 10 partition, right I would be doing a modulus so it would be, modulus 10 would be so I could put an expression function colon there, I could do some kind of arithmetic in this case I just used a column and a Boolean, but I couldn't put anything in there I could put a factorial, I don't know, I mean whatever I want to put in there, I could do alphabetic I don't even have to use numbers, I could put like letters I think, I'm sorry, so can you do it inside of a range clause, no it has to be, you can do it in a partition and you can also do it in an order by, you can order by an expression if you wanted to, so partitioned by x in, yes you can use the in clause, I'm sorry you can use anything you can use in F2L effectively, a function call, yeah it could be any expression, so it depends, oh that's right because in terms of two and four, so in that case you would only get two partitions yeah, thank you, yes sir, right so the question is if you had a mismatch on the order by how is that going to behave, I believe the order by applies within the partition itself, and it does not cross partitions, peers have to be in the same partition, in fact by definition, I guess I didn't make this clear, you can never cross a partition with a window function, they're like just like optimization sense for common table expressions, they are boundaries which cannot be crossed, and I'll show a bunch of examples later, so effectively every time you cross a partition boundary, you are starting with a brand new set, and again this expression is this and now it makes a little more sense, current row, order by, it appears and then you can kind of see how it works, now if I go from range mode to rows mode, right, see that, rows mode well it's one, two, three, four, one, two, three, four, five, six, nothing, so you hear it all because the order by doesn't have an effect on current row because it's just the current row, right, so the order by is kind of in material here, because I haven't I'm not really using any peer groups, there are some other reasons to use it for example, for some of the window functions I'll show you later, but in this case there's no no value, this particular equation, questions, yeah so the question is if you, if you, if the rows weren't in the order and you didn't have the order by clause would you still get the output in the same result, the answer is no one of the things that I'm not doing here, which I technically should be doing is to add an order by outside of the window function okay, the reason I'm not doing that is for clarity because I'm only using one order by the output by definition comes out in the order that I specified it doesn't, the specification, the sql standard doesn't really require it, but that's why Postgres always operates, so effectively when I only have one order by, I always get the results out, what you can do is you can have multiple window functions, always different order by's, and you have the same query using multiple order, you don't know what result, out order you're getting the result back in and almost every, all of those cases you're always going to have an outside order by, and I would always recommend that you always have an outside order by to specify what you want to come out as. No, because the order, I say that, okay, so the question is if you didn't have the order by with the sums appear the same, and you're kind of right if the reason you're kind of right, the reason that current row means unbounded following by the fault without an order by, in range mode, which is the fault is because a set that doesn't have an order by effectively has no definition of where the current row is. So how would you really know where it's supposed to be? So they kind of gave you the rows mode to kind of override that and say, I don't care, I'm specifying rows mode, whatever order the rows come in, that's the way I order I want to process them. Then current row can mean current row, but that's why I've had some discussions and I've delivered this nine or eleven times since I wrote it, and one or two times people would say the reason those faults are there is because the sets are not ordered unless you give an order by, and therefore the meaning of current row wouldn't make any sense of specifying the frame. And that's why you have to say rows mode when you really want to override this non-ordered value of the set, or add an order by in there, and then the rows mode, then there is the meaning for current row because you have ordered it. So I'm kind of backing into the answer but effectively that's the reason the faults are the way they are. Even though they're practically inconvenient, that's the reason that almost every case you're going to want to specify an order by, even if you're using rows mode because you want to make sure you're getting the data in the order you need to process it. And when you start to mix different, you get multiple window frames in the same query and then everything goes out the window because you don't even know what order you're getting the results back in. So that's a lot more detail than I normally get but you're absolutely correct. There's a reason that the fault is there is because the ordering is indeterminate without an order by specific functions. So up until this point I've shown you only things like count and sum and average, things you've seen already. But there are a bunch of window functions that are designed just to be used with a bunch of aggregates that are designed just to be used with window functions. Very, very powerful. So first one, row number. All it does is to number the results. So here's my table and here's the row number for each one. Now one aspect you should notice here, row number takes no arguments and operates on partitions not window frames. I have a slide later that summarizes this but every time you look up a window function in the manual it will say operates on partitions or operates on window frames. There's a whole bunch of functions that don't care what window frame you define. They operate on partitions. So you can go range mode, rows mode, unbender proceeding until the cows come home and you're going to get the same results from that function. And that's what I was saying. When you're reading the manual it will be very clear what you're looking at. How that particular thing operates because they're going to be different. Another one, lag. Lag is a row behind a row before my current row. So in this case one there is no row before it so it returns no. All the other rows just shift them down by one. So this is again a window-specific aggregate that can only be used with an over function over with a window function. And it's always a row before. And you can do lag too. Give me two rows before and again returns null for the rows that don't exist. I can do lag and lead. So here's the rows two rows before here's the ones two rows after. And the same qualification I gave you earlier these operate on partitions not window frames. It doesn't care about rows mode. It cares about order by. It doesn't care about between unbounded or any of that stuff. It just doesn't care. And you can define it to the false for non-sister rows. First value, last value. Give me the first value within the... help me here. What? The frame. Right, right here. This is our first window-specific function that deals with frames not with partitions. All these other ones operate on partitions. This one operates in frames. I have a slide at the end. Don't remember. It's a big slide of the old stuff in one slide. I know people look just... they pull up my presentation just to see that one slide. Because you forget. But this one operates a window frame so in fact the order by... the frame I specify actually makes a difference. Okay? Now why is that important? Because look, first value, they seem to work just fine. Like first value, it knows. The first value is one. Last value, not so much. Why is it doing this? Somebody would explain this. There's the default. The default end frame. And that's why. Exactly. The embedded proceeding is the beginning default. But the last row is the end. The end of the frame default is current row and current row in range mode with an order by or my peers. So literally this number is really that number. With this number. Right? Because it's seeing the end of it. And that's why it's kind of hopping along. As the window frame gets bigger it moves down. My last row, I can see more. My window frame is getting bigger. Yes sir. If there's no order by, the whole thing gets... If there's no order by, then this is all going to be fives. Or if I go into rows mode. Right? And I do embedded proceeding and I'm about to follow when I get to fives. Okay? So again, if you're using last value, keep in mind it's using frames. You may want to specify this kind of thing. Because you probably don't want this for last value. You might but just be aware of what you're asking for. This is why I'm saying don't use the default. And write what you want. Because the default can trip you up so much. Here's another window function. Ends value. It gives you the end value in the set. So these operate on frames. Right? So the first two rows don't see a third row in their frame. Only at the third row does it see a third row in the frame. Same thing over here. It doesn't see a seventh row until row seven. Because I'm using the default. If I do that, range between embedded proceeding and current row, I get this. This is my default. I get the same thing. Right? But now if I go rows between embedded proceeding and embedded following, bam! I see the third row right away. I see the seventh row right away. Be aware of what you're asking for. Every time you use a function. Is it using frames? Is it not using frames? Is it using frames? What's my window frame definition? Rank and dense rank. These are cool. Not so much in this slide. What it's trying to do is rank my values. But you know, without an order by, there is no rank. Even though it's operating on a partition, it kind of doesn't know what the order is, so it's kind of useless. And even if I specify the frame, it's operating on partition so it doesn't matter. It doesn't make any difference. If I go to rows mode, it still doesn't make any difference. But when I add an order by, now this starts to make sense. What it's trying to do is to take my values and rank them. Now, there's two types of rank. If two people cross the finish line first, at the same time, they're both first. What's the next person's rank if it goes across the finish line? Or two. Right. Both answers are right. They're three if you consider that the two people who went first really take the first and second spot. And the guy's third. He literally is third, right, because I had two before him. Or I have or I might want to not have a gap and like I want to have that person second. So rank will basically leave a gap. These two people finish first because they got the first thing. Then third and fifth and sixth. This one, I got two first place people. I have two second place people. I have two third place people. That's dense rank. That's rank and that's rank. Percent rank and cumulative distribution, these are trying to give me a sense of what percentage of people are less than me or what percentage of people are equal or less than me. The gory details are down here. Effectively, one of the odd things is that with percent rank, it excludes yourself. And there's some reasons for that, particularly because if you have one person highest, there's really no one less than him. So he's really greater than 100% of everybody, but if you count him or her, that's not really 100%. So there's all the goofy things going on. Cumulative distribution is a little easier because you include yourself. So therefore 100% of the people are equal or less than me. 80% of the people are equal or less than me all the way down. I put a blog entry about this. In fact, I don't really like these. I'd love to have a function that they considered everyone less than me and half of the people equal to me. If I get an SAT score, I want to know how many are less than me and then let's assume half of them are less than me. That would be a better number, but nobody's ever...there's no name for that and we don't support it. I guess somebody could write a version of that, but that's just, again, I have a blog entry about it. Finally, for entile, I'm trying to break up into three parts. So here's my first part, here's my second part, and here's my third part. I'm trying to chunk my groups together. Again, based on the order by. If I add a partition, again, same partition we used before, what I get is here these people are first, these people are third within that partition, then the rank starts over again. It's a great illustration of the way partitions work. In the new partition, these people are first, these people are third or second. Partitions, you never cross partitions with these window frames. Here's another example with percentile rank, human distribution, and entile, again, using partitions, again, starts off in zero for the second partition. Okay, any questions? Alright, so now we're getting to the meat. We've gone through all that sand and jellyfish and everything, and now we're going to actually look at an actual example, and we're going to apply window functions to this example. Okay, I'm going to show you analytics, and I'll show you why you do it this way. And again, the idea is you're not going to be able, you're not going to be copying a window function of someone else. You actually, by the end of this, are going to be able to use and write your own window functions knowing what all these things mean, knowing what all the faults mean, knowing whether your functions operate on frames or partitions and what order by does and all that other stuff, so you're not going to get surprised. Okay, so we have an employee table, and we have one, two, three, four, seven employees. Okay, again, they have an apartment, and they have a salary. It's a very traditional example here for employees. And we do a select on the employee table, and we effectively have seven employees in different departments. Yeah, yeah, yeah, yeah. And we do an aggregate. Now, this is not a window function, we're just going to do a traditional aggregate. So, count how many employees do we have, seven, total salary, this, average salary, no window functions here. This is just straight aggregate. I can do a group buy, I can get the number of people in each department, and I can get the sum in average for each department. Wonderful. But again, not a window function. I've lost the distinctiveness of my employees, they're now grouped together into a slush of one line per department. And we talked about the idea of window functions giving us the ability to maintain distinctiveness for these rows. That's what we're going to show you today. Here's another one, roll up. This is still not a window function, but it does give me the option of getting totals by department, and then a final line which has the roll up of all of the rows together. Still not a window function. So, here's our employees again, ordered by salary. And here's our first window function. We have the names, we have the salaries, and then we have the sum. The sum with the notorious open close default number there with the sum. And remember we used to have 55 all the way over here, now we've got 40,200 all over there. Right? No, it doesn't look very useful does it? Right? We maintain 7 distinct employees and now we've got the same salary on all of them. What good is that? We can actually make some use of that. Here's an example. We can take that window function and we can divide the person's salary by the window function and we can multiply it by 100 and round it to do decimal points. And now I can see the percentage of all salary for each employee. Isn't that cool? Right? Instead of showing 55, yeah. Okay? But instead of showing 55, now we can actually use that number and get a percentage. Here's an example where we actually take, we actually do cumulative totals. Now, it's not very useful in this case, but what we're doing is we're using, ordered by, unbounded proceeding to current role, and because we're in rose mode, we're effectively moving down and it's getting higher each time. Now, this is probably useless, but if you did have like, cumulative totals for a time series data, like how much rain this month and each day you want the total for the month, that would be a good example. Okay? Keep in mind I am using rose mode here. When I originally wrote this query for this talk, I used it to false. Why is that wrong? Bingo. Andy and Sandy have the same salary. Therefore, they would have the same sum. That is probably not what I wanted. Somebody pointed that out during a talk. Yes. Yes. I had to apologize and I fixed it and I said it was up in the next hour and I got bitten. Because I was so focused on showing you examples that I didn't do what I told you about showing the default. Very interesting. Learn some. Here's another one. This is a window function showing the average of each employee. Useful, but again it's the average on the same thing. What good is that? Well, as you might imagine, I can actually subtract. The person's salary from the average and everyone in red is above average in salary and everyone in blue, unfortunately, is below average in salary. It's easier to do this than trying to do it in my application. What about this? Instead of comparing the salary to the average, I want to compare it to the next salary. The next salary forward. Because I'm more earned by salary descending. How much more is Mike making in the James? How much more is James the Betty? How much more Betty than Sandy? How much more Sandy than Andy? How much more Andy than Tracy? How much more Tracy than Kelly? You get the idea. Instead of comparing it to the average, I'm comparing it to the next person who has the next high of salary. Shouldn't Sandy and Andy show the same thing? You don't get it. These operate on partitions, so they have no peers. It doesn't matter. They would always have different. With lag and lead, you're literally always accessing the next value. Not the next peer group now. You could do that maybe with partition by or something? I'm not sure. Maybe that would be a way going at it. I don't know what these numbers are. I don't know if they're monthly or yearly. I just picked something that rounded well. They come to nice numbers. Now this is interesting. This is kind of going a little meta analysis. We're comparing the person's salary. How much more that person is making the least salary employee. We're making it how a percentage they are more than the least salary employee. You sometimes see this type of analysis for corporate environments. Here, this person is making 25 more than the last person, 54% more than whatever. You get the idea. I have used a window clause because I needed last value which is a frame clause to see this bottom value. I can do rank and dense rank. Here's the rank of the employees. You'll notice that Andy and Sandy make the same amount of money. Tracy is either 6th or 5th within that set. Ready to do rank and dense rank. That makes sense. I can do departmental average. I can compute the person, their salary, the average of the department and how much more or less that person is making than the departmental average. I'm doing that by creating a partition by department here and then I'm computing the average within that department but I'm computing their salary minus the average. I don't know what that question is. I can't understand what you're asking. You could do it in the where clause here. These were clauses aren't really affected by the where clause. They're sort of independent inside of there. The question is could you exclude somebody and still have them in these calculus? Here instead of saying partition by department to partition by department I can create a window called D and then specify D in these cases. I know we're going to need those extra five minutes so just hold on here. We are very close to being done. Instead of comparing the person to the department average we can compare the person to the next salary within that department. Not compared to the average of the department but the next salary. Mike is 800 more than Betty, Sandy 800 more than Carol but again James 1200 more than Andy, Andy 800 more than Tracy so you get an idea of how they're breaking up within their departments. Comparing to the average department compared to the next high salary department different numbers. This one is really interesting. I can find the person's rank within their department using S and also the rank within the company. Here is the rank within their department. Here is the rank within the company. A good example of using different window clauses in the same query. Any questions? I just want to end with an overview of things you need to remember to kind of gel it in your head and we can all head out. First, when you're setting up a window function do you want to split the data? You're going to need a partition. Do you want an order to the partitions to the value in the partitions? How do you want to handle rows with the same order by value? Range mode or rows mode? Rank or dense rank? A lot of times people writing window functions get caught up and don't test for duplicate values and duplicate values show up to get the wrong results and you don't want somebody waving a sheet of paper in front of you saying this number is wrong and then you have to figure out why and you've forgotten something that has happened to me. Do you need to define a window frame? Maybe you're going to use it in mobile places. Window functions can define their own partitions in ordering in window frames. You can have mobile different window frames in ordering partitions in the same query. They're independent. Mobile window names can be defined in the window clause. You can define it multiple times and then pay attention to whether a window function operates on a frame or a partition. The documentation is going to tell you that. So this is the slide I promised. It basically shows you that some functions operate on frames some operate on partitions. For the frame example some of them are generic variants like some in average. Some of them frame operate on row access, first value, last value and next value again operate on frames. These functions all operate on partitions. Again some of them are related to row access and some of them are related to ranking. Again feel free to get this slide next time you have to roll up your sleeves because this is a nice handy thing. In terms of what we have implemented in Postgres 10 and Postgres 11 we are adding some things to Postgres 11. Effectively we are allowing a range to specify peer groups who values are plus or minus the specified proceeding or following offset. What that effectively means is you can go backward or forward based on how many peer groups are before you. If you looked at my example every time we did following and proceeding we were in rows mode. If you do following or proceeding in range mode that's what the behavior you get. And you'll get an error in Postgres 10 and Postgres 11 that'll work. Also we now have a groups cause which didn't exist before in addition to range and rows and that allows me to specify the number of peer groups plus or minus the current peer group. This one is plus or minus the specified offset in terms of a literal value like a numeric value plus or minus 10. This one how many peer groups are before by current row. And finally we have a frame exclusion cause which you can now use after your partition frame which allows you to exclude things like the current row exclude your current peer group or exclude other peer groups to sort of filter out different results. So that is it. Again I apologize for going a little over but I wanted to give you a good 20 minute break. I also didn't want to cut off any of your questions. So I think we've had a good discussion. It really hasn't worked well in terms of allowing you to understand it and hopefully you're going to go home and make better use of your SQL engine to do analytics to do imperative kind of operations with common table expressions with a lot more confidence. So again these slides are on my website. Feel free to download them any time or take a look at what else you have and I hope you enjoy the rest of the conference.