 Check one two all righty shall we get started? Yes, no, maybe Maybe okay, let's get started My name is David Federer I Work for an outfit called postgresql experts that concludes my marketing spiel for today I would like to thank very much the FOSDEM organization for inviting me over here and for Very kindly paying for my air ticket. That was that was really above and beyond the call And I'm going to talk about some things which until now Weren't really easy to do in SQL databases or at least not in the free SQL databases that anybody had any experience with Okay, so that's a lot of qualifiers. Is anybody using firebird? Okay, well So there there there is one free database that actually has this stuff, but I'm going to talk about it in postgres so one of the things that you find yourself needing to produce is These yes Somebody might notice that You don't normally have to produce any of these in production, but it's nice to know that you can And yeah, it's a little bright in here, but that's a yeah Okay, is there a way to dim it a little bit or no not so much I don't want to put anybody to sleep anyhow So the things that make these that make it possible to do both of these things with your SQL engine are windowing functions and Yeah, I don't see a light dimmer here Yeah, the the seat ejector is the one I'm scared to touch it's I Don't know which seat it's going to eject and whether it's going to open the roof first or you know Yeah No, you know if I touch one of these things I'm going to destroy it I better not Okay, so anyway use your imagination a little bit that there is a mantle brought set So windowing functions are things that let you deal with lists such as the kind you would of lists You would use when you're generating one of those TPS reports Recursion is something you would use to deal with tree like structures Has anybody ever tried to deal with the tree like structure in a database system and enjoyed it? Okay, well, we're we're gonna bring the joy Today is today we bring the joy to that to that process So what the what windowing functions let you do that? That's the list part They let you see outside of the current row in your result set which Let's you make those TPS reports better. So for example, you could have a running sum and a running sum is a kind of handy thing to have in a reporting system and in you know, maybe a display on a on a On a point of sale system, let's say So windowing functions operate on a window and a window is In general a subset of a query it may not or of a result set It may not be a proper subset So it may contain the entire result set, but it's some chunk of a result set which you've carved off It returns a value for each row in the window And then it calculates the value from the rows inside the window Windowing functions let you use The new windowing functions which all some of which I'll show you as we go along and Existing aggregate functions like sum and count and yes Okay by the way interrupt anytime and and if I'm If I'm not paying attention just like scream or throw things that that usually gets my Okay, you can also make your own user to find windowing functions although those you need to write in C right now and We don't really make any guarantees about the stability of the API So you're kind of you're kind of off the map there if you go to write your own windowing functions But you can do it And of course you can use any user to find aggregate functions which has anybody you made a user to find Aggregate function for both. Oh great That's that's more than usually say so What is yours do it doesn't remember? the So your aggregate function would ball up a Set into or a bag into a set that would unify a list. Oh, that's neat So anyway, you could use it in the windowing context Right away just because of how windowing functions work So basically There we go Here's how aggregates work. They sort of ball They crunch together things from a larger set down to a smaller one With windowing functions you get sort of more of this effect you you get a kind of cross mapping thing where you Where you get an effect more like this so One of the first things you can do or the easiest things to demonstrate with windowing functions why they're helpful is Numbering rows in output now. I know Oracle has this thing called row number and that's kind of handy It's I don't think it's in it's done in the standard way, but you know Oracle has has got its own standard, but has anybody tried to do this without with inside Postgres before Get row numbers. Yeah, not not fun. Is it no So we're gonna add some fun, but here's here's how we used to do row numbers So you have the classical employee table and you join it to the No, this is amp salary So basically what you want to do is is take the employee number a department name and salary count stars row number from Imp salary e1 join it salary e2. So you're gonna cross or you're gonna join it to itself and then you're gonna it's sort of a cross-join so the the So you get sort of the upper part of the triangle of the cross-join and then you group by this and that the other thing and So is this is this query gonna run really fast? Anybody Fast yeah, it's quadratic. It's gonna it's gonna get a nested loop join right It's it's gonna be it's gonna be super unfun for for performance and We got lucky right We noticed it's obviously wrong If you're doing hacks like this in order to get windowing kind of behavior it may not be as obviously wrong as this is and When it is unobviously wrong the people who remind you that it was wrong could be forensic accountants for example and By the time a forensic accountant is is reminding you of anything You can be in real trouble So you don't want you don't want it to you don't want to go there This is why the SQL standard has these windowing functions, which allow you to do this Instead of join instead of instead of cross joining the table to itself You just select from the table and then you have this construct here where you have a windowing function in this case It's called row number and then it says and then over and here's where you you can define a window You can also define it elsewhere. We'll go over that later You say over order by salary descending nulls last Okay, so you have an ordering in there and that's that that's how the row numbers are going to come out So if If two employees have the same salary What's going to happen here anybody First okay. Yeah, the the answer was first one gets first but since we haven't defined an ordering here It's unspecified and you will not be able to reproduce the result Or you won't be able to guarantee that you can reproduce the result so when you're defining your windows You need to think carefully about the ordering that's in there and is it sufficient to make a deterministic Output or you know you could decide that you don't care about Determinism at some level of the output, but you have to at least sort of think that through so that's what our that's what our query looks like and Of course our our row numbers have come out right In addition to row number, which unconditionally increments as rows come out You can also have rank which Which looks for ties and then and then orders by those so if if your numbers are or let's say you have AC ccf so what you get there is a rank like one Four four four five. That's what rank would look like corresponding to a cccf And dense rank you'd see one two two two three in other words dense rank sort of collapses the ranks that you'd get Questions so far comments Anybody still awake? Okay, yes No, you'd get one two two two three with a dense rank I think Okay, my mistake. Thank you It actually piles the the the ranks towards the top instead of the bottom Thanks for thanks for noticing that so here are the Built-in windowing functions that we that that got added in 8.4 Row number ranked that's right. Well, you can read all those here if you really want to I'll I'll demonstrate some of these and and how they work So we've seen row number just bumps it unconditionally over the window This can be handy for you know numbering results Rank we've seen So there's a gap So I guess they're yeah, maybe I was mistaken about being mistaken here because Well because we have you have two here. Yeah, I guess we do so we're Yeah, there's there's there can be gaps in between them and I guess it piles it towards the top instead of the bottom Dense rank just Closes all those gaps in your ranking Percent rank Scales the you know normalizes it not in the sense of database normalization, but like in the same sense of a vector So it it squeezes it all into the interval from zero to one That's percent rank Which is the regular rank Normalized to that interval You can also look at cumulative distribution, which is any statistics geeks in here Okay, well good. Well, then you know more about this than I do. I hope it's useful Entile so, you know quintiles or quartiles or Three aisles I forget That sort of divides things into buckets and says which bucket it's in Let's see lag this one's kind of handy It basically returns the value of the row above or you can Parametrize it to say lag by how much or how many rows So the first one doesn't really have a row above so it's lag is Null similarly Lead takes the low row below and of course the last one in the window doesn't really have a row below So we have to call that null too First value takes the yes First value takes the first one in the frame or in the yeah in the frame Similarly last value You have to add some extra Frippery here, which will be on the slides, which I will Publish as soon as we're done here But basically it takes the last value in the frame and value I guess the This was the SQL standards committee's version of humor Because I have not seen anywhere yet where You'd actually need the nth value but has anybody else seen one no No, okay. Well for completeness. I present it here, but it just Yeah, okay, so So when doing can affect aggregates if you meet if you want it to One way you can so here's here's where it's not affecting ag grates We're selecting value in the sum Over the empty window, which basically means we're not specifying anything about order or partitions or anything like that We're just saying some and so it just acts like the sum as it usually did Except that it's an except that instead of bunching it all together. It's repeating the sum at each row There's that fairly Okay But the sort of more interesting thing that you can do with With this running with this windowing function over aggregates is that you can change aggregate behavior so What we've said is value sum of value over Ordered by value descending from table so What's happening at these first two rows is that since you have not Distinguished one from the other they can't really be separated one from the other So in the first two rows your sum is actually the sum of the first two rows because they tie that fairly Okay, so again you again it emphasizes that you have to be You have to at least think about what the ordering is going to do to your result set And how how much you specify so if I'd said Order by value and maybe something else which distinguished the rows one from the other then you would see 510 Etc. Okay, so yes The question was the other ones which only work on a partition lag and lead Why is that well You can partition that you can partition your result set into different slices and when you say You don't actually need You don't actually need to say put you don't need to have a partition by clause in your windowing function But if you do have it lag and lead will only refer to the stuff inside the partition and not the whole window Right, so if you've said partitioned by ID or partitioned by last name It's going to to do the lag and lead only in the context of the partition that you have sliced off Does that answer your question? Right, so if so if there's no partition if there's no partition by specified then the frame is the window and You're done Thanks Okay, so that's that's enough about the TPS reports. I don't want to put you all to sleep too too much Well, you know for it was a large night for some of you last night I imagine So I'd like to talk about something else that's that's really new and different in SQL and that's recursion Everybody read that is that legible back there? Okay, so we have a Google result set and it I'm looking for recursion and it says did you mean recursion? Yeah, okay So recursion is fun and and and you know it it can actually come up in real context I'm going to show you how we generated that little Diagram earlier So we introduced a few little chunks of syntax and Here's the first one. So we say with recursive X of I okay, so what we're doing here is we're saying Here we are going to make ourselves a one or more Temporary views or subroutines, whichever way you want to look at it Which are good for the context of this query and then we'll vanish So we say with and then recursive is an optional keyword and that talks about whether the views or or Subroutines can refer to themselves and Then we define the name of the the view and the names of its columns Optionally, so you don't have to name the columns. You do have to name the view So you've named the view and then we're going to see what we're going to Define what happens inside the view. So the first thing we do is Get us a zero value up here Sorry, anybody got a pointer So that's the first step in the recursion is that when you recurs you have an initial condition and then you have some sort of a Recurrence relation and then if you're smart, you'll also have a termination condition Just in case Yeah, so you say in the recursive view you say initial condition Union or union all What would be the difference between union and union all generally? Anybody Yeah Yeah, the answer was unique strips strips out duplicates and that's that's what's happening So union all would just pile everything together ignoring duplicates Union would strip them out and will cause a performance You know change because it has to then do that stripping So union all is usually faster and if you know as in this case We do that the results are already unique and that's how we wanted them anyway. We can just say union all so You start with this temporary view which just has One row in it and that row is a zero And then we say select I plus one from x that's our Recurrence relation Where I less than a hundred and one And that means if we we keep going Except that we have to check that I is less than a hundred and one What happens if I leave off this where clause? Yes, it continues counting and Well It continues counting until you overflow integer. I think Yeah, so be careful when you're doing this so what I've done here is I've generated some number of points in the x-axis I suppose How many of them? It's it's it's zero to a hundred and one because it can actually get to a hundred and one before it stops Right, so it's a hundred and two points It's really easy to get off by one, isn't it? I mean it's not just see where you can do that It's all kinds of places. Okay, so we have this set of points Now we're going to Make some more points which are looking more Cartesian as we go along So we select I x I y x float y float from Let's see Do a little affine transform here on both of the result sets and cross join them with themselves and Has anybody done here here done a cross join on purpose before? Good for you How about by accident Yeah, it's really easy to do a cross join by accident You just say from table one comma table two and you forget to put in a where clause and then You get a Cartesian product Okay, so we've we've picked from this sort of affine transform of a hundred and or ten thousand four hundred and No, anyway some was around ten thousand points And then we're going to do this recurrence relation So we have our our little union all here that signals that were we're in a recursion And then we're going to sort of you know do this little quadratic mapping thing and and bound it and 27 hmm wonder what that could be well. We'll find out I guess So we're So that's our next sub routine as we've we've taken the set of a hundred points, and then we've somehow transformed it into this block of 101 by 101 and then we've iterated over the block for some number of iterations So once we've done that we have this giant result set that we need to slim down because you know We don't want to be in 127 dimensions. We want to be in two and the way we do that is we Project this into the Cartesian plane by just choosing I X I Y and the max I For so we just sort of collapse those results into the biggest one we found From Z group by I X I Y order by I X I Y Okay, so we've now got a set of a hundred and one by a hundred and one Points and we want to be able to display them. Well in the grand tradition of of Aschimatic We're going to display them by indexing into a into a string right, so we're gonna You know things that are things that have one are going to map to a space and to map to a dot all the way out to 27 which maps out to a space or 26 I think Okay, and we say greatest of I comma one which is to say that if we happen to have Yeah, if we happen to have an I that was Zero or negative we just call that I a one Just in case because we already choked off I to be less than 27 Right, we just want to make sure it fits in the array bounds because otherwise we could have this We could have some sort of a stack overflow kind of attack and Who knows what could happen? Right, so we mapped it We've we've grouped it into an aggregation and then collapse the aggregates into a string and Voila Okay So that was actually a fairly easy problem as math goes How about let's try a somewhat harder problem like maybe yes, yes Yes, so the Right, so the question was does the does the Rowset actually materialized into memory And I believe the answer is yes Well, you know, nothing comes for free and It's unfortunate, but there it is okay, so back to our slightly harder problem, we're gonna try something that's NP hard Right, so we have a traveling salesman problem This here is a reminder To make sure that before you start embarking on an NP hard problem that you actually have to solve That problem Everybody read this But assuming that you do have to solve the problem you have to set up data structures and you have to start You know doing your computation on those data structures, so that's what we'll do I'll have a very simplified schema It'll have a table it's called pairs You got a from city a to city a distance in between we're gonna say that From city and to city defines uniqueness on this table and then we're gonna check From city less than to city Why do we do that? Well, one answer was from the city to itself, but that's yeah, that's part of it You just want what you several people said you just want to go either Here to there or there to here, but you don't want to have different what could be different distances So we're assuming that the distance between Say bar in Bologna is the same as the distance between Bologna and body And so we just limit it to only only representing one of those Okay, so we have our table Let's insert a little data in there bunch of I Forget where I got this but it was important. Sorry Yeah, well baby Okay, so now that we have our data structure, and then it's populated We need to do some computation on it and by the way the computation we're doing is sort of a cheating version of brute force and I'm cheating because I want to make sure that well that that the Program terminates in a reasonable time, and I'll show you how I cheated in a little bit Okay, so Having made sure that we can't actually have The paths in both directions stored in our table We now proceed to double the size of the table so that we get that all the paths in all directions That Yeah, okay So that's what we're gonna do here is we say from city to city and distance we said with recursive And there's a union all down there, but it's not actually a recursion We just said with recursive means that we can use recursion somewhere in the common table expressions or temporary views or subroutines That we are about to create So We select the distances in one direction from city to city distance and then we Hello select them in exactly the opposite direction to city from city distance So we just doubled the size of the table We want all of those We want all of those edges for our graph right the One way in the other way Okay, so we've got that data set and then we Try a little initializing a path so We're starting out from Rome and That's going to be our our from city We say to city and that's all the possible places you can get directly to from Rome The distance in between and then we're going to just put Rome in its own Postgres array, which I think is fairly standard. I'm not sure I haven't checked with the standard on how arrays are handled But it's a it's a very handy thing to have around And we pick that from the the both ways Results at which we've just created By the way, any Anybody want to guess why I we had to pick this as a starting city? Thank you little geek humor Sorry about that Okay, so now that we have a Now that we have a the the paths that the the Paths that go from Rome to suck to one other city we need to keep going until we get back So we've we've got those going outward and then we want to Choose more so we starting from the the endpoints of our previous Our our Previous hello I'll just drop right here. I hope everybody can read it. So we have Rome and We just did this To each of the cities which are directly accessible by one hop from Rome So then we're going to start from each of these and go on to other cities and so forth and so on Everybody is that fairly clear on that more explanation less? Faster slower Okay So We joined back to that previous result set that we just created and We join on head to tail sort of For the For the edges right the the tail of the next edge is the head of the last edge then We make sure that The from city is not in the path Except for the first element right so we want the Well, okay, so why would we do that? Sorry? We do want to get back, but only once So we don't want to have extra loops in there our our Salesman is supposed to travel in as short a Time as possible and looping infinitely It's probably not it So so we only want the one cycle and that's why we have left off the The first element in the array, which is where we're trying to get back to I also Remember the part where I said I'd cheated about this Can everybody read this okay? Says array upper of p dot path comma one that means the first dimension in the array its Upper bound is six so basically We're saying we got to get back there in six hops and if we don't get back there in six hops It's probably not our shortest one anyway Why would we say only some Specified number of hops Anybody nope Good the quest the answer was total number of cities actually what it is is we're trying to prevent the sun from burning out while this query runs And Well if it's less than the total number of cities we would find it we would hope to find a solution in Some smaller thing then start at Rome go to every other city in Italy and come back to Rome Which is the which is what you do if you're that that would be the longest path basically You actually want the shortest path that hits the cities described which I had to describe below because of some limitations in our Implementation so we hit we said start at Rome hit these three cities and come back That's that's what the traveling salesman problem Is all about So I'm living I'm saying that I have to hit three cities, and I hope that In the process of hitting those three cities. I only hit at most six okay, so This is just to prevent the sun from burning out and me from getting bored and you from getting bored While we're doing this exhaustive search So it's not actually O of n factorial which is pretty good because there was like 60 cities in there and 60 factorials kind of big and long and boring Okay, so now that we've got these paths We'd like to Find the final one where so we've got all the paths that have Rome as a starting point and are less than Six long And don't have Rome in them so we have to finish this off and say well of those Some of the paths will be one hop away from Rome again, and those are the ones we want So we're picking All the paths with Milan Florence and Naples in the path so those are the cities we wanted to visit I still haven't gotten to Milan and Naples. I hope to someday Anyway, so we want to order by distance because the idea of the traveling salesman problem is that you is That you want the minimal distance so we're saying distance ascending and then Order by path also. Why would we why would we do that? One answer was get the shortest pathway It's actually we already took care of that by distance any other Guesses why we would order by path along with distance The answer was eliminate duplicates and that's right because if you go If you go one way through a path and then go back the other way through the same path They will have the same distance according to our model so that's that that's one of them and Another one is if if it happens that two two completely distinct paths or at least two Partially distinct paths have the same distance. We just want the same answer coming out Each time we run the query Because if we have a tie from two different paths, we just want we just want to trim it down to one Okay, so that's our That's our traveling salesman problem solved in 12 seconds, which isn't too bad for brute force Well, I was gonna go through Some things about posting on a forum, but apparently I have taken too much time so right now I would like to I'll post this later and I would like to oh, yes by the way SQL is churning complete. So it just be really careful about what you let people execute SQL on and I would like to Open up the floor for a brief questions comments and of course the straight jacket. I've just earned Questions comments. Alrighty. Thank you so much