I'm going to say this one last time before we get started: we are going through all of Postgres's many, many scores of configuration settings, which means there will be a lot of text on screen, which means you want to follow along in one of two ways. One, you can move to the front of the room. Two, all of this material is in a Git repo, which you will want anyway so you can take it home with you, and you can download that. My annotations, comments, and recommendations are available as a Postgres database dump, in either pg_dump format or, for people using versions earlier than Postgres 10, plain SQL format. It's also available as a massive CSV file that you can load up in your favorite spreadsheet. And then my recommended configuration files, which we will be going over in this tutorial, are in that repo too. So if you download the repo, you will be safe from the blurry projector. That is, again, github.com/jberkus/annotated.conf, and we're about to move on to the next slide, but I'll bring up the URL again.

So welcome, everybody. Welcome to GUCs: A Three-Hour Tour. I'm your skipper. We're going for a three-hour tour around the bay, and we'll hopefully not end up stranded on a tropical island for 11 years. But we'll see. My name's Josh Berkus. I work for Red Hat, and I work on Kubernetes these days. But for many years, somewhere between six and 14 depending on how you count it, I tuned PostgreSQL for a living, which was a pretty good living. And I'm actually very happy to say it is less of a good living now, because more and more of Postgres's settings and out-of-the-box performance are becoming what you would want out of the box. That was a very happy thing to have happened, even if it meant I got to move on to a different job.

But here's where we're going: just because a lot of the defaults are better doesn't mean you can leave the configuration alone entirely, although a lot of people do just that. And in case you are interested in tinkering with PostgreSQL's configuration, or at least knowing what's involved, you're in the right place. So we're going to talk about the docs and understanding them; I'll talk a little bit about what's changed recently; go over how you manage PostgreSQL configurations; then show you a simple, basic configuration; and then we will go through as many of the configuration settings as we have time for, starting with the 20 or so that are the most frequently touched in my experience (or that should be the most frequently touched), following with the next group of 20, and then the whole rest of them if we have time.

There is a whole bunch of stuff related to PostgreSQL configuration beyond postgresql.conf and its settings: pg_hba.conf for host-based access, recovery.conf, pg_ident.conf, settings for the libpq client, psql SET commands, and system functions. We are not doing any of that, because we have plenty to do with the general configuration settings all by themselves.

So, GUCs. Who knew, before they walked into the room, that it was supposed to be pronounced "gucks"? Because first of all, one of the things I love is that people write the acronym as GUC with a lowercase s.
The s is actually part of the acronym: it stands for Grand Unified Configuration settings. Back in the Postgres 6.5 days, Peter Eisentraut took the five different configuration files we had scattered around the place and consolidated them all into a single text file and a single C file, which was thereafter called the GUC (the grand unified) file. And this is where you set all of the settings that control whole-database-server behavior. Again, excepting some of the things we're not going to cover, like access control, which is handled through pg_hba.conf, and part of your replication configuration, which is still in a separate file, although hopefully in Postgres 11 that will stop being a separate file.

So I will say two things that might surprise you. Number one: tuning the configuration file is seldom your best way to improve performance in the database. You will get far more performance bang for the troubleshooting hour by fixing your queries and fixing your application's interaction with the database, because the biggest speedup you can give a particular query is to eliminate it entirely. So this is actually not your first round. Honestly, this is what you do next, after you have already been through your application and done a reasonable job of tuning its behavior. And if you're looking at this purely from a performance perspective: not all of the settings we're going over here are for performance. Some are there for security, some for reliability, some for logging, and those you're probably going to need to change anyway. But really what we have is 269 (as of Postgres 10) switches and levers for controlling database behavior. That's a little alarming to most people, but it doesn't need to be alarming. We make it more alarming than it needs to be.

And here's the second thing, which might be a violation of conventional wisdom. When you install PostgreSQL, depending on how you install it, you are generally given a massive postgresql.conf file that is, I don't know, 40 or 50 pages long. First thing to do: delete that file. It contains every configuration setting there is. You are highly unlikely to ever touch, like, 200 of those settings, and as a result, having a file with 200 settings commented out, in order to get at the 10 or 20 or 30 settings you actually do anything with, is just adding a mental burden you don't need. You do need to save a handful of lines from that file: the things that get set at initdb time based on calculations, or based on, say, file locations. So save anything that's not commented out in that file, delete the rest, and start over with a fresh file. And I'm going to give you an example of a fresh file based on the things most people need to adjust. Don't worry about writing it down now; we will go over it later in detail, and if you downloaded the GitHub repo, or if you do so in the future, you will have it in text file format.

But first, let's talk about what GUCs are, how they work, and how you set them. Six, seven, eight, who do we appreciate? GUCs, GUCs, GUCs! That cheer works better in Chicago, where they have the hockey team. So, six, seven, eight: what do we have here? Well, there are six of what are known as contexts. This is a unified configuration system, so it's not configuration for one particular thing or one particular set of behaviors, but really for everything.
And what that means is that you can't necessarily change every setting at any given time. These contexts indicate when you are allowed to change a given setting.

The vast majority of the settings have what is known as session-level context, which could also be called user context. These are settings that can be changed at any time, including in a user's individual session. The obvious example is timezone: a user can say, hey, I want to change my time zone to Malawi because I want to see things in that time zone, and they can do that at any time, in the middle of a session. So most settings are actually session settings.

A step up from there in restrictiveness are what are known as superuser settings. These can be changed at runtime in a psql session, but only by a superuser, somebody with superuser permissions on the database. Superuser is a specific thing, as in ALTER ROLE josh WITH SUPERUSER; the postgres user, in most cases, is the original superuser for the database installation.

The next one is the reload context, which actually shows up in pg_settings as sighup. That goes back to Linux (or UNIX) process management, where you have to send a signal to the backend processes to tell them to reread the configuration file. So what this says is: you change the setting in a configuration file, and then you signal all the Postgres backends to reread that file. Now, in the old days, the way we used to do that was to use a ps command to find the backends and send them the HUP signal. A much better way is to log into the database as the postgres user and run pg_reload_conf(), which is easier to remember and a lot easier to do over a remote connection, to tell it to reload its configuration file. It is important to know that we are signaling backends, so the various existing Postgres connections will not all see the configuration change simultaneously. They're each waiting to receive a message, and if some of them are super busy with a workload, they won't see it for a while.

The next one up in restrictiveness is restart: settings that can only be changed with a full restart of the database server. An example is static memory allocations. You're telling Postgres how much dedicated memory it has, how much memory it has for the autovacuum daemon, other things that are allocated statically, and those require restarting the Postgres server, which obviously means a downtime for your application. So you want to limit changes to those as much as possible, for that reason.

So that's four of our six contexts. And here's the reloading thing again. Like I said, the old way is to use kill to send a signal on the command line. It still works; fine, we accept that. But the much better ways are: from a superuser session, you run pg_reload_conf(), or if you need to do it on the command line because it's part of a script on your server, you run pg_ctl reload, which does the same thing. Those are better methods because, among other things, pg_ctl reload and pg_reload_conf() will actually fail visibly if they don't manage to signal the database server, whereas the kill command will just happily return and you won't necessarily know what happened.
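Concretely, the three reload methods look something like this (the data directory path is hypothetical; yours will differ):

    -- from a superuser SQL session; returns true if the server was signaled
    SELECT pg_reload_conf();

    # or from a shell script on the server:
    pg_ctl reload -D /var/lib/postgresql/10/data

    # the old way still works, but reports nothing useful on failure:
    kill -HUP <postmaster-pid>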
So, the two remaining contexts. One is backend. These are generally developer and debugging settings. A lot of them require certain things to have been compiled into Postgres in advance that aren't necessarily part of the default Postgres packages, like lock monitoring or suchlike. Some of them have to be set before you start Postgres; some of them only work in single-user mode, which is a mode for troubleshooting a Postgres server that won't start correctly. I'll be showing you some of those, but hopefully, well, hopefully you will either use these developer settings or you won't. The "hopefully you will" is: hopefully you will start contributing to PostgreSQL, and you'll use them for troubleshooting new patches. The "hopefully you won't" is: the other reason you use these developer settings is that you have a corrupt database and you're trying to recover it.

And then, finally, internal. The internal context is stuff that is not changeable. It's actually a setting that supplies read-only information about something, about either how Postgres was compiled or how it was installed. A good example is the transaction log segment size, which defaults to 16 megabytes. You can change it, but changing it has to be compiled into Postgres. So that's information only, but because some people actually do change that, it's information that might differ from installation to installation, and for that reason it's worth checking.

Now we're on to our seven. This is the seven different places where you can change configuration settings. Yes, seven, because in the world of Postgres we never eliminate anywhere you could tweak a knob; we just add new places.

The obvious one: the postgresql.conf file. It's what almost everybody starts with, and what a lot of people stick with pretty much forever. By the way, let me interrupt myself for just a moment before I forget to mention it: we will be taking a 15-minute break at 11 o'clock, in case you were wondering when you can get a bio break. We'll break at 11, resume at 11:15, and finish the rest of the tutorial.

So this is what everybody starts with, and what some people work with for their entire time managing Postgres: a single configuration file that controls all of Postgres. And it's just a file; you can do anything with it that you can do with other files. It can be relocated, it can be symlinked, and there are actually a couple of ways to relocate it within the configuration settings themselves. It's the primary way a lot of people change these configuration settings, particularly all of the settings you want in place at Postgres startup time, like logging configuration, memory allocation, and sensible defaults for a whole bunch of things. Now, like I said, the default file you get is that giant list of every setting there is, and my opinion is that's not helpful, because in order to change the handful of settings you actually want to change, you have to wade through literally dozens of screens of stuff you're never going to touch. So again, I'm going to show you starting with a fresh file. Now, here are a couple of important things to know about how these configuration files get loaded.
Number one: use hash marks to comment things out, and anything that's commented out doesn't take effect. And it operates on last-in-wins: if the same configuration setting appears multiple times, whichever one is closest to the bottom of the file is your actual setting. To make this even more complicated, you can add include files, so you can have postgresql.conf include another file. And then it matters which order you include those files in, because it's still last-in-wins. So you do want to keep track of that; if you ever end up troubleshooting a complicated Postgres configuration, it will probably be because of that last-in-wins problem.

Now, a step up from that, and something I recommend if you are going to use config files to manage your Postgres configuration, is to not have a single file, but instead take advantage of the feature that lets you have a configuration directory. You use the include_dir directive pointing at a directory, and then you drop multiple configuration files into that directory: one that configures, say, performance, another that configures logging, another that configures some reliability stuff, another that configures replication. Separate files. One of the reasons to do this: one of the things you may change frequently with Postgres is swapping out logging configurations, because, say, you want to do a 12-hour intensive performance-monitoring run. So you swap in a more intensive logging configuration and swap it back out at the end of the 12 hours. That is much easier, particularly through a configuration management system, if the logging configuration is a separate file. Now, these files get loaded in alphanumeric ASCII order within the directory, so keep that in mind if you're going to have files that deliberately clobber each other's settings: last-in-wins still applies. Really, if you're going to manage your Postgres configuration using Puppet or Chef or Ansible or whatever, try using the configuration directory with separate files. It will make your life easier.

We're not limited to changing configuration through configuration files, though. The first way we can do it on the SQL command line is SET and SHOW. SHOW tells us what a configuration setting is, and SET (yes, it's work_mem, not worm memory, because worms notoriously have no memory at all) lets us change it in a session. Now, this obviously only works for session-level parameters, or for superuser-level parameters if you are the superuser. But for a lot of things this makes sense, whether you're doing it interactively or in a script. With it, you can set session-level parameters like timezone. One of the things I strongly recommend is that everybody's Postgres server should be running in the UTC time zone, and if people need individual, different time zones, that should be set by the application, and you should be using timestamptz for all of your column storage of timestamps so that all of this works.
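Two quick sketches of what we just covered. First, the configuration directory (the file names are hypothetical; any scheme works as long as you remember the ASCII load order):

    # in postgresql.conf: pull in every .conf file in the directory
    include_dir 'conf.d'

    # conf.d/ files load in ASCII order, and last-in-wins on conflicts:
    #   10_memory.conf
    #   20_logging.conf      <- swap this file out for intensive logging runs
    #   30_replication.conf

And second, SET and SHOW in a session:

    SHOW timezone;
    SET timezone = 'UTC';
    SET work_mem = '64MB';   -- affects only this session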
Now, one of the problems with SET and SHOW is that they are SQL primitives, and the parameters they take are identifiers, not strings. That means if you want to set them dynamically as part of a script, you have to do annoying string concatenation rather than passing values around as proper database parameters. So, to enable that, we have an alternate version of SET and SHOW in the form of a pair of functions, set_config() and current_setting(), and those actually take a string. These are the thing to use in a script where you're going to supply the value dynamically, because then you can pass it as a proper Postgres parameter.

We have another thing that ships with PostgreSQL, and that's pg_settings, mostly the work of Joe Conway, who was actually in here earlier, talking to me before the session started. pg_settings is a giant administrative pseudo-table, as in it looks like a table but isn't actually a table. You can see we've got a whole bunch of information there about every setting: the configuration name; the current setting; its unit, if that's relevant; the category; a short description, which is generally the first line of whatever the documentation says about that setting. The extra description is supposed to be populated from the documentation, but for most settings it isn't; it's kind of a to-do. Then the context for when you can reset it, the variable type, the source, the allowed range for the setting, et cetera. You can get all of that from pg_settings. Now, what a lot of people don't know is that if you have permissions, which generally requires you to be the superuser, unfortunately, you can actually change settings by updating pg_settings as if it were a table. That calls set_config() under the hood.

There's a whole other level to set settings at that I feel people really don't make enough use of: for session-level settings, you can set them as properties of certain database objects, most frequently as properties of databases within your Postgres installation or, most usefully, of roles, as in users. The biggest reason to do this: imagine a consolidated database where you're both supporting a web application and running analytics reports against it. You might want a separate web user with a very low memory allocation, because you're expecting a whole lot of connections from the web user, and an analytics user that will have only one connection but a much larger memory allocation, so it can run big analytics reports. The best way to handle that, in terms of maintenance, is to have two separate login roles and to set the memory allocation as a property of each login role.

The other setting you particularly often do this with is a variable called search_path, which governs how you find objects in the list of Postgres schemas. As a matter of fact, managing search_path for people using multiple schemas became such a thing that a few releases ago we added a feature specifically to let you set a setting on a role in a particular database, because obviously, if you have multiple databases in your Postgres installation, the same search_path is not going to make sense in all of them, since your list of schemas will be different.
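A minimal sketch of both mechanisms (the role, database, and schema names are made up):

    -- the function forms take strings, so a script can pass values
    -- as ordinary query parameters:
    SELECT current_setting('statement_timeout');
    SELECT set_config('statement_timeout', '5min', false);  -- false = for the
                                                            -- rest of the session

    -- session-level settings attached to objects, applied at login:
    ALTER ROLE brazil_user SET timezone = 'America/Sao_Paulo';
    ALTER ROLE webapp IN DATABASE sales SET search_path = app, public;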
And there are certain settings I'll be talking about later that really should be set only this way, because they don't make a lot of sense to set globally. The other reason you would set things on a particular role is localization. If your users are geographically distributed, one of the ways you can manage what time zone and what collation people see things in, linguistically and so on, is to manage those as properties of a role. So, for example, a Brazilian user gets a different login role than a British user, even if they're in the same group.

The other place where you want to set parameters, very importantly: how many people are using Postgres stored procedures? Okay. And how many people have SECURITY DEFINER stored procedures, as in stored procedures that deliberately escalate privileges? If you are doing that, you need to set a bunch of configuration parameters attached to that stored procedure, because otherwise someone can use the SECURITY DEFINER feature as a ready-made security hole by overriding things it depends on. In particular, you absolutely need to set search_path, because that's the main attack vector: if the user is able to override, for example, the meaning of the equals sign, they can make that stored procedure do pretty much anything they want.

And here's one that I used to say you shouldn't really be doing, but a change in how people deploy applications has changed that: you can set configuration settings on the command line when you start Postgres. You start up Postgres and set a bunch of settings right there on the command line. It's a little better to set the PGOPTIONS environment variable, which holds all of those options, and they get supplied to the Postgres command line. Now, this used to be something that pretty much only made sense for automated testing; honestly, this facility dates back to when we didn't have a configuration file. Except that now people are running Postgres in containers, and it is really easy to supply a bunch of environment variables to a container in a container orchestration environment. So this is becoming a much more attractive way to set your settings: the actual management of your settings takes place on the container orchestration master, which supplies a bunch of environment variables to the individual Postgres server as it starts up, through the PGOPTIONS variable. As a result, you can manage settings without needing to reach into the container and try to change a file. That's particularly true if you're using an orchestration manager like Swarm, which doesn't really have a concept analogous to ConfigMaps in Kubernetes. So this is a thing again, and a good option for that kind of environment.
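Here's a sketch of both of those ideas. The function, table, and option values are made up; the SET clause attached to the function is the standard guard for SECURITY DEFINER:

    -- pin search_path on a privilege-escalating function so callers
    -- can't substitute their own operators or tables:
    CREATE FUNCTION get_balance(acct integer) RETURNS numeric
        LANGUAGE sql SECURITY DEFINER
        SET search_path = pg_catalog, app
        AS $$ SELECT balance FROM app.accounts WHERE id = acct $$;

And for the command-line route, settings can be passed straight to the server process, for example from a container entrypoint:

    postgres -D /var/lib/postgresql/data -c work_mem=64MB -c timezone=UTC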
And then the final one, my favorite, because it really changed the game for how you manage Postgres settings: the ability to manage persistent settings from SQL, with ALTER SYSTEM SET. What ALTER SYSTEM SET says is (first of all, it requires superuser permissions) you log in via SQL and tell Postgres: hey, I want you to change this setting, but I also want you to write it to a file so that it persists across Postgres restarts. So you can change things globally that way. For example, I can change the default time zone for the database through ALTER SYSTEM SET. That gets written to a file in the data directory, regardless of where your other config files are located, called postgresql.auto.conf, and postgresql.auto.conf is read after the other configuration files, so its settings win.

And I actually strongly recommend this for, well, it depends. It depends on your infrastructure setup. If you are in a new-stack cloud environment (a containerized environment, an OpenStack environment, or somewhere else where Postgres is being managed on immutable-infrastructure principles), then this is a much better way to manage your settings. But it does mean you have to keep the settings under management somewhere your central orchestrator has access to, because the settings have to be under management somewhere; if you just change them dynamically on the individual Postgres instance, you'll have difficulty recreating them when you have to recreate the server. Still, this gives you a much more programmatic way to manage a lot of settings.

Now, that's seven different places you can manage your settings, and you can imagine what happens when you have a large, distributed development team and everybody has changed their settings in a different way. A couple of Postgres versions ago, there was really no choice other than sleuthing for a long time, changing settings in various places until you saw what stuck. But now there is a view called pg_file_settings, and it will show you every setting that has been changed from the default, along with where it was changed, and if the place it was changed is a file, which line of the file. That can be really important because of last-in-wins, right? You say, hey, I keep changing autovacuum_work_mem and nothing happens. Well, it turns out it was being clobbered by a line further down in the same file. And you can see here, if I look at the entries that are not from postgresql.conf: for example, the timezone I just changed is now coming from postgresql.auto.conf, a different place. So this is your way to keep these multiple locations from becoming a total nightmare.
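For instance, a minimal sketch of that workflow: persist a change, reload, then audit where everything comes from (the timezone value is just an example):

    -- persist a setting across restarts; it lands in postgresql.auto.conf
    ALTER SYSTEM SET timezone = 'UTC';
    SELECT pg_reload_conf();

    -- then audit where every non-default setting is coming from:
    SELECT name, setting, sourcefile, sourceline, applied
      FROM pg_file_settings;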
Now, we have eight basic types of values that settings can be set to, in terms of variable types. These won't exactly match what pg_settings tells you, because pg_settings represents some of them differently. The obvious and simplest one is boolean: settings that can be on or off. Then there are settings that only take whole numbers, because they represent something countable outside the database (a count of connections, for example), and those are integers. There are floats, which take real decimal numbers; those are often performance tuning settings. Then there's a whole set that are memory and disk sizes. These will usually appear in pg_settings as if they were integers, but the important thing to know is that you can attach a unit to them, which makes them much easier to understand and remember when you set them. For example, you can set work_mem in terms of megabytes, which is a lot easier than remembering what the base unit is and working the calculation out on paper. Unfortunately, if you are on RDS or some other managed system, they may only give you access to the raw values, so you still have to look up that calculation. But if you're setting it on bare-metal Postgres, it's not an issue.

Time. There are a number of variables that are set in terms of time. These, again, usually show up in pg_settings as integers, but I've called them out and annotated them as actual time settings. They take readable values: ms for milliseconds, s for seconds, min for minutes, h for hours. They don't all have the same minimum and maximum ranges, because some of them have a base unit of milliseconds and others have a base unit of minutes or hours, but you can set them using those units, and again, that's better for self-documentation.

Strings are obvious: a file path, a file name, an identifier. Then there are a number of string values that are limited to a specific list of values. So these are enums, as it were, and they actually do show up in pg_settings that way. There's a specific list of values, like wal_level, which can be one of three specific things. Oh, and hot_standby is actually no longer one of them; I forgot to update the slide. We changed that set of values a couple of releases back. And then there are lists. Lists are saved as a comma-separated string, for example search_path, which is a list of schemas, and they get parsed inside Postgres into a list of things.
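Pulling those value types together, a few hypothetical postgresql.conf lines, one per type:

    autovacuum = on                 # boolean
    max_connections = 100           # integer
    seq_page_cost = 1.0             # float
    shared_buffers = 2GB            # memory size, with a unit suffix
    statement_timeout = 30s         # time; the base unit here is ms
    wal_level = replica             # enum: minimal | replica | logical
    search_path = 'app, public'     # list, stored as a comma-separated string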
So, we've gone through the different places and how to set things. Does anybody have any quick questions before we move on to what's changed recently? Yes. Okay, so the question was: postgresql.auto.conf, is that persistent between restarts, or is it only active during the user session? The answer is: persistent between restarts. That's the whole point of having it. If you want something that only lasts for a user session, you just do a regular SET or set_config(), and that only exists in that individual user session. Not only is it not persistent, it doesn't affect other user sessions. But ALTER SYSTEM SET, writing to postgresql.auto.conf, is persistent. You can, in fact, even edit postgresql.auto.conf by hand, but make sure you get the formatting right, because if you get the formatting wrong, Postgres will refuse to start. Any other questions?

Okay, so a few things have changed. Now, I actually have not given this particular tutorial since, I think, Postgres 9.2, so I was updating it last week, and a lot of things have changed. For anybody who is somewhere in the nine series and looking at Postgres 10: how many people are running Postgres 9.3? 9.4? 9.5? 9.2? 9.0, anybody running 9.0? It's out of community support, but I think it's still supported by Red Hat. 9.1? I haven't... you kept moving your hand; what version are you running? Oh, 9.6. Oh right, I forgot we did a 9.6, didn't we? Yeah. So, if you look across the nine series, we actually eliminated 23 configuration settings, which is awesome, right? Because every configuration setting is just a deferred design decision, and eliminating 23 of them makes things easier for the user. It would be wonderful to say we ended up with 23 fewer configuration settings, but unfortunately, we eliminated 23 and then added 69 new ones.

Now, to be fair, an awful lot of those 69 new settings have to do with new features, such as configuring replication (in its three different forms), configuring parallel query, configuring some of the improved maintenance stuff, and a bunch of other things that were added to Postgres as features. But it nevertheless means here is another group of settings you might potentially need to know.

These break down into a few specific areas. One of them is a very new set of settings to configure memory-flushing behavior on a fine-grained basis, in terms of when things get flushed from memory to storage. I honestly don't have any recommendations on how to set these; they're new, and I haven't seen anything in the way of documented, tested performance papers published on them. So if you're interested in them, follow the Postgres performance mailing list.

Identification: we've added more places for you to stick an arbitrary ID for an individual Postgres instance, which is really friggin' useful if you are running multiple Postgreses on a server. And people say, who's going to run multiple Postgreses on a server? Well, a lot of us, including me, are doing container orchestration now, which means you are running multiple Postgreses on that server, because each one's in a container, and it's really useful to be able to identify them. One of the practices I have is to set these to the name of the container, so I know which Postgres is associated with which container.

We have overhauled the SSL config, because, quite bluntly, the original SSL config in Postgres was pretty bad. For one thing, it made a lot of assumptions: your SSL key files had to be present in the Postgres data directory and you couldn't move them, which was kind of a problem. Now you can stick them wherever you want to, but it does mean we have a bunch of new configuration settings for SSL. We added a bunch of timeouts and resource limits, and I'll go over those more later on, because they're actually awesome and I recommend you make use of them. And then, of course, all of the settings for the different kinds of replication, including logical replication and synchronous replication and all of the other things that have been added over the last seven versions. Another of the more recent things is WAL compression, the ability to compress the write-ahead log in order to improve I/O performance, and that comes with its own couple of configuration variables. Parallel query comes with a bunch of configuration variables, and we'll have more in the future as the team working on it parallelizes more things; every time they parallelize a new query operation, they'll give you the opportunity to turn it off.

Now, not all of this is adding to your mental burden. One of my favorite changes (this was from 9.5, or maybe 9.4) actually made things a lot simpler. One of the things that used to be on my list of settings everyone has to set was checkpoint_segments, which defined the size of your transaction logs on disk. Well, the maximum size. Well, the maximum size according to a really complicated calculation that required at least two different sets of parentheses to work correctly, which meant nobody really understood how to set it. I mean, experts did, but regular users didn't. But you had to set it, because the defaults were way too small.
So we got rid of that. Heikki added a change where you simply give Postgres two parameters, the biggest the transaction log can get (max_wal_size) and the smallest it can be (min_wal_size), and then Postgres dynamically resizes the transaction log according to the level of activity on the database. And that maximum size is just in megabytes or gigabytes. It's really simple. Plus, importantly, the defaults work for 95% of users. So here's something you always had to touch before and now don't have to touch at all. That's awesome, and I wish we had more changes like that. Any questions about new things before we start getting into the configuration settings themselves?

Again, I showed you this before: like I said, 269 settings, of which I will go over a whole bunch. But most of them you will never touch. These are the 19 or so settings (and I say "or so" because some of them are actually groups of related settings) that you are most likely to need to touch, or that I recommend you touch, to get a reasonable Postgres configuration. Fortunately, I have recommendations on how to set all of them, which we'll be going over. For anybody who checked out the Git repo, you can follow along by opening the file postgresql.10.simple.conf. It's an example simple configuration file with my recommendations for the settings you need to touch. When your installer generates its default postgresql.conf, you will need to drop in the file location settings it generates, the ones that tell Postgres where its files are, to make this a working configuration file, because those settings are required. But these are all the settings you would actually manually change, or that I am most likely to manually change. So I will go over this.

Now, there is a whole group of settings in Postgres that cannot be set automatically, because we need to set them according to how much RAM you have available on your server for Postgres. Anybody hazard a guess as to why we can't do that automatically? Why can't we just say: you've got 32 gigabytes of RAM on your server, I'm just going to calculate all of those? Any guesses? Yes. The server may not be dedicated to Postgres; Postgres may be only one of several things running on it. There's a second reason as well. Any guesses? The second reason is that determining how much memory you have available, cross-platform and cross-environment, once you include things like VMs and container environments, turns out not to be a trivial thing. Postgres is currently available on something like eight different platforms, a lot more if you count the various versions of Windows, and "how much RAM does this system have" turns out not to be an easy question to answer in a simple piece of code. So, two reasons. As a result, we require you, the user, to tinker with those settings to tell Postgres how much memory it has.

So, this recommendations file uses an abbreviation, AvRAM, which means available RAM: the amount of RAM that Postgres should have available for its use. If it's a dedicated bare-metal Postgres server, that's really simple: it's all the RAM on the machine. But if Postgres is running alongside other things, if it's in a VM or a container or something like that, this needs to be the RAM that's actually available for Postgres.
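To make that concrete, here's a hypothetical sketch of where the AvRAM-based settings land on, say, a dedicated 16GB server. The reasoning behind each ratio comes in the next sections:

    # assuming AvRAM = 16GB, dedicated server
    shared_buffers = 4GB           # ~ AvRAM / 4
    effective_cache_size = 12GB    # ~ AvRAM * 3/4
    maintenance_work_mem = 2GB     # ~ AvRAM / 8, capped at 2GB
    autovacuum_work_mem = 1GB      # ~ AvRAM / 16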
Now, before we get into those individual settings, let's start with connection settings, because obviously you're not going anywhere with Postgres if you can't connect to it. The first thing, the thing almost everybody ends up touching almost immediately, even if they never change a single other Postgres setting, is listen_addresses. If you're installing Postgres on bare metal, it installs in a secure configuration, which means Postgres only answers local connections on the local machine. That is almost certainly not what you want it to do. So in addition to altering pg_hba.conf for the access control list, you also need to go in and set listen_addresses to something else. Most frequently '*', which just says listen on all interfaces. However, if you have a machine with multiple networks available to it, and those networks are not all equally secure, then you want to give it the specific networks it should listen on. An environment where that is always the case is Amazon AWS, right? Every AWS instance has a public IP address and a VPC IP address, and you generally want Postgres listening only on the VPC address, not the public address, because if your Postgres port is on the public internet, someone will find a way to attack it.

The second thing you often need to set is max_connections: how many connections will Postgres accept before it stops answering them? Unfortunately, setting this is fairly application-workload dependent, as in you have to have an idea of what's reasonable for your application workload. A public web application using Rails, or another framework where each application backend tends to spin up a database connection whether it's going to use it or not, is going to require (and be able to make use of) a lot more Postgres connections than, say, a data-warehousing analytics application where each connection represents a data pipeline pumping through gigabytes of data. That being said, Postgres is a multi-process database, like Oracle, and as a result each connection represents a process on the host server, or a full thread on Windows. That means there's a fair amount of memory and other resources devoted to each connection. So if you find that, to satisfy your application requirements, you are raising max_connections above a few hundred, you should probably look at some kind of intermediate connection pooling, because otherwise you're going to find your Postgres server, or VM, or whatever, spending all of its resources just managing those connection processes.

SSL: turn it on. It's not turned on by default in a lot of installations, because it requires you to install a separate SSL package. But you should do that, and you should enable SSL. "Crunchy outside, soft interior" is a poor security approach, and you should really be planning on securing connections, particularly anything as important as a database connection, which possibly carries information you really don't want out in public. So you want to turn ssl on, so that we use SSL connections with Postgres. Pretty easy, except that obviously you need SSL configuration on the server, and depending on what level of auto-configuration your installer does, you may need to tell Postgres where to find the SSL certificates. In the end, you may also need to change permissions, and on Linux change SELinux settings, on those files so that Postgres can access them.
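A hypothetical connection section along those lines; the address and paths are made up, so substitute your own:

    listen_addresses = '10.0.1.15'   # e.g. the VPC address only; changing
                                     # this requires a restart
    max_connections = 200            # above a few hundred, add a pooler
    ssl = on
    ssl_cert_file = '/etc/ssl/certs/pgserver.crt'   # hypothetical paths
    ssl_key_file = '/etc/ssl/private/pgserver.key'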
So now we're getting into some of the performance configuration, the memory configuration I talked about earlier, the things you need to set. Shared buffers. shared_buffers is the amount of memory that Postgres statically allocates, via mmap, as its dedicated cache and working area. That cache is managed using a frequency algorithm, and as a result you don't actually want it to be the majority of RAM. This is one of the places where configuring Postgres is very different from configuring Oracle: Oracle has a multi-level cache, and when you're setting up an Oracle server, you set that to something like 90% of RAM. For Postgres, this needs to be a minority of RAM. And by minority, I mean about a quarter, with a limit of about eight gigabytes on high-RAM machines. If you want to get into whether that limit should really be eight gigabytes, again, I say subscribe to the pgsql-performance mailing list; there's a lot of argument back and forth about how much RAM Postgres can really make use of in this dedicated cache. But it can reliably make use of two, four, eight gigabytes; above that may be counterproductive. So set it to a minority of RAM. The default, I believe, is 128 megabytes, which is actually on the small side even for the machines you might use as Amazon instances, which tend to come in the one-to-four-gigabyte flavor these days. So you want to raise this a bit for the most efficient caching and data handling in Postgres.

The way you can tell interactively whether you should change it is by looking at Postgres's cache hit ratios, which you can get from some of the Postgres system tables: you can see how often you're hitting cache for commonly accessed tables and commonly accessed indexes, versus how often you're having to go out to the file system cache for them. But you have to use your understanding of the database when you do that, because if the database itself is 25 times the size of RAM, your cache hit ratio is going to be poor no matter what you do, and increasing shared_buffers won't actually help you there. It's when your database is a fraction of the size of RAM that the cache hit ratio should be really high, and if it's not, consider raising shared_buffers.
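As a sketch of that check, here's one common query against the statistics views (the 99% expectation is a rule of thumb for databases that fit comfortably in RAM, not gospel):

    -- per-database buffer cache hit percentage
    SELECT datname,
           round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2)
             AS cache_hit_pct
      FROM pg_stat_database
     WHERE blks_hit + blks_read > 0;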
Next setting: work memory. work_mem is how much RAM an individual operation in a session gets. It is not an allocation; it's a limit. And it is non-shared: not only is it not shared between sessions, it's not even shared between operations. If you have a complex query that does three simultaneous sorts, it could use three times whatever this limit is, which is particularly important to keep track of when you are doing parallel query. So this is a limit, and the reason it exists is to prevent you from severely over-allocating: hey, I'm going to have 100 simultaneous connections, each of which is going to use 128 megabytes of RAM, on my four-gigabyte Postgres database instance. That doesn't work out well. We would rather start cutting things off than over-allocate the heck out of the server.

In order to calculate this, to figure out a rational setting before you start responding to things interactively, you need to have an idea not just of how much RAM you have, but of what kind of workload you're running on this database server. If you have an interactive application workload, like a web application with hundreds of connections, then you can set this really low. First of all, most of its queries should be primary-key lookups, which require no work memory at all; and second, you have a whole bunch of concurrent connections, so allocating a lot of work memory per connection is a problem. By the way, here's one of the other ways that allocating a lot of work memory is a problem: if you're using persistent connections in your connection pooling, that work memory allocation never goes away, because the process on the Postgres side never goes away. You will see, over time, the actual RAM consumed by each of your Postgres backends go up to roughly double whatever the work_mem allocation is. That can be a reason to rotate your persistent connections.

So for a web application, a simple way to do this is to take your total amount of RAM and divide it by max_connections, assuming max_connections is a maximum you actually expect to be enforced. In other words, if you set max_connections to 200 because you expect that sometimes you might actually have 200 connections, then yes, divide by that. If you set max_connections to 1,000 because your boss never wanted to see a connection-refused message, but 99% of the time you only have 40 connections, then you might divide by something lower. The other thing to know is that work memory is actually allocated by doubling, so any remainder you have left over from a power of two is not used. I like to remind myself of that by always setting this to powers of two. It doesn't hurt anything if you don't; it's just that if you set it to, say, 315 megabytes, nothing over 256 is going to be used.

On the other side of things, if you're running an analytics application, you're going to want to allocate a lot of work memory, so you can do big sorts and merges and aggregations and the other operations that require it. But in that case you're expecting to have far fewer connections. Where this gets complicated is a mixed-use application. If you have one, remember what I said about setting this as a property of the individual login role? This is when you want to do that. You want to say: hey, my analytics application gets work_mem set to 256 megabytes, but my user-facing application gets it set to two megabytes. And you do that with ALTER ROLE.
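In SQL, that split looks something like this (the role names are made up):

    -- per-role work_mem for a mixed-use database:
    ALTER ROLE webapp SET work_mem = '2MB';       -- hundreds of connections,
                                                  -- mostly primary-key lookups
    ALTER ROLE analytics SET work_mem = '256MB';  -- few connections, big sorts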
Maintenance work memory. maintenance_work_mem is the memory used for two things: when you manually run maintenance commands such as VACUUM or ANALYZE, and when you do bulk loads. For largely historical reasons, honestly, bulk loads, particularly bulk index builds, use maintenance work memory and not work memory. Since this is for interactive, manual operations, I generally set it to roughly 1/8 of RAM, because I'm unlikely to be running multiple manual maintenance activities at the same time. Even bulk loads tend to be serialized, so they're not using multiple maintenance memory allocations in the same session at the same time. Keep in mind that there have been a number of performance tests showing that setting maintenance work memory above two gigabytes doesn't really help you anymore, and may in fact be counterproductive. I don't know whether that's been retested recently; the last time I saw somebody actually test it was around Postgres 9.5, I think, when we were debating raising the limit on maximum maintenance work memory. But at least at that time, more than two gigabytes didn't help. So certainly, out of the box, don't set it higher than two gigabytes, and most of the time use lower values than that. Still, you want that manual vacuum to complete quickly. Generally, if you are doing something like a manual vacuum or a bulk load, it's really important to you that it complete quickly, so you want to give it all the memory it needs to do that.

Now, one of the reasons I kept saying "manual" maintenance is that a couple of versions of Postgres ago, we added a separate variable, autovacuum_work_mem, where you can give autovacuum a different memory allocation than your manual maintenance procedures get. It still defaults to using maintenance_work_mem: if you see this value set to negative one, the default, it is the same as maintenance_work_mem. But you don't want it to be the same, because autovacuum uses multiple background workers, a default of up to three, and if you have a super-busy database with a thousand tables, you're actually going to want to raise the number of autovacuum workers. As a result, you want the memory allocation for autovacuum to be lower than for manual maintenance, because you really want autovacuum to be a background activity, and you don't want it pushing a whole bunch of stuff out of the file system cache every time it runs. On a default system, with three autovacuum workers, where most of the time only one is busy at any given moment, I set this to a 16th of RAM. And if you raise the number of autovacuum workers, you're going to need to set it lower.
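For example, a hypothetical many-table, multi-tenant database might look like this; note how the per-worker memory goes down as the worker count goes up:

    autovacuum_max_workers = 6     # raised from the default of 3
    autovacuum_work_mem = 256MB    # lowered to compensate: more workers
                                   # running concurrently, less RAM each
    maintenance_work_mem = 2GB     # manual VACUUM and bulk index builds
                                   # keep the big allocation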
Effective cache size. Now, this one is different, because everything else I've talked about in memory so far has been either an allocation, like shared_buffers, or a hard limit, like work_mem. effective_cache_size is just an advisory to the query planner. It tells the planner how much memory you have available for file system caching on the system. Postgres wants to know this because, as I mentioned, Postgres's dedicated cache is only a frequency cache; what we rely on for our second-level cache is the file system cache, from our Linux file system or Windows file system or whatever. We rely on that as our larger, bulk cache. But Postgres needs to know, as a very rough estimate, how big it is, so it can guess whether the database object you're trying to access is likely to already be in the file system cache, or whether Postgres is going to have to go out to long-term storage. The easy rule, though: if you're setting shared_buffers to about a quarter of available RAM, then you set this to three-quarters of available RAM. And, as with the other query planner advisories, this is kind of an order-of-magnitude thing. You don't have to be precise about it; as long as you're within a reasonable factor of three, it will advise the query planner in the correct way. The default, by the way, is four gigabytes, which is why a lot of people don't have to adjust it: for most people running on RDS instances or smaller servers or shared servers or whatever, the default works fine.

Now, random_page_cost. Also an advisory, and this is the one disk setting I put in our top list of things to adjust. This used to be one we told you not to touch, all the time, because people would touch it as a way of clobbering the query planner, which would produce as much undesirable behavior as desirable behavior. That's not true anymore, and here's why. random_page_cost represents the ratio between the time required to seek to an individual database page on disk versus scanning a larger file and finding that page. On spinning disk, that ratio stayed relatively consistent at four to one through multiple generations of spinning-disk technology. However, increasingly, people are not using spinning disks for their databases. They're using SSDs, solid-state arrays, high-end SANs with enormous RAM caches. And for all of those, seeks are much less expensive than they were on spinning disk. Given that, in any of those environments (which includes things like Amazon EBS solid-state storage and other cloud storage), you want to lower random_page_cost, like, a lot: to 1.5, compared to the default of 4.0. It's still more expensive to do a seek, but it's not much more expensive. And when you do this, the query planner will make better decisions about whether to use an index or a table scan, whether to do an index operation in general or a full-table operation, when running queries.
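In conf form, that comes down to one line for SSD-class storage:

    random_page_cost = 1.5    # SSDs, solid-state arrays, big-cache SANs,
                              # cloud block storage; leave at the 4.0
                              # default only for spinning disk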
So I'm going to pause for just a moment before I go on to some of the other top settings: do we have any questions about the connection settings, or the memory and disk settings? Questions? Okay, the question is: can the SSL settings reference a PEM bundle, instead of broken-out key and certificate files? The answer is yes, but I would actually have to look up exactly how you do it, because you do have to do a weird thing when you're using PEM bundles. There is a way to make it work; I just have to look it up.

You should lower it, yes. Because if you are raising the number of autovacuum workers, which is a setting you can change, and the case where you'd raise it I've seen a lot: people doing multi-tenant applications may have literally 100,000 tables. In that case, three autovacuum workers is not going to cut it to keep up with cleaning all of your tables, so you might have five or 12 or 15. But the thing is, if you're adding autovacuum workers, that's because you expect them to run concurrently, which means you need to lower the amount of memory each one has available, or you will find those autovacuum workers consuming the majority of RAM, which will crowd out your file system cache.

Anything else before I go on to some safety settings? Okay. Safety settings. These are all about database reliability and data safety. The first one is a hard one, because it can only be set at initdb time, that is, when you initialize your database directory. That means if you're already up and running in production, it's going to be very hard to turn on, unfortunately. This is data checksums. The idea of data checksums is that we add a checksum value to every database page, so that as soon as we read it back, we can find out whether it has been corrupted in memory or on disk. The reason you want this is obvious: memory corruption and disk corruption happen. Even if you had phenomenally perfect hardware, the most expensive hardware in the world, there are still things like cosmic rays; cosmic-ray bit-flipping is actually a thing if you have enough machines. So stuff does get corrupted on disk. It will happen to you. And when it happens, it's really useful to find out as soon as possible, because one of the ways corruption can express itself is stealth corruption, in which a query simply returns the wrong results. That is much worse than getting an error message, in most circumstances. So use data checksums if you can.

Now, the obvious reasons not to. Number one: you're already in production, and re-initializing the database server, which would require a downtime-based backup and restore, is just something you can't schedule. In that case, keep this in mind for the next time you do a Postgres version upgrade, and maybe when you do that upgrade, also turn data checksums on in the new version. The second reason not to use it is that it does add extra I/O. There is a significant amount of extra I/O, in terms of writes to the transaction log, to support checksums, because there's information that now has to be written to the transaction log that wouldn't have been if we weren't checksumming, plus some CPU for generating the checksums themselves. So there is a performance overhead, and if you are running really close to the wire on performance, you might be dubious about turning checksums on. But even then I'd ask you to reconsider, because here's one of the problems: depending on how the corruption happened, database corruption can follow replication. Saying "hey, I have lots of replicas, so I'm safe from disk corruption": you are not safe, because that corruption can get replicated, depending on how it's expressed. And then all of your databases are corrupted, in which case you don't actually have a backup that helps you. So: turn data checksums on if there's any way for you to do so.
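Turning it on is a single flag at cluster initialization (the data directory path is hypothetical):

    # checksums can only be enabled when the cluster is initialized
    initdb --data-checksums -D /var/lib/postgresql/10/data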
Next one. This is one of those features that I honestly waited my whole Postgres consulting career to get, and of course it became a Postgres feature after I stopped working on Postgres: the idle-in-transaction timeout. Open transactions in Postgres consume resources. Particularly, they pin things, row versions and sometimes memory, in a way that prevents other concurrent sessions from evicting them, or, in the case of autovacuum, from cleaning them up. Say you have an old row version, a row that was deleted 15 minutes ago and is no longer visible to anyone. If you have a transaction that's more than 15 minutes old, that row still cannot be scrubbed from the representation of your database. And that doesn't sound like a really bad thing until you consider something like queuing tables that might be deleting hundreds of thousands of rows a minute. And none of those can be scrubbed. Now, it's one thing if the reason we can't clean up after this concurrent activity is that you have long-running reports and other necessary work that takes a long time. But it's really stupid if the reason you're pinning all of these resources is that some buggy application opened a transaction and then went to sleep. And it happens a lot. When I was a Postgres performance tuning consultant, we had a bunch of scripts we would run that would look for idle-in-transaction sessions. Hour one of setting up our instrumentation at a customer was to set up the idle-in-transaction monitor, because 75% of the time the client would have an idle-in-transaction problem they didn't know about. Depending on the rest of the workload on the database, that can be a minor problem or a devastating one. So you want this turned on. It simply looks for abandoned transactions: transactions where no command has been sent. It measures from the last time a command was sent in that transaction. Now, you obviously don't want this to be zero, because applications need some kind of think time. But I would start it at, like, half an hour and then look at cranking it down, because really there aren't a lot of legitimate reasons for an application to need to think about something for a full minute. But don't start there, because I think you'll discover, if you've never done this before and you set it to 30 minutes, that you have a bunch of broken connections. And one of the other things I've discovered is that buggy applications that abandon transactions also tend to handle having their connection terminated poorly. So you're going to have to do some cleanup and refactoring on the application side once you turn this on. But it'll be good refactoring. It'll help you.
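A minimal sketch, starting generous the way I just suggested; the setting takes the usual duration units:

    ALTER SYSTEM SET idle_in_transaction_session_timeout = '30min';
    SELECT pg_reload_conf();

Then watch what breaks, and crank it down from there.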
Statement timeout. Here's the other thing, and this one is a little more complicated to set, because you have to actually know your application workload well. Sometimes queries get stuck, for a variety of reasons. They get stuck behind IO queuing. They get stuck because of a really terrible query plan where we're re-scanning a really big table 100 times. They get stuck because of process management problems or something else that hangs up the query. It happens some of the time, not a lot of the time. Well, depending: bad-query-plan problems, when they do happen, tend to hit every connection. Other problems don't happen all the time. But even if it only happens 0.1% of the time, depending on your query load, that could still be a lot of queries. And those long-running statements consume a lot of resources, because usually that's why they're long-running in the first place. So if they are long-running for bad reasons, you want to kill them off, preferably automatically. That's where statement_timeout comes in. There are two things about statement_timeout. One: it requires you to figure out the longest-running legitimate request you might have. Two: you want to set it on a per-application basis. Because what counts as the longest-running legitimate thing depends on which application we're talking about, right? On the one hand, for your Node.js web front end that does 98% primary key lookups, a reasonable statement timeout might in fact be two seconds. Whereas pg_dump is going to need a statement timeout measured in a couple of hours, because it's dumping the whole database. So you're probably going to want to set this by login role in order to get a reasonable setting for each application. And you might be surprised: setting this takes care of a lot of your performance overhead problems, because it nails that less-than-1% of operations that get stuck. It's better for the application to retry those than to keep consuming resources. Temp file limit. This is another safety setting that distressingly few people know about. I personally think it should be on by default, but I lost that argument. temp_file_limit says: if any database operation needs more than this amount of space on disk, terminate it. Now, why do we put things on disk? Well, imagine we want to sort 12 gigabytes of data, and we've set work mem to only 128 megabytes. We're not allowed to do that sort in memory, because we're trying to allocate our memory responsibly. So where do we do the sort? On disk. We do a tape sort on disk, or a modified tape sort; we actually have a second kind of on-disk sort now, but anyway, we do a disk sort of the data so we have room to expand it. That disk sort uses a significant amount of disk space and IO. And the thing is, if a really bad query comes in, somebody manages to mess with your search interface so they pass a search that, if executed, sorts 100% of the information in your database, and I have seen it, this can result in Postgres attempting to use more temp file space than you actually have disk. Which then results in downtime that's really irritating to recover from. So you want some kind of limit here; it can be a high one. The first thing you do to pick the limit is look at your available disk space, right? If you have 30 gigabytes of disk space available, you should probably set this to some small fraction of that. Set it to, like, five gigabytes, so a single operation is not gonna use more than one-sixth of your available disk space. Beyond that, you once again have to consider your workload. Analytics applications are going to require a higher temp file limit than interactive web applications or mobile applications, which often aren't doing any disk sorts at all. And one of the ways you can monitor this is a setting called log_temp_files, in the logging section, which will tell you how much temp file space you're actually using, so you can get a good idea of what's reasonable for a limit.
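Sketched out, with hypothetical role names standing in for your actual application logins:

    ALTER ROLE webapp SET statement_timeout = '2s';      -- hypothetical web front-end role
    ALTER ROLE reporting SET statement_timeout = '2h';   -- hypothetical batch/reporting role
    ALTER SYSTEM SET temp_file_limit = '5GB';            -- about 1/6th of a 30GB volume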
So, questions about limits before we move on to replication? No questions on limits? On safety? Okay. Okay: replication and archiving. For everybody who's using Postgres in production here: are you using Postgres binary replication? Yes, no? Nobody's using Postgres binary replication? The standard hot-standby master-slave replication? That would be a little surprising. Wow, okay. Maybe I should have given a replication tutorial instead. Are people using continuous backup? Postgres archiving, at least, I hope. Okay, so all the rest of y'all are okay with losing, like, a whole day's worth of data, apparently, or however often you take pg_dumps. Well, actually, is anybody here using the new logical replication? I mean, that's new, it's new in 10, right? It's awesome, but it is new, so I'm not surprised nobody's using it. So I'm going to say: if you have important business data in that database, you should certainly have a replica, or a continuous backup archive, or preferably both. Because unless you can easily and quickly rebuild all of your database data from its source material, losing it is not survivable. That's what we call a resume outage: the database crashes, and you send out your resume, because you know you can't recover it. You had a question? The question was whether I'm suggesting that the master have a continuous backup in addition to the replicas. Yes, I am suggesting that. The reason why is that replicas are great for fast recovery, and they're going to be your primary method of recovery. But sometimes the event you have to recover from happened at the database level, like somebody typed DROP TABLE customers. It's amazing: you can run in infrastructure where your replication lag due to bad network problems averages 15 minutes, and the minute somebody sends DROP TABLE customers, replication lag is zero. Your replicas won't help you recover from that kind of mistake. That's why you ideally want continuous backup. Obviously, if you have a really small database and you can do pg_dumps once an hour, that is an option for you. Continuous backup, or pg_dumps, is your only way to go back in time if you have to, when there's a problem that happens not at the hardware and server level but inside the database. That's why I'm suggesting you do both. So, there are some settings here. And again, like the logging settings, these replication settings come in groups; it only makes sense to set several of them together. Now, one of the really nice things somebody, I think Heikki, did in 9.6 was to adjust the defaults for a bunch of the replication settings to reasonable values. So you no longer need to change anything just to turn replication on at all; it's automatically on the moment you allow replication connections in pg_hba.conf. That's nice. However, you probably do want to tweak a couple of other things if you're running replication in production. So here's an example of just your basic binary asynchronous replication, a master with an asynchronous replica, which is 95% of all Postgres replication. One of the things you might want to tweak is a setting called wal_keep_segments, which says: keep this much extra transaction log around in case a replica needs it to catch up. What happens is, if the replica loses its connection to the master, because of a network outage or something, it needs to catch up on transactions it might have missed, and the way it catches up is from the master's transaction log. But if the master has recycled that portion of the transaction log, it can't catch up at all. Now, by default, Postgres doesn't tend to recycle the transaction log particularly quickly. But just to avoid bad-timing incidents, where a five-minute network outage happens at exactly the wrong time and you lose those transactions:
set wal_keep_segments to a small value like four segments, which gives you a little padding, a total of 64 megabytes' worth of log. Now, if you have a very fast transaction load, where you're going through 16 megabytes of transaction log every 30 seconds, you might want to set this to a higher value, like 128. But not necessarily, because what I'll suggest in that case is that your replication really should be using replication slots, which are a better mechanism for this. What wal_keep_segments ends up covering you for, more, is actually starting replication in the first place: when you spin up a new replica, it needs a certain amount of catch-up time when it first comes up, and it doesn't have a replication slot yet to hold its bookmark. So for that case you want wal_keep_segments, but at a small value. One of the things that's not on by default on the standby, for reasons I disagree with, is hot_standby_feedback; I think you want it most of the time. hot_standby_feedback says the replica should tell the master about its workload, so that the master doesn't delete data the replica was looking at and then send that deletion over the replication stream. That causes something we call query cancel, where after a certain timeout, whatever's connected to the replica gets a termination error saying the data you were looking at has been removed by replication, so you have to start over. Most applications are not prepared to handle that kind of termination, so you don't really want it to happen. And one of the ways to prevent it is for the replica to tell the master about its workload. Now, this is a trade-off, because it means there's less the master can clean up and vacuum on its own, since it's effectively carrying the workload of the entire replication cluster. So if you have a master with 30 or 40 replicas, you might want to refactor your application to deal with query cancels rather than having hot standby feedback turned on. But most of the time, for most users, you do want hot_standby_feedback. track_commit_timestamp: not on by default, and I don't understand why. This simply says: every time I do a commit, record a timestamp with it, and make that available through certain Postgres functions, so you can keep track of when a commit happened. The overhead for this is not substantial, and it's useful information to have, particularly if you're going to monitor replication lag. It's a better way of monitoring replication lag than saying, hey, I'm gonna flip a boolean every 30 seconds and see when it changes on the replica. max_wal_senders and max_replication_slots: the defaults, 10 and 10, are pretty good for most people. Most people are not running with 40 replicas; most people are running with one or two, in which case you don't need to touch the defaults. But if you are the person running with 40 replicas, you need max_wal_senders and max_replication_slots to be higher. I generally set them to double the maximum number of replicas you ever expect to have. There are two reasons for that. Number one: right when replication starts up, depending on how you're starting it, it can use two replication connections instead of one. The second reason to double is that every time you estimate "the maximum X I might ever want," you are going to estimate too low.
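Putting that together, a sketch of the master-plus-one-async-replica case, with the values from the discussion above:

    # on the master
    wal_keep_segments = 4        # ~64MB of WAL padding for reconnects and spin-up
    max_wal_senders = 10         # the Postgres 10 default; double your expected replica count
    max_replication_slots = 10
    track_commit_timestamp = on
    # on each replica
    hot_standby_feedback = on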
And changing max_wal_senders requires restarting the master, which means downtime, which means most people don't want to do it. Replication slots are bookmarks that the replicas hold on the master, marking where they last left off syncing data. They're required for logical replication, and they're a good idea for binary replication. So again, you want these two settings to match up. Now, the new logical replication. It's a new feature in Postgres 10 that allows you to replicate individual tables, to filter replication, and to replicate across different Postgres versions. But it does make replication more complicated, because you are replicating individual tables, not the whole database, which is a much simpler concept. Still, lots of you are going to use logical replication in the future, for these and other reasons. And in that case, here's an example configuration. Some of this you have to raise: what's called the wal_level. wal_level is how much information I'm writing to the transaction log. A replica doing regular binary replication requires a little extra information in the transaction log; logical replication requires even more. And by the way, this means that if you're not doing any replication at all, for whatever reason, you actually want wal_level set to minimal, so you minimize your IO, with the understanding that if you ever want to do replication in the future, you're going to have to change the configuration and restart. So if you want to do logical replication, you have to raise that level from the default. Also, if you have logical replicas, which are not doing binary replication, you actually want to turn hot_standby_feedback off on those. They're still making a replication connection to the master, but you don't want them publishing their visibility information, because it's not useful and it's just a burden on the master. And definitely, there are different logical replication plugins, but most of them require you to turn track_commit_timestamp on. Then, finally, archiving, which is what we're talking about with continuous backup. This actually only has a couple of settings associated with it. One is archive_mode, which turns archiving on, and the other is archive_command, which tells it what to do. Now, that archive command should be a command supplied to you by the documentation of whatever piece of software you're using to manage archives. And I'm hoping that piece of software is something better than an ad hoc bash script. Yes, you can do an ad hoc bash script, but really you should be using something like WAL-E in an Amazon environment, or pgBackRest in a bare-metal environment, or something else that does log rotation, snapshot rotation, and scheduled regular full snapshots, and provides you metadata so you can browse your own backups. Because doing this ad hoc is, first of all, likely to fail in unexpected ways when things happen that you didn't expect, and second, a real management nightmare. But you'll supply that archive command. Now, here's one hint. Turning archiving on requires restarting the server. Changing the archive command does not. So my recommendation is: even if you're not doing archiving now, turn archive_mode on and set archive_command to something like /bin/true on Linux. /bin/true always returns success, so the archiving will always "happen successfully."
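Which looks like this in the conf file:

    archive_mode = on               # needs a restart to change, so flip it now
    archive_command = '/bin/true'   # harmless no-op until real archiving tooling arrives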
And then someday, when you actually install pgBackRest, you don't have to restart the master to make it work, because you can just change the archive command. Also, that is your way out of a bottleneck if your archiving starts failing and you fill up your disk, because if archiving is failing, Postgres will keep the transaction log forever. Your way out of that is to change the archive command to something like /bin/true, which returns success automatically, and Postgres will start deleting those archived segments. Otherwise it's very hard to get out of that situation. So, we are at 11 o'clock now, and my last bit of this set of core recommendations is going to be logging settings. We'll take that up at 11:15 when we all come back from the break. See you in 15 minutes. Hey folks, we're gonna start back up. Let me get started here. We'll get into logging settings, which is another area we're gonna touch. And actually, in terms of settings you're going to touch frequently and repeatedly, it's going to be these logging settings, because Postgres really allows you to log a truly amazing number of things. So much so that turning them all on can mean you're writing more information to the activity log than to the database. I've seen it. So the answer is not to turn all the logging things on all the time; the answer is to turn them on specifically when you're going to make use of them. And again, like the replication settings, the logging settings only make sense when you combine several of them to produce a particular result. So for example, most people installing on bare metal are actually going to log Postgres to syslog, and that means you need to set a couple of things. Postgres has a setting called log_destination: where is the log going to be output to? This is actually a list. You can tell Postgres, I want to write the log to syslog, and to a custom file, and to a CSV file, all at the same time. Postgres will do that. The IO effect is probably not desirable, but it will do it. Most of the time, though, people are logging to one destination. And under most circumstances that's syslog, because they have some kind of syslog management in place, because they have unified cluster-wide syslog, or they're running containers and they've got syslog feeding off to Prometheus. Either way, it's going to be syslog. So, one: you have to tell it you're logging to syslog. Two: you have to tell it which syslog facility you're using, if your system has multiple ones. By default on a fresh Linux system you have, I think, a few, of which local0 is for applications. But obviously, if you're logging to something like rsyslog, the syslog facility name may change. And then the important thing a lot of people forget to set is syslog_ident, which tells it how you're identifying this particular Postgres. And I'm saying you actually should have some kind of unique name there.
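So the syslog recipe looks roughly like this; the ident is a made-up name, and the log_line_prefix shown is the one discussed in a moment (close to the form pgBadger expects when parsing syslog):

    log_destination = 'syslog'
    syslog_facility = 'local0'
    syslog_ident = 'pg-orders-prod'   # hypothetical unique name per instance
    log_line_prefix = 'user=%u,db=%d,app=%a,client=%h '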
A lot of people just set syslog_ident to postgres, which is great up until you have three different Postgres-backed applications, each with three replicas, and you're using rsyslog or Loggly or something to unify all of your syslogs across your whole enterprise. Now you have a million entries that all say postgres, and you don't know which Postgres it is. So give each one a unique name according to whatever naming scheme makes sense to you. If you're running pure bare metal, hostnames can work. If you're running containers, container names can work. If you're running VMs, VM names can work. Something that identifies which Postgres is producing the logs. The other thing you want to add for syslog, because syslog does not allow us to reformat the entry, is a log_line_prefix, which supplies extra information about what got logged. The suggested log_line_prefix gives you the database name, the login role name, the application name if application_name is set (which I suggest setting in the client), and the IP address of the client, which is useful or not depending on whether you're using connection pooling. The other reason for suggesting this particular log_line_prefix is that it's the one pgBadger uses when pgBadger has to get this information from syslog; I'll explain that in just a moment. Windows uses the log destination eventlog; otherwise you can set a lot of the other things the same as above. Here's an example of having a private Postgres log, on a not-heavily-used Postgres server, where you just want to log to a file and have it rotate weekly. The way you actually manage rotation is with two settings: log_filename, which uses strftime escapes to name the file (the escapes are listed in the documentation; %a is the day of the week), and log_truncate_on_rotation, which is whether you truncate the log segments on rotation. The alternative on rotation is to append, and you don't really want this Monday's log appended to last Monday's log under most circumstances. And then here we're also logging every query that takes more than a hundred milliseconds, which is log_min_duration_statement. Most of the time, if you are monitoring query performance, you're gonna want log_min_duration_statement. There are two other settings, log_statement and log_duration, that you use under special circumstances, but most of the tools I know of that analyze query performance want log_min_duration_statement. The difference is that log_min_duration_statement waits until the query completes and then, on one line, logs the duration of the query and the query itself. log_duration with log_statement logs the statement when the query starts, then logs the duration when the query completes. The problem is, if you're doing that through something like syslog, it can be very hard to reunite the duration line with the original query line in whatever your log analysis tool is. For that reason, log_min_duration_statement tends to be a lot more useful. There is one circumstance where log_statement is much more useful, though. Guesses? Anyone? If the query is not completing, then you need log_statement, to log the query that's never going to complete.
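The private-log-file version might look like this, with weekly rotation by day-of-week name and the 100-millisecond threshold just mentioned:

    logging_collector = on
    log_directory = 'log'                # relative to the data directory
    log_filename = 'postgresql-%a.log'   # %a = abbreviated day name: one file per weekday
    log_truncate_on_rotation = on        # overwrite last week's file instead of appending
    log_rotation_age = 1d
    log_min_duration_statement = 100     # log any query over 100ms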
Oh, and log_statement versus log_min_duration_statement: they actually interfere with each other if they're both turned on at the same time, so don't do that. You end up with a really weird log output that no log management tool I know of can deal with. Here's an example setup. One of the things I used to use a lot with Postgres is a performance analysis tool called pgBadger. pgBadger takes a log-all-the-things log and digests it, importing it and running a whole bunch of routines over it to analyze things like how many connections you had in each minute, how long your checkpoints took, how many temp files you used, and everything else. And that requires you to turn on all the log things. Now, pgBadger does not require you to log all queries, which is what log_min_duration_statement = 0 means: log every query, no matter how short. However, if you are doing actual comprehensive performance analysis of your application, you need to log all queries. A lot of people assume that what's consuming all their resources is a few long-running queries, which is sometimes, even often, the case. But sometimes what's consuming all your resources is thousands or millions of very short queries. I've been in situations where SET TIME ZONE was in fact consuming 10% of server resources, because the front-end application had been programmed to set the time zone before every single statement, or before every entrance into a function. So every connection was sending SET TIME ZONE several times, often several times per second, and it turns out that in the aggregate, that was consuming a lot of resources. So at some stage, if you're gonna do full performance analysis, you actually want to log all the queries. In that case you're probably gonna want to set log_directory and send the logs somewhere with the storage to handle it, because if you're logging all queries on a server that does a thousand queries a second, you may discover that it generates a gigabyte of logs per minute. So you're gonna want to monitor that, and be actively ready to switch back to your default log settings if it starts to fill up the disk. Now, a few more of these settings. log_rotation_size is how big the activity log chunks should be. If you do not need to go through syslog, you want to use csvlog, comma-separated-value formatted logs. That format contains every field Postgres can log, so you don't need to set up a special log_line_prefix to capture things like user and application name; they're just in there as columns in the CSV. It's the easiest thing for external tools to digest. For that matter, if you have something like Grafana that wants JSON, it's trivially easy to convert a CSV file into a JSON file. So if you're gonna be processing the regular logs, set it that way. And then we can log lots of things. We can log checkpoints, to find out how often they happen, how long flushing memory took, and how much IO bandwidth it used. You want to log connections and disconnections, to see how many concurrent connections you have to Postgres and what your connection activity looks like.
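A sketch of the log-everything-for-analysis setup; run it only while you're actually collecting, and watch your disk:

    log_destination = 'csvlog'        # every loggable field, as columns
    logging_collector = on
    log_min_duration_statement = 0    # log every query, however short
    log_checkpoints = on
    log_connections = on
    log_disconnections = on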
Logging lock waits. A lock wait is whenever a query or statement or function has to wait on another one in order to do anything: it wants to modify a row that another query has open for modification, or it requires a full table lock because I'm going to change an index or something else. Then you've got a lock wait. Lock waits can be very destructive to performance, because they pin resources, and under really bad circumstances they result in pile-ups, where you have a whole chain of 15 or 20 or 25 different sessions all waiting on locks to be released. So you want to look for that activity. The other reason you really want to analyze lock wait activity at some point is that it is the one common chronic performance problem that cannot be solved by adding more system resources. As a matter of fact, if you add more cores and more memory, lock wait problems get worse, because that just means you have more concurrent connections to conflict with each other. Now, the problem with lock waits is that we can't currently log them at the moment something first starts waiting. Because lock management is distributed, for very good performance reasons, the only central thing that knows what's waiting on locks is the deadlock timeout monitor. The main purpose of the deadlock monitor is to find sessions that are deadlocked. That is: query one wants to modify row one and then row three, query two wants to modify row three and then row one, and the two start simultaneously. They are never going to complete; they're each waiting on the other. The deadlock monitor looks for cycles like that and resolves them by terminating sessions until the remaining sessions can proceed. Normally the deadlock timeout is one second, which is actually a pretty good default. But if you're specifically doing lock analysis, you want to crank that timeout down. What this set of settings means is: report on every lock wait that's more than 100 milliseconds long. Unfortunately, cranking your deadlock timeout down to something like five milliseconds would be really expensive in terms of database performance, because the deadlock check is actually kind of expensive in the amount of memory it touches. So you can't really do that. But 100 milliseconds, I find, is a good balance between affecting performance and getting useful reporting data. Log temp files: again, like I said with temp_file_limit, we want to know how much temp file space we're actually using, so we can set the limit rationally. So log your temp files. And then we put this next one in a long while ago, because sometimes people change the error verbosity, which you can do: you can tell Postgres to be super explicit in the things it logs, or very terse. The three options are terse, default, and verbose, and log digestion tools generally expect default.
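So a lock-analysis session might look like this, going back to the defaults when you're done:

    log_lock_waits = on
    deadlock_timeout = 100ms      # temporarily; the check is too expensive to run much faster
    log_temp_files = 0            # log every temp file, with its size
    log_error_verbosity = default # what log digestion tools expect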
And so those are your logging combinations. Any questions about the logging settings? Or, for that matter, about the simple config file, which I have supplied for your tearing apart and using in your own infrastructure? Any questions about the simple config file before we move on to some other settings? So, if you downloaded the GitHub repo, I have a second file with settings that are touched less frequently, but that can be useful under circumstances that don't apply to the majority of users, yet happen often enough that you want to know about them. This is sort of our second tier. What we just went over is the A tier of settings that, if you administer Postgres long enough, you probably will touch. These are settings you might want to know about but can get away with ignoring. I've used a more terse format here, and a lot of these settings honestly have to do with performance optimization. It's saved as extras.conf. The thing is, if you are not in any way rate-limited on Postgres, if you've got a Postgres instance running at about 10% of system capacity, eh, ignore most of this. But if you're interested in getting into Postgres performance tuning, this is a bunch of settings you'll want to know. Or if that 10% of system capacity has crept up to 30% in the last three months, this is a group of settings you'll probably want to know. But first, let's start with the not-so-performance-oriented ones. One is authentication_timeout. Unlike some of the other timeouts, this one says: if somebody starts an authentication handshake in order to log into Postgres and doesn't finish it, when do we give up and reallocate the socket? And there's an important rule for this one: it should be the same as whatever is set in your driver or your application on the client side. For example, I believe the default for JDBC is 20 seconds; with default pgJDBC settings, Java will give up on connecting to Postgres after 20 seconds if it can't complete the connection. Well, Postgres' default is 60 seconds. So what happens when Java has given up after 20 seconds and Postgres doesn't give up for 60? You've got 40 seconds of Postgres holding the socket open for no good reason. And if you are having chronic network problems, this can show up as really destructive on the Postgres side of things, including running out of connections. I had a situation once where we were hitting the connection limit even though there were only 25 live connections and the limit was 300. What was going on? It turned out we had a lot of aborted connection attempts whose sockets were still being held. Now, I had to fix whatever was causing those connections to abort. But the other thing you can do, to make this less destructive in the future, is make authentication_timeout match the application's timeout. That's obviously a lot harder if you have a diverse array of applications with all different timeouts; my argument there is that you should refactor those applications to synchronize their timeouts, but whatever.
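For instance, something like this, assuming the JDBC figure from the talk:

    authentication_timeout = 20s   # match the client driver's connect timeout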
The other thing is that there are two more places you can give Postgres a unique name, and again, this is useful for identification. You can give Postgres a cluster_name, and then, when you're looking at the Linux process list or the Windows application list, the cluster name you've set shows up there. So if you're running more than one Postgres instance on a particular machine, that is super useful. And honestly, it's useful even if you're not running more than one per machine, because if you have different databases with particular purposes, this is the reporting database, this is the transaction database, this is the dev backup, and you're looking at something on a server and you're about to start killing stuff, it's useful to have that reminder of which database you're looking at. That yes, in fact, you are looking at the dev database and not the production one. Similarly, event_source does the same thing for the Windows event log that syslog_ident does for Linux syslog. So you can have an identifier in the Windows event log, once again in case you're using generalized event log management and consolidating somewhere, so you can tell which Postgres a log entry came from. Okay, now we're into more interesting performance tuning, with a controversial recommendation for you: synchronous_commit. Here's the reason you'd touch this. Sometimes people are in an environment where storage latency is very high, they're performance-constrained, they can't really improve the storage latency, and losing up to half a second of data in the event of a hard crash is okay. A good and very common example: I am running an interactive web application on Postgres that's hosted on AWS. EBS write latency, because it's virtual storage on top of virtual storage, is actually kind of high compared to bare-metal latency, and that latency can dominate a database application that's doing small writes. The same is true of other forms of cloud storage, like Gluster and Ceph. The latency of writing a single 140-byte row in Postgres can be destructively high in those environments, because you're actually writing a whole 8K page, and that 8K page may mean rewriting a whole megabyte block on the storage side, and that megabyte block has to go through whatever the cloud storage's clustering system is, which can be extremely slow. So if you're saying, hey, I could lose half a second of data and I'd survive, and I would really like waiting on the transaction log to stop dominating all of my performance profiles, then turning synchronous_commit off by default can be an option for you. Particularly because in Postgres, nearly unique among transactional databases, there are only two databases that do this, Postgres and Cassandra, you can set your level of synchronicity in an individual transaction. So you can say: hey, normally I'm okay with losing up to half a second of data, but this transaction is saving financial data, so I'm turning synchronous_commit back on for this one. You can actually do that, and as long as you're doing it for less than 10% of transactions, you can get a real boost in latency performance by turning synchronous_commit off.
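Sketched as the per-transaction override; the table is hypothetical:

    -- postgresql.conf: synchronous_commit = off   (cluster-wide default)
    BEGIN;
    SET LOCAL synchronous_commit = on;  -- financial data: flush this commit for real
    INSERT INTO ledger (account_id, amount) VALUES (42, 100.00);
    COMMIT;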
WAL buffers. wal_buffers is a small dedicated area of memory where we marshal data to be written to the write-ahead log. The defaults are pretty good: they go up to eight or sixteen megabytes, depending on what you've set shared buffers to; it's a calculation against the size of shared buffers, and any realistically sized new system is going to use the max. That honestly works okay for most people, but if you have very high concurrency, if you have 200 busy connections on a system with 24 cores, you'd actually benefit from raising wal_buffers, just to deal with concurrent write contention. The last time I saw any performance testing on this, the maximum that's actually useful to raise it to is about 128 megabytes. So if you're in that kind of high-concurrency situation, consider manually raising it to that. WAL compression. New feature in 9.5 or 9.6, I believe. One of the things Postgres does, depending on whether you have it turned on, is full page writes to the transaction log, and if you are doing replication, Postgres does not allow you to turn full page writes off, because we need those complete 8K data pages in order to replicate them. The drawback is: hey, we're writing a lot of 8K pages when all we did was modify a single 100-byte row. It seems like there should be some way to make that more efficient. Well, wal_compression does make it more efficient. But like any form of compression, we're trading CPU time for IO time. Most of the time, in most situations, that's a good trade, because you're more likely to be waiting on IO latency than on CPU queuing. But look at your system: if you've got a bunch of CPUs that are maxed out and your IO waits are low, turning on wal_compression is not gonna be a win for you. In most situations, though, it will be.
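For the high-concurrency case, that might look like:

    wal_buffers = 128MB   # only worth it with hundreds of busy connections
    wal_compression = on  # trades CPU for WAL I/O; skip it if you're CPU-bound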
Another controversial recommendation: turn autovacuum off. When? Well, if you have a database that is just for analytics, so all new data comes into that database as programmatic bulk loads, and you are going to put a bunch of manual VACUUM and ANALYZE statements into those bulk loads at the appropriate points, then you are better off turning autovacuum off, because autovacuum has this annoying tendency of kicking in and wanting to do its cleanup in the middle of a bulk load, which makes the bulk load slower. But only do that after you've instrumented your bulk load code so it is doing those manual VACUUM and ANALYZE statements. Now, here's a thing people run into all the time: hey, why isn't Postgres using this index that I have? Usually people try to clobber that over the head by setting random_page_cost to some ridiculous level like 0.1, which may improve that particular query you're looking at, but will tend to break a whole lot of other queries. What people don't realize is that there are subtler knobs. Postgres uses what's known as a cost-based optimizer. It basically plans out all the different query plans by which it could possibly execute a given query, calculates mathematically how much each of those will cost, and executes the lowest-cost one. Well, there's a whole bunch of constants that go into that cost, and two that are useful for subtly influencing Postgres to favor indexes, and to favor functional indexes and multicolumn indexes over single-column indexes, are cpu_index_tuple_cost and cpu_operator_cost. So if you've been analyzing a lot of slow queries on your Postgres, and you're saying, Postgres is just not using this index, even though I've done the back-of-the-envelope calculations, it only needs to access 2% of the table, the index should be relevant, then before you go clobbering Postgres over the head with the less subtle settings, try lowering cpu_index_tuple_cost and cpu_operator_cost slightly. You may discover that fixes the queries that are borderline cases, without breaking everything else. And of course, you can try those interactively before you set them as defaults. Default statistics target. The default of 100 is good for things like interactive applications. However, if you have big data, 10 million to a billion rows, remember this is a static sample size, not a percentage, and it's gonna be a little small. So you're going to want to, first, raise the default, and second, on specific tables or specific columns, you may want to raise the sample size a lot. You can set that per column, with ALTER TABLE ... ALTER COLUMN ... SET STATISTICS, to some value up to 10,000. Again: an interactive application, something under 50 gigabytes in size, probably doesn't need this touched. Larger data, you may want to increase the default, et cetera. A good blanket default to increase to would be about 500 or 1,000; you definitely don't want to raise the blanket default more than that. Beyond that, you really want to be setting it on tables, or even on individual columns, such as indexed columns or numerical columns. Effective IO concurrency: an easy one to set. You have high-end storage that can handle a lot of concurrency? Tell the query planner. The default is one or two, I forget which, and that tells Postgres you have negligible concurrency in terms of being able to access, say, multiple indexes at the same time. So if you're on SSD storage, or a high-end SAN, or cloud storage, your concurrency is generally pretty good. In that case, raise it; I'd say the useful maximum is four, because without resorting to parallel query, Postgres is limited in its ability to parallelize things. So bump it to four. Here's one that's a just-do-it, and it will go away as a problem in Postgres 11. In 9.5 or 9.6 we added an option called replacement_sort_tuples, an alternate sort path for on-disk sorts that was faster for small sorts. With the general improvements that were made to sorting in Postgres 10, it's no longer an improvement. Unfortunately, the default in Postgres 10 is 150,000, meaning at 150,000 rows or fewer it's going to attempt to use this sort, and you don't actually want it doing that. The number of situations where it's really a problem is fairly small, which is why I didn't put this in my core settings, but if you're going to get into performance tuning, just set it to zero. In Postgres 11 this will go away, because the setting will go away.
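A sketch of those planner-related extras together; try the cost constants interactively with SET in a session before baking them in:

    cpu_index_tuple_cost = 0.0025    # half the 0.005 default; nudges toward indexes
    cpu_operator_cost = 0.00125      # half the 0.0025 default
    default_statistics_target = 500  # big-data databases only; default is 100
    effective_io_concurrency = 4     # SSD / SAN / cloud storage
    replacement_sort_tuples = 0      # Postgres 10 only; the setting is gone in 11

And for individual hot columns, something like ALTER TABLE events ALTER COLUMN user_id SET STATISTICS 5000; (table and column hypothetical) goes further than the blanket default safely.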
Now, this next one is for one of your chronic problems in the world of Postgres; what are our common chronic bad use cases? You have a mixed-use database that has long-running analytical queries on it, and it also has a queue, a Postgres table-backed queue. It's a bad situation, because what queues do is constantly create and delete records, which means you want that queue table to be pretty much continuously vacuumed by the autovacuum daemon, because it will always have more dead rows than live ones. But a long-running analytical query, or any other long-running operation, will pin the row versions in that queue table so they can't be vacuumed. It becomes a real performance problem, because even though the queue only has three live rows in it, it has 300,000 dead rows, so it behaves like a table with 300,003 rows, not like the small table it should actually be. Our workaround for this, added in version 9.6, and people familiar with Oracle will recognize it, Oracle has the same problem and the same feature, is called old_snapshot_threshold. It says: I'm allowed to stop pinning row versions after this amount of time. And this actually solves the queue-plus-analytics problem, as long as the analytical query never touches the queue table. The minute the analytical query touches the queue table, it will get a "snapshot too old" error and terminate. It has to be set cluster-wide, as a time interval. I recommend setting it to something reasonable like an hour, or, if you're in a particularly pathological queuing situation, you might set it down to three or five minutes in order to keep that queue cycling. Then you just have to be careful that the application doesn't have long-running queries touching data that cycles really fast, because those may get cancelled by a "snapshot too old" error. But for that queuing issue, it really does help a lot. Questions about that performance tuning before we move on to a few other miscellaneous settings? Anything? You know all the performance stuff? Okay, some file security stuff. If you're running on bare metal or in a container, and your Postgres packaging doesn't handle this automatically, I know the packaging for Red Hat already relocates this stuff, but say you're compiling Postgres from source with the defaults: it drops the Postgres Unix socket in the temp directory, world-writable. There are a number of hacks that can take advantage of this, even if Postgres is running in a Linux container, because a lot of Linux containers treat the temp directory in a way that is less secure. So if you are installing Postgres in a way that does drop the socket in the default temp directory location, you actually want to relocate it and lock down its Unix permissions a little. Not a whole lot; just make it non-trivial to exploit. Obviously, if you have applications connecting over that Unix socket, other than a backup or something that runs as the postgres user, they're going to have to, for example, be in the postgres group. But it's a really good thing to do.
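Both of those, sketched; the socket path is the common Debian-style location, shown as an example:

    old_snapshot_threshold = 1h       # or a few minutes in pathological queue cases
    unix_socket_directories = '/var/run/postgresql'   # out of world-writable /tmp
    unix_socket_permissions = 0770    # connecting apps join the postgres group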
So one of the things we're going to talk about is freezing. Now, if you really want to read up on freezing, I have some blog posts from 2012, linked in the repo, that are still largely relevant. We've made some tremendous improvements to freezing, particularly the freeze map, which makes it much less costly to freeze your tables. But I still feel like the defaults for a few of these settings are way the heck too conservative about when to freeze stuff, and they create a kind of time-bomb problem, where large, infrequently updated tables end up waiting until you hit this do-or-die deadline called autovacuum_freeze_max_age. Then you get an uninterruptible autovacuum that generates a whole ton of IO, that you can't terminate, and that, even if you restart Postgres, just comes back. So what I recommend doing to avoid that is two things. Number one, you can raise that deadline slightly. Not a lot, because our hard limit is two billion: transaction IDs are 32-bit integers, and they wrap around. So our limit is two billion, at least until Postgres 11 or 12; we may move to transaction IDs that are effectively 64-bit, in which case this whole thing changes, but it doesn't change in Postgres 10. The default is 200 million. You can raise it to 500 million, which gives you a bit more time before that becomes a do-or-die deadline. It's not really safe to raise it further than that, except in an emergency, because you don't want that deadline at more than about a quarter of the hard limit. If you genuinely run out of the very last transaction ID, you literally cannot ever start that database, because anything you want to do requires a transaction, and we have no IDs left to allocate. You never want to get into that situation, so don't set this to two billion, even though you can. Then there's vacuum_freeze_table_age, which says: if I'm vacuuming a table anyway, wouldn't you like me to do a full scan so I can freeze a bunch of things? Always set that to 75 or 80% of whatever autovacuum_freeze_max_age is, because again, that full scan is an expensive IO operation; you really only want it to kick in if you were getting close to the deadline anyway. Alternatively, you can set it per session: if you have regular slow periods in your database activity, that's a good time to schedule a manual vacuum, with a really low vacuum_freeze_table_age, so it cleans things up proactively. My old company, PostgreSQL Experts, has a script up under the pgexperts GitHub account, I don't remember exactly what it's called, that helps automate doing this proactively. And then the other thing you should do is really lower the minimum age, vacuum_freeze_min_age: if I'm writing this data page anyway, how old does a transaction ID have to be before I freeze it? You want this as low as is reasonable. The right level is something like 15 minutes to an hour's worth of transaction ID consumption. If you're willing to, sit there at median application load and check: what's my transaction ID now, and what is it 15 minutes later, to get an idea of your transaction ID burn rate; that's what we should set it to. Try something like 50,000 to start; that will cause a lot of rows to get frozen before they're ever flushed to disk. And if you want to know why we don't do this by default, you can find a long argument between me and Tom Lane on pgsql-hackers from about four years ago. But again, if you want to know more about freezing, read my blog posts, et cetera. Hopefully in Postgres 11 or 12, a lot of this will go away, honestly. But right now it is something you might want to tweak.
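As a sketch of that advice in conf form:

    autovacuum_freeze_max_age = 500000000  # up from 200 million; don't go much beyond this
    vacuum_freeze_table_age = 400000000    # ~80% of the max age
    vacuum_freeze_min_age = 50000          # freeze rows young, while pages are still hot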
So, one other set of settings that I forgot to put in this file. And by the way, does somebody have my GitHub repo open in a browser? Anyone? I want you to file an issue right now saying: but what about the parallel query settings? Because I meant to include those in extras.conf and I forgot. So, this is looking at that database I distributed to everybody. There's a view called annotated_report that joins pg_settings to the table annotated, which holds my additional notes on every setting, and presents it in a spreadsheet-friendly format. I'm going to pick some columns out of that, and in particular I want to look at the parallel query settings. Not every worker setting in here is about parallel query; autovacuum_max_workers is the one I talked about earlier. So, there are a few settings for parallel query. force_parallel_mode is a query troubleshooting thing; you would only use it interactively. It's a "hey, if I were using parallel query, what would this query's execution look like?", for cases where the query planner is choosing not to parallelize. Usually, when you do that, you find out why parallelization was a really bad idea. But sometimes you need to know numerically why it was a bad idea. Now, the way parallel query works: Postgres is a multiprocess database, so in order to parallelize a query, we actually have to spin off extra processes. These processes are known as workers. And for resource management on the system, CPU resource management primarily, we give the administrator a number of ways to limit how many workers there are, and the default limits are quite low. (max_logical_replication_workers, in that list, has to do with replication, not parallel query.) So, one of those settings is max_parallel_workers, which says: across the whole system, what is the maximum number of concurrent parallel query background workers that can be executing at one time? If you are actually going to be actively using parallel query, because you are doing analytics on the database, the default of eight is not going to be high enough, unless you only have eight cores in your system to begin with; it's often not all that helpful to have more parallel workers than you have cores. Now, in a lot of cases when you're doing parallel query you're waiting on IO, so it can be helpful to have double the number of workers you have cores, because the system will cycle things, but not a lot more than that. And usually you want fewer than your core count by default, because you're assuming you'll have a bunch of other activity on the system that has nothing to do with parallel query. The case where this gets high is: this is an analytics database that is only used for analytics and reporting, I am managing the connections to it, so I never have more than one or two concurrent jobs at a time. In that case, I want to raise max_parallel_workers to double the number of cores I have, as the system-wide max. Then your second limit is called max_parallel_workers_per_gather. Currently there is only one parallel plan operation in Postgres, which is called a gather.
And the gather is either: we take this many fragments of a sort and collate them together, or this many fragments of a merge, or this many fragments of an aggregation, and bring them together. Those are all done through this one underlying operation called the gather, so right now that's the only thing that has a per-operation limit. If we add other types of parallel operations in the future, you may see a second limit appear here. But this is basically: what is the maximum number of parallel workers I can have working on a single gather? The default here is two. So I can only have two parallel workers on the same sort. Not all that useful if you actually need parallel query. This is one of those places where we compromised on two, when the sensible setting is either zero or something higher than two. Because if you need parallel query, you're going to need this to be, like, four or eight or something. And if you don't need parallel query, you probably don't want it happening by accident, because it's extra CPU load, right? So again, on an analytics database, you might raise this to half of whatever you raised max_parallel_workers to. (max_sync_workers_per_subscription, by the way, has to do with replication: it's for logical replication on the subscriber side, and lets you parallelize receiving the replication stream in order to keep up.) Then, max_worker_processes. This is the superset of max_parallel_workers: how many workers of any kind am I allowed in the system? Why is it different from parallel workers? Well, Postgres actually has a feature called custom background workers, where I can, for example, have a backup tool running as a process under the Postgres daemon. For example, there's an interesting tool that makes Postgres protocol-compatible with MongoDB, and it runs as a background worker, so you can connect to it on the Postgres server. Any of those are extra background workers that have nothing to do with parallel query. So if you have no custom background workers, set this to the same number as max_parallel_workers. But if you might load Postgres extensions that involve other background workers, set it higher, so that you always have headroom for the other workers even when parallel query is maxed out.
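A sketch for the dedicated-analytics case just described, assuming a 16-core box running one or two big jobs at a time:

    max_worker_processes = 32             # headroom for parallel plus any custom workers
    max_parallel_workers = 32             # ~2x cores, analytics-only box
    max_parallel_workers_per_gather = 16  # roughly half of max_parallel_workers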
If you're going to monkey with those cost parameters, I would say please share your experiments on the pgsql-performance mailing list, because other people are at the experimental stage too, and you will both get good information and provide good information to the rest of the community. Eventually that information will make it back into the Postgres defaults, so they will be set better in the future. Okay, we're at 12 o'clock. We have 30 minutes left in the session, during which I will cover a random selection of other settings. But before I do that, any questions on this group of settings? Parallel query and other things? Questions on how to manage PostgreSQL settings? No? Okay. I don't think I have anything left in the slides, so on to the grand tour. So again, in the GitHub repo there are a number of files. If you look at the annotated table in the database, or at that annotated.csv file in a spreadsheet program, you will notice that I have added a bunch of categories and subcategories for the GUCs that do not match the category column in pg_settings. That's because I feel the category column in pg_settings is out of date and does not have enough different categories. For one thing, you'll notice that in the official categories in pg_settings, an awful lot of random settings are thrown into "client connection defaults," which is basically the miscellaneous category. I hate miscellaneous categories. I was a database geek for many years, and if there's anything a database geek hates, it's a miscellaneous category. So I've come up with a different schema of categories and subcategories that I feel more completely divides up the setting space into the different kinds of settings there are. Obviously no categorization system is perfect, but I think it actually helps you a little bit. So we are going to look at the settings by category. One category I mentioned: file locations. In postgresql.conf you can actually define a number of file locations for where things are. One of those is super deceptive: config_file contains its own location, which means you can't actually change it from inside the file. But everything else you can. One of those is data_directory. How many people here are using Debian or Ubuntu? Debian and Ubuntu do a thing that I actually really like, which is that they put the configuration files in a separate directory from the Postgres data. This is the way I set up systems on my own, even on Red Hat, which puts them in the same directory by default, because the data and the config really should be under separate management for obvious reasons, right? Data is managed by replication and backup, et cetera; config is managed by configuration management systems. They should not be in the same directory. So what you do is put the configuration directory somewhere, and then set this data_directory parameter, if it's not already set by your installer, to a different location. Other settings are a little more special-purpose. External_pid_file, which was the one just above, is a way of writing an extra copy of the process ID of the Postgres master. The default one goes in PGDATA for Postgres' own monitoring of "is the master process for this instance still running?" But sometimes your system, like systemd, also wants to keep track of the PID, and it wants it in a different location.
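As an illustration, a Debian-style split between config and data might look like this in postgresql.conf; the paths here are hypothetical examples, not ones from the talk:

    # postgresql.conf lives under /etc; the data lives somewhere else entirely
    data_directory = '/var/lib/postgresql/10/main'
    external_pid_file = '/var/run/postgresql/10-main.pid'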
So if you look at a Red Hat install, external_pid_file is actually set, because we need that second copy of the PID for systemd. Db_user_namespace: never use it. It was a hack we added back around Postgres 8.4 and can't seem to get rid of. Anybody here using Kerberos authentication, or GSSAPI, which is the successor to Kerberos? There's a subset of settings associated with that if you are. I already told you about the SSL settings, where you do have to set a bunch of things. I really hope to the gods nobody's still using Bonjour. I don't understand why we still have parameters for it. I guess we have a bunch of people who are still running Postgres on Mac OS X 10.2, and for that reason they need Bonjour. Max_connections and port: you can change the Postgres port. Sometimes useful; in particular you need to do this if you're running multiple Postgreses on the same machine. Although the port is one of those things I often recommend managing by command-line parameter instead, because you may want it on a different port dynamically depending on how many Postgreses you're starting. Superuser_reserved_connections: this is one of your save-your-butt settings. If you've run out of connections on the server because you're having an application problem where there are lots of idle connections, and all 200 connections are used, how do you log into the database to find out what those 200 connections are doing, so you can selectively kick them off? By default, three of those connections are reserved for superuser connections only. But backups run as the superuser. You may have a partition management tool running as a superuser. You may have a monitoring tool running as a superuser. If you have all of those things, then you maybe need to up superuser_reserved_connections to five, so that you don't end up locked out. We went over Unix sockets already, and authentication_timeout. If you are a network geek, you can play with the TCP keepalive settings so you can have the lowest possible interval between an individual Postgres server going down and the other Postgres servers, replicas and masters and so on, knowing about it. But this requires you to actually know enough to play with it, because you will also need to adjust the TCP settings on your host system for it to make any sense. For parallel, I went through max_worker_processes. Dynamic_shared_memory_type: Postgres installers should deal with this automatically. Huge_pages: one of those goofy things; generally you can trust the default here. This is a Linux-specific thing. Usually you're going to be using huge pages, because you're going to be running Postgres on a system that has more than two gigabytes of RAM. But if you're actually running Postgres on a system with less than two gigabytes of RAM, because you're on a small Amazon instance for example, there might be some tiny performance gain in turning huge_pages off. Again, only do this if you're willing to actually check for a performance improvement in those circumstances. We went over all the memory settings. Max_prepared_transactions: oh, I have not updated this for Postgres 10. The default for this is now zero, as it always should have been, so you shouldn't need to touch it. This is if you are doing two-phase commit across multiple Postgres systems. If you need two-phase commit, you will know it, because you'll be using some tool, like a cluster management tool or distributed transaction manager, that requires it.
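A minimal sketch of the reserved-connections idea, with illustrative numbers:

    # keep a few slots free so an admin can still get in when the
    # application has eaten every ordinary connection
    max_connections = 200
    superuser_reserved_connections = 5   # default is 3

With this, once 195 ordinary connections are in use, only superusers can take the last five, so you can still log in and look at pg_stat_activity.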
Otherwise, max_prepared_transactions needs to stay at zero. Max_files_per_process is one of those things you adjust when you get a certain error in the logs, as in "exceeded max files per process, quitting." The reason you would touch it: imagine you actually have an 11-terabyte database where every table is broken up into 100 partitions. You can pretty easily exceed touching a hundred files in an individual transaction if that happens. Shared_preload_libraries: this is primarily for preloading Postgres extensions. Some extensions require it; for example, the Citus extension, which lets you shard Postgres, requires loading Citus as a shared preload library, and then you have to restart Postgres. It's a list. Temp_file_limit we already mentioned. WAL and checkpoints: min_wal_size and max_wal_size are our new, easier way to govern the transaction log. Usually the minimum of 80 megabytes and the maximum of a gigabyte, with Postgres dynamically allocating between those, works great. Most people don't need to touch that. However, there are a couple of reasons you would. One is if you are literally writing more than a gigabyte of transaction logs per minute; then you might want to increase the maximum. Sometimes that happens because you have a whole ton of transactions, you're doing 10,000 transactions per second, and that rate adds up to more than a gigabyte of transaction logs per minute. Sometimes it happens because you're doing massive bulk loads, and during the bulk load period you write a whole ton of data. In either case, you might want to up the maximum size. The reason to up the minimum size is if your workload is super bursty. If you have several minutes where nothing is written, then all of a sudden you write 160 gigabytes of data, and then nothing again for half an hour, you actually want to increase the minimum. Otherwise what happens is the transaction log manager, during your idle periods, shrinks the transaction log down to the minimum, to 80 megabytes, and then has to grow it again as it runs out of space; and while it's growing, it's recycling the transaction log a whole bunch, which slows down your write rate. Wal_compression we went over; archive_command and archive_mode we went over. Those flush settings, I mentioned, are new, and I don't have any tuning information about them; if you want to mess with those, which you might if you're having all kinds of IO bottlenecks, hop on pgsql-performance. Full_page_writes has to be on for replication. If you're not doing replication, you could conceivably turn it off to lower IO, but then you're increasing your danger of database corruption, so just never turn it off. Synchronous_commit we went over. Wal_sync_method gets set automatically at install time; you pretty much never have to touch it.
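Going back to min_wal_size and max_wal_size for a moment, a sketch of what the bursty bulk-load tuning just described might look like; the sizes are illustrative, not recommendations:

    # bursty bulk-load workload (illustrative values)
    min_wal_size = 2GB    # don't let the WAL shrink away during idle periods
    max_wal_size = 16GB   # avoid constant log recycling mid-burst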
Now, wal_writer_delay. There is a delay setting that's a weird one for people who actually, deliberately, want their replicas to lag behind, which is lots of fun; but no, I'm sorry, that's the wrong delay, that's the receiver-side delay. Wal_writer_delay is there to help the WAL buffer work better for replication, and the default of 200 is 200 milliseconds, so you shouldn't really have to touch it. So, a related topic. Postgres, like other IO-constrained transactional databases, spends a lot of effort on various methods of what's known as group commit. Group commit says: hey, if I've got these six writes in different sessions that were going to happen at about the same time anyway, within say 100 microseconds of each other, it would be really nice to send them to the operating system as one big batch write, so that if some of them hit the same disk pages, it's one write instead of two. Plus we can minimize seeking that way, because if you send them as a batch, the IO scheduler in the operating system can do a better job of collating them for efficient use of your IO resources. Our first attempt at this was something called commit_delay and commit_siblings. I feel that commit_delay and commit_siblings have been completely superseded by the flush settings, by the checkpoint management settings, and by asynchronous commit, and thus they're absolutely not useful now. But a few people are so attached to these settings that every time anybody proposes removing them, they say, "no, I still use those." So I recommend you leave them at the default, which is to not use them. Checkpoint_completion_target says: on a scale of 0.0 to 1.0, how much of my checkpoint timing interval am I allowed to use for the flush? A value of 1.0 is 100% of it, so the writes gradually sync over the entire five or ten minutes of the checkpoint time, which is really what you want in order to minimize IO impact. Checkpoint_timeout is set in seconds, and the default is five minutes, which works for most people. Sometimes, though, if you're doing a lot of bulk loading and you have a lot of memory, it can be helpful to increase it, because you want Postgres doing its own checkpoints before the OS runs out of filesystem cache and starts forcing a flush. The OS tends to be a lot less efficient about flushing dirty data out of memory, and tends to do things like stall your entire IO queue while it's flushing. You want Postgres to be doing the flushing, not the OS. But if you're going to increase this timeout, you have to be reasonably sure you have enough memory to hold all those dirty pages, so you're not triggering an OS flush.
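A hedged sketch of the bulk-loading checkpoint variant just described, with illustrative values:

    # bulk-loading box with plenty of RAM (illustrative values)
    checkpoint_timeout = 30min           # let Postgres checkpoint before the OS forces a flush
    checkpoint_completion_target = 0.9   # spread checkpoint writes across most of the interval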
The background writer is something that goes through Postgres' dedicated cache, the shared_buffers cache, and flushes data out of it in order to avoid running out of space in shared_buffers. There are a number of settings for it, and those settings have been tuned over a number of Postgres releases to the point where I actually don't have good general recommendations for tinkering with them. If you're tinkering with them today, you're addressing a specific problem, and diagnosing that problem means loading the pg_buffercache extension and looking at your actual buffer cache behavior to figure out where the problem is. Otherwise you're not going to touch these, and I don't have general advice for when you would. Oh, except under one circumstance: if you're running an ephemeral Postgres node, you can adjust the bgwriter settings, for example setting bgwriter_lru_maxpages to zero, to keep the background writer from running at all. But that's an ephemeral-node setting. The replication settings we went over. There are also a bunch of things for tuning the network activity of replication; this is, again, very fine-grained stuff. The ones I didn't go over are mostly "how often should the standby poll the master," et cetera. If you're going to mess with these, you're addressing a specific network or timeout situation, so there aren't general recommendations; the settings are there if you need them. I did mention that there are other cost constants you can set. One that people mess with sometimes is seq_page_cost: the cost of fetching a single page in a sequential table scan. That is the base number that all of the other cost numbers are multiples of. The main reason people change it is that they are trying to get the cost number the query planner reports in an EXPLAIN to match up with time in milliseconds. I think this is a fool's errand, because it never actually works, and in the meantime you're making your costs look all weird. But if you inherit a Postgres database and somebody has changed seq_page_cost, that was probably what they were doing. Now, there are a whole bunch of enable_ settings, and these are for interactive query tuning. You do an EXPLAIN; the query is really slow; you see it's doing a huge merge join. You say: hey, I wonder, if that were not a merge join, if it were a hash join or a nested loop instead, would it have been faster? So you set enable_mergejoin = off in your session and find out. If you are setting these enable_ settings in the postgresql.conf file, you have done something wrong, and it's probably an indication that something else is really out of whack in your database, like effective_cache_size is set to one megabyte, or some statistic is grossly bad, and somebody is trying to paper over it with the enable_ settings. GEQO is the genetic query optimizer, for queries that are too complicated to plan exhaustively; I'm just not going to get into that, because we only have ten minutes left. I have gone over the parallel settings. And then we've got a whole bunch of other cost settings; again, you need an education on the cost-based planner, and to look at a lot of your queries, to know how to set these.
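To illustrate the interactive enable_* workflow he describes, session-local only, never in postgresql.conf; the table names here are hypothetical:

    -- hypothetical tables; a session-local experiment only
    EXPLAIN ANALYZE SELECT * FROM orders JOIN customers USING (customer_id);
    SET enable_mergejoin = off;   -- steer the planner away from the merge join
    EXPLAIN ANALYZE SELECT * FROM orders JOIN customers USING (customer_id);
    RESET enable_mergejoin;       -- put the planner back to normal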
One setting that is actually kind of interesting is constraint_exclusion; this one actually enables something. Constraint_exclusion says: if I am scanning a whole bunch of tables that are going to be appended to each other, can I drop individual tables from that scan if I don't think they have any rows that match the query? The main time you do that is when the table is partitioned. One of the big optimizations for a query against a partitioned table is to drop partitions out of the scan if the planner doesn't think those partitions have any matching rows, based on the constraints defined on each partition segment; that's why it's called constraint exclusion. For that reason, this is set to 'partition' by default, which is what most people want. If you set it to 'on', then it also applies to UNION and UNION ALL queries, which, if you do a lot of UNION, UNION ALL, INTERSECT, and EXCEPT queries, can actually be a useful performance optimization. Otherwise, leave it at the default. Oh, and if you think there's some sort of weird error in constraint exclusion, you can obviously try turning it off interactively and see what happens. From_collapse_limit and join_collapse_limit are the opposite of genetic query optimization. Part of exhaustively planning a query is saying: hey, I'm joining seven tables, and I can join them in any order. The problem is that this isn't unlimited: choosing a join order is a traveling-salesman-style problem. So if you said, hey, I have 130 tables and I can join them in any order, well, in that case you'd better be in line to check out that quantum computer IBM has, because otherwise that query plan is never going to finish. I actually feel the default limit is kind of low; based on current processor speeds, it could be 12, but not a lot higher than that. And frankly, most people are not doing more than eight-table joins anyway. But if you do a lot of analytics queries, consider raising this to something like 12; 20 is going to be too high. Let's see, anything else that's interesting, because we only have about five minutes left. Stats_temp_directory: this is actually a useful performance optimization, and I may throw it into extra.conf in the future. It has external dependencies, though. Postgres maintains a buffer file of changes to table stats, like how many rows you have and how many updates have been made and that sort of thing, in order to batch writes to the permanent stats file; otherwise writes to the permanent stats file would become an IO problem. Normally this file lives in the PGDATA directory itself, which means that periodically the operating system flushes it to disk when it accumulates a certain number of dirty pages. If you have a very high write-rate database, you can actually improve performance by moving that file to a tmpfs, because if it's on a tmpfs, the operating system will never flush its dirty pages to disk. And that doesn't matter: it's disposable data, it's just a buffer. If Postgres crashes and you lose it, you lose the last ten minutes of updates to the Postgres stats. So if you are in a high-concurrency, high-write-rate situation, do consider mounting a tmpfs and changing this variable to tell Postgres to write the stats file there. Okay. Track_functions collects statistics about the execution of Postgres functions, stored procedures. If you use a lot of stored procedures, you might actually want to turn it on; and by "turn it on," I mean set it to 'pl', which means user stored procedures, as opposed to the functions built into Postgres as part of the system. Because if you set it to 'all', it will track every function execution, including Postgres system functions, which generates way the hell too much output.
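A sketch of the tmpfs trick and the track_functions setting together; the mount point is a hypothetical example:

    # /etc/fstab (hypothetical): a small RAM-backed mount for the stats file
    #   tmpfs  /var/run/pg_stats_tmp  tmpfs  size=64M  0  0
    # postgresql.conf:
    stats_temp_directory = '/var/run/pg_stats_tmp'
    track_functions = pl    # track user-defined (PL) functions, not system ones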
Then there are the autovacuum tuning parameters. I have actually tuned these for cases where the defaults feel a little bit conservative, et cetera, but you're not going to deal with them most of the time, particularly as Postgres improves: autovacuum needs to do less and less work, so tuning it becomes less and less important. But the next settings here are interesting: autovacuum_max_workers and autovacuum_naptime. These two go together. Like I said, some people have situations where they literally have thousands of tables. A common cause is a database backing a multi-tenant application in which every tenant gets their own set of tables; that common pattern can result in you having 300,000 tables. The maximum I've ever seen in production was two million tables. Although Álvaro Herrera demonstrated that you can't actually do a billion tables in Postgres; the practical limit is lower than that, because you run out of file handles on the operating system. But if you have a lot of tables, and by a lot I mean hundreds, I mean thousands, then you're going to need a lot more autovacuum worker spawning. So you might raise autovacuum_max_workers to five or ten or twelve, also depending on the number of cores you have, because these are parallel processes, right? The other thing you might do: normally these autovacuum workers do some work, they vacuum part of a table, and then they go to sleep, and how long they sleep is defined by autovacuum_naptime. If you find that autovacuum is not keeping up because you have a lot of tables, it can actually be better to decrease autovacuum_naptime down to, say, five seconds than to raise the number of parallel workers. Particularly because the parallel workers tend to all start at about the same time: if you have a naptime of 60 seconds and the vacuum runs are short, what can happen is suddenly you've got eight parallel workers, then for 40 seconds none, then for 20 seconds eight workers, then for 40 seconds none again. That isn't a good use of resources.
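A sketch of the many-tables tuning just described, with illustrative values; note that he suggests shortening the naptime before piling on more workers:

    # hypothetical multi-tenant database with tens of thousands of tables
    autovacuum_max_workers = 10   # more daemons, sized to the cores available
    autovacuum_naptime = 5s       # wake often, instead of bursting all workers at once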
And that is all we're going to get through of the extended settings, but I think I covered about 75 or 80 of the settings here, and I covered all the rest in writing in the GitHub repo that you now have. So we're going to take five minutes for any other questions, and then break for lunch, and then this thing is over. Go ahead. The question was: is there specific functionality that huge pages provide that transparent huge pages don't? I will admit that I've never really understood the difference between huge pages and transparent huge pages, so I don't have an answer to that. I don't think Robert Haas is at this event; I'd say ask somebody else in the Postgres room, or we could bug Bruce Momjian. So I don't know the answer. Obviously somebody thinks there's a benefit, or we wouldn't still have the setting, it being a Linux-only setting, but I don't know what the specific benefit is. Any other questions? Nothing else? Nobody's looking at their postgresql.conf and saying, "what about this setting?" Or is this too much of a brain dump to keep up with? Okay, yeah, go ahead. Yes, yes. So there's my contact information. Whoa, somebody just leaned against the lights. Okay, there's my contact information, and that includes, again, that GitHub repo, the annotated.conf repo. Clone it, or download it as a zip file or whatever, and of course you can get in contact with me. Shortly after this conference I'll be making a post to Database Soup about the updated annotated.conf. I will probably add a couple of things based on this session: I want to add the parallel query settings, and I want to add the stats_temp_directory setting, because I forgot to put that in the extras as well. So there will be another commit on that repo soon, and you can just pull a fresh copy. I do work for Red Hat; I'm on the community team at Red Hat, where I work on Kubernetes, which is my thing now. Red Hat is always hiring, so if you're interested in working on lots of things, particularly in the container arena these days, go to Red Hat Jobs; otherwise go to the Red Hat community pages to see everything we're doing in upstream open source projects, which is a lot, because everything at Red Hat is open source. So thank you very much. Yes, yes they are. They are OpenOffice format; well, LibreOffice format. So, we'll get started again in a minute or two, now that we're back from lunch. Can I speak to the room host? Is he here? All right, good afternoon, everyone. My name is Bruce Momjian, and I'm raring to go. I am one of the Postgres core team members and have been working with Postgres since 1996. I work for EnterpriseDB, one of the Postgres support-slash-product companies, and I'm excited to be presenting a tutorial about common table expressions for the first half and window functions for the second half. This is a three-hour tutorial; it goes to 4:30. We will take a break in the middle, probably between the two presentations, twenty minutes or so, to give you a little chance to decompress as we move from one topic to another. The reason I'm excited about this talk is that it takes probably the two most powerful sort of advanced SQL features and presents them in a single session. This is an excellent talk for people doing a lot of development work using databases: application development, data warehousing, analytics, that type of thing. It gives you a good launching pad for understanding how to make the best use of your SQL database. Now, all the examples I'll show are run on Postgres; we've had these features for several years, probably five or six. I was told this morning that MySQL implements a lot of these features as well. In fact, Dave, can you address that at all? Okay, I'll just repeat what he said: window functions and CTEs are both in MySQL version 8. Great. So even though I use Postgres for the examples, what I'm showing you should work in both databases. And in fact, these are SQL-standard features, so if you're using any database that implements the SQL standard for these particular options, you should be able to use them very capably. Yes, sir? Thank you. Thank you, Dave. So, again, this is really two talks, both with a similar theme of making the most of SQL and the standards that have been adopted in recent years. The slides you're going to see are actually on my website; this website right here is a place you can download the presentations if you have a desire to do so. Let me see if I can focus this a little better. It's a manual focus, I guess. Oh, that's just the zoom. Okay, anyway.
That's probably as good as we're going to get. But again, feel free to download this presentation, as well as the other one. There are really 20 or 30 presentations on that website related to Postgres, plus some security talks, and many of them have videos. So if you find something interesting, you may find there's a video that goes with it, and you can hear me droning on and on about those particular features. Again, we will take a break between the two presentations, and again, this is ideal for developers. Now, there are two ways I can present these talks, and I'm going to take a little poll of the audience to decide. One way is for me to be speaking and you to do Q&A: I answer questions about how certain things are handled or how I would implement certain things. In that case, we would talk about some slides, take questions as we go, talk about some more slides, take questions as we go; we wouldn't take any breaks for you to actually practice these features. Okay, so that's the first option: mostly Q&A, no breaks for hands-on. The second option is that we take breaks after each section, maybe five or ten minutes, to give you a chance to actually practice the features you've just seen. And of course, because these slides are already online, you can almost cut and paste the examples into psql or pgAdmin to practice. So, the question I have for the audience, and I'll look for a raise of hands: how many of you would like no breaks for hands-on, all Q&A? Okay, that's about ten. How many of you would like to take breaks so you can practice? Okay, I think the first group won, by a slight margin; sorry about that. What we will do is, that 20-minute break in the middle will be your chance to practice the first section, and of course I'll be here at the end; if you want to take another 20 minutes after to practice, I'll be glad to stick around for that. So that's how we're going to work it. I've done this poll twice before: one group chose one option almost universally, and the other group chose the other almost universally; it has to do with the type of audience you have. Again, this is a tutorial, so any questions you want to ask, feel free: yell out, raise your hand, whatever you want to do. Does anyone have any questions before we get started? You can't find the presentation. Okay, so if you go here, this is actually under SQL; there's an SQL sub-tab under Presentations. That's exactly what it says up here. Right, okay. And the reason for the confusion, he's right, is that when I took the two presentations and combined them into the tutorial, I couldn't use the full titles of the common table expression talk and the window function talk together; it would just be too long. In fact, if you look at the screen, the title doesn't fit; it just trails off into dots because it's too long. So that's why it got compressed. Other questions? That's a great question. Okay, I do think you're going to get a lot out of this. I really enjoy these talks.
These two topics are really close to my heart in terms of empowering developers, because a lot of developers have a tendency to look at databases as a key-value store, or just as a data store. When they approach a database, they're basically thinking: okay, I'm going to put the data in, I'm going to get it out, and everything I want to do, I'm going to do in my application. That is the way we did it in the 80s and 90s. It is no longer the most efficient way today, but that might be the way you did it in the past, or the way you were taught in college, or you might have worked for an organization that did it that way. And frankly, if you're using a relational database as just a data store, you're really not getting the full benefit out of the database. The reason is that these databases, using the current SQL standard, are really giving you a whole tool set of things that make your life easier as an application developer, okay? It doesn't make sense for you to re-implement a whole bunch of things in an application, and perhaps implement them over and over again in different languages and different applications, when you can structure what you want as an SQL query and give your problem to the database. Right? Give your problem to the database, let the database solve your problem and give you the results you want. And that's the major theme I'm going to be talking about, not only with common table expressions but with window functions: the idea that you can give your problems to the database, expressed as SQL in a way that it understands, and let it give you the answer you want. That makes you more efficient, that makes your application less bug-prone, that makes your application smaller, that allows you to get more work done. And I think these two topics really work well for that. So be prepared, as we go forward, to say "oh, I didn't know I could do that," or "oh, I've already done that in my application; now I don't need to do it anymore, I can rip out all that code, express it in SQL, give it to the database engine, let it give me the results I want, and make my life easier." Right? The reason relational databases have been around so long is not because everyone likes to spell out the word BETWEEN or SELECT; some of the keywords are really long, and nobody likes that. But it is a very expressive language, and it's a declarative language, which I'll talk about in a minute, and that allows application developers to be very efficient. And that's really where you distinguish yourself as an application developer: between someone who's just a grunt who's going to throw data into the database, pull it out, and then go through all sorts of gymnastics in the application, and somebody who really uses the database for what it's good for and gets things done well, okay? I think that is a real area where we're lacking, and I think these two talks are going to help with that. Let's get started. So the first thing we're going to talk about is the concept of imperative versus declarative; that's a general concept that I think is interesting. Then we'll talk about the syntax for common table expressions. Then we're going to go sort of up the ramp, it gets harder, and we get to recursive common table expressions.
This is where the SQL starts to be programmatic, where it allows you to push a lot of your work from the client application into the server, okay? Then section four is where we hit cruise control, and we go through a whole bunch of examples using common table expressions, showing you how to solve a whole bunch of problems with this feature. And finally, section five, writable common table expressions. This part is probably specific to Postgres, but it allows you, as an application programmer, to bundle a whole bunch of queries together and issue them as one statement. I'll talk about why that's valuable, and again, I think you'll be impressed. Any questions? Great. Okay, first: imperative versus declarative. We have a gentleman with a baton here, instructing the quartet how to play the music. Imperative programming languages are languages you're probably familiar with: C, Java, Perl, Python, JavaScript. These are languages where you tell the computer what to do. You are instructing the computer, conducting the computer: go do this loop, assign this value, here's the if statement, do this other thing. You're telling the computer what to do. This is the way most applications are written, in imperative languages, and that's the methodology you're normally thinking in as an application programmer: I need to do X, Y, Z; how do I get that done? However, SQL is not an imperative language; it's a declarative language. What is a declarative language? A declarative language is one where you tell the computer what you want, and the computer decides how to give you the answer in the most efficient way. Doesn't that sound great? All of a sudden, you're not having to go through every little if statement and every little conditional and call your functions and all that other stuff. You're saying: I want this; you figure it out. There's a whole bunch of reasons that's great, particularly because the system knows the data layout, and the system knows how common the particular constants you use in your query are. I'm not going to get into that here; I have an optimizer talk that explains it. But the database server knows a whole lot more about the data than your application does, and it is able, in many cases, to give you your results much faster than if you took a baton out and tried to instruct the server in what to do. That is why SQL is so powerful, that's why it's been around for so many years, and that's why it's probably going to be around for another 30, 40, 50 years at this point. So, here are typical imperative languages. The first example is in BASIC, I'm showing my age here; that's what we used to do. If you ever want a program that just says "hello" over and over again, that's what that does. There's an example in C, there's an example in Perl. You're telling the computer what to do: in this case, say hello over and over again, okay? In a declarative language, you're just saying: give me hello, and give me another hello, and so forth. It's not easy to do an infinite loop in simple SQL; common table expressions make that possible, and I'll be showing you that in a minute. So just be aware: applications are typically written in imperative languages, okay? Client applications typically use imperative languages. Server-side programming is typically done in imperative languages.
And the one place where you can do imperative-style programming in a database is with common table expressions. The reason common table expressions are a new way of doing SQL is that they take a declarative language, which is SQL, and give you an imperative way of controlling it. So imperative is not bad; it's just that in a lot of cases, declarative is easier for you as an application developer. But there are a bunch of cases where you need to imperatively control the database, and that's what common table expressions give you. It's a way of almost taking SQL and programming it, similar to how you'd program in a procedural language. And I will show you a whole bunch of examples where it is tremendously powerful to be able to imperatively control SQL, which by nature, by default, is a declarative language. Any questions? Okay, great. Let's look at the syntax. This is the common table expression syntax; it looks very simple. Effectively, the word that always starts a common table expression is WITH. The word WITH basically indicates: I'm about to start a common table expression. Then there's an optional keyword called RECURSIVE; I'll talk about what that means in a minute, okay? Then we have a query name that we supply, then an optional column-name list over here, then the word AS, and then, effectively, a SELECT inside the common table expression. I realize it doesn't make sense yet; give me a minute, I'll show you some really good examples. And once you're done with all that WITH stuff, there's a final SELECT down here, which is your main query. So every common table expression you're ever going to see always starts with the keyword WITH; it has one or more selects inside; and then there's always one at the end, which we'll call the main one. You're going to have some at the top, and then a main one at the bottom. That's always the structure, okay? Any questions? Okay, this slide. These queries are going to get very complicated, so keep your eye on the red parts. If you see a query and you're wondering where you should be looking, look for the red part, okay? That's what that slide's for. Okay, so here, we've got a red part right there, see? So now you know: that's where to look. So here is our first SQL common table expression. It starts with the word WITH, remember, I told you that. It has a query inside the WITH, and then it has a main query here at the end. It follows our syntax: the keyword WITH, one or more of these selects inside, and then a main query at the end. Now, what can we learn from this? Well, first off, we've named this one "source." This common table expression right here is called source, and I'm referencing that same name down here in my main select query. This is a pattern you're going to see over and over again, okay? You have a common table expression, you give it a name, and then in your main query you reference that name. Now, this is a completely contrived example; you would never do this, but it is helpful for seeing what it does. You have SELECT 1 as the subquery, okay? So a row with a one in it, called source. And my main query, what is it doing? It's selecting from source, and what do I get out? One, because that's indeed what my subquery said. If I had a two here, a two would come out there, okay?
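A minimal sketch of that first example, in the shape being described here:

    WITH source AS (
        SELECT 1
    )
    SELECT * FROM source;   -- returns the single row: 1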
Again, we're going to start really slow and then work up, so don't think this is a baby class, because by the time you're done, your brain is going to explode, right? But right now we've got to start slow so we can get to the complicated parts, and the window function talk is the same way. For the first 30 slides you're going to be thinking, sure, everybody gets this, and then all of a sudden it starts to be, whoa, where did that come from? And by the end you're like, wow, that was crazy. I've heard that many times. In fact, a lot of people have said, "we thought we knew these features until we saw these talks, and now we actually know them." Another thing I should point out is that this talk and the window function talk were both written first as SQL: I wrote the SQL I wanted to show, I ran it, and then I copied it into the presentation. So this URL right here is actually the SQL from this presentation. If you download that SQL and just run it in psql or in pgAdmin, you'll see the whole presentation fly by on your screen, okay? Just hit the pause button; that's the entire presentation right there. And again, if you go to the website and download the slides, everything in pink is clickable: click on it in the PDF and it opens a browser to that URL, so you don't even need to copy it, all right? Any questions? Okay, here's a new one. Same structure: it starts with WITH, has the outer SELECT, uses the same name, but what I did here was label the column. Do you remember, I'm going to go back to the previous slide, do you see how this has that weird "?column?" unknown-column heading? It's saying: hey, you didn't give it a label, so I'm just going to give it this question-mark thing, okay? But over here, I actually named the column, so now when I run it I get a label, and I can actually reference source.col1, and it gives me the value. I can't do that in the earlier one; I can't reference the column by name, because I didn't give it one. I just get that amorphous question mark, okay? But this is normally the way you would do it: you label your columns so you can reference them in the main query. Here's a new one: instead of putting the label in the select, I put it up here. You might remember this from the syntax: instead of putting the label in the SELECT, I put it in the common table expression's column list. Either way, it doesn't matter, I get the same result, right? Make sense? Third one: we're going to push every labeling feature we can, okay? We're going to label it col1 in the query inside the common table expression, we're going to make the common table expression's column col2, and then in the outer query we're going to call it col3. Notice I had to reference col2 here, because that's the CTE's column name, and then it comes out as col3, okay? So I'm going from col1 to col2 to col3: same data, I'm just labeling it three different times. All right? I think it's illustrative. Again, we start really slow, okay? But we're going to get to some really crazy stuff. If you don't see crazy stuff by the end, I want you to tell me.
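A sketch of that triple-labeling example as described: the CTE's column list overrides the inner label, and the outer query aliases it once more:

    WITH source (col2) AS (
        SELECT 1 AS col1     -- inner label, hidden by the CTE column list
    )
    SELECT col2 AS col3      -- must reference col2; the output header is col3
    FROM source;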
Now, we can do more than one column: we can do SELECT 1, 2, and now we get two columns, right? We're not limited to one column; two columns work just fine, all right? Any questions so far? Okay, I know, it's slow; we'll get there, okay? I wanted to give you a little refresher on UNION, because UNION ends up being used quite a bit with common table expressions. You'll see why in a couple of minutes, but UNION is kind of our glue for binding some of this stuff together, all right? UNION has been around for a long time; it just happens to be very convenient to use with common table expressions, okay? And what I want to remind you of is the handling of UNION. I think this is unfortunate but probably logical behavior: when you use UNION on its own, it will automatically remove any duplicates. I'm sure the SQL standard committee had wonderful reasons for this, but it is kind of annoying. Here, I'm selecting a one and a one, and I only get one row out. If you use UNION ALL, you don't get the duplicate removal. So just a reminder: every time I type the word UNION, I have to think, do I want duplicates or not? And the big complaint I've heard is that normally in SQL you add keywords to add features; here, you have to add a keyword to remove functionality. You have to add the keyword ALL to remove the duplicate removal, which is not super logical. I understand why the SQL standard did it that way, but practically, that's what it comes down to. Just be aware of it. In most cases, when I'm writing SQL, I want UNION ALL. When I was an application developer, I remember one or two times when I'd be called in and told, this report's not showing the right number; I'd look at it and come back two hours later and say, I'm sorry: we had two customers with the same amount, the duplicates got removed, and the total came out wrong. Imagine, just because two amounts were the same. So anyway, be aware: UNION ALL is usually the one you want. Oh, so here's a new one. I mentioned that you can have more than one common table expression, and this is our first example of that. We now have a select that returns one, two, and a second select that returns three, four. The first one is called source, just like the previous example; this one is called source2, okay? So we have SELECT * FROM source, UNION ALL, SELECT * FROM source2. I didn't really need the ALL here, but I used it because I didn't want duplicate removal; more of a note to myself. And I get one, two and three, four, okay? So I have now taken two common table expressions and pasted them together with UNION ALL, all right?
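A sketch of the UNION refresher and the two-CTE example together:

    SELECT 1 UNION SELECT 1;        -- one row: plain UNION removes duplicates
    SELECT 1 UNION ALL SELECT 1;    -- two rows: ALL keeps them

    WITH source AS (SELECT 1, 2),
         source2 AS (SELECT 3, 4)
    SELECT * FROM source
    UNION ALL
    SELECT * FROM source2;          -- rows (1,2) and (3,4)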
I can also use common table expressions with actual tables; I've only used constants up to this point. So here I'm selecting the language name and language owner from pg_language, and I'm doing a join here, okay? Then I say SELECT * FROM source, and I get "internal" and the others. Basically this is completely useless: I'm taking a query, giving it a name, and then just selecting the data. Not doing anything. But hey, it's a good illustration. And this next one is the first one that actually does something with common table expressions, praise God. We have a query here, we call it source, we're doing our join, we're getting the language names. I'm saying SELECT * FROM source, UNION ALL, and I do need the ALL here, then a second select from source returning a language-name value and a NULL. So the first four rows are these, ordered by language name, and then I have an additional row that I didn't have before, all right? I don't think you'd ever want to do this exact thing, but if you did, this is how it would make sense, right? And you can see the point: your query could be huge; this is just a baby query. I could have a huge query, and now I can do different analytics on it and return different rows, different types of data from it, and I'm only running that query once. That's one of the key things about the way Postgres implements this: the top query is fully run before we do the bottom part, okay? So if you've ever had a case where you have to take a big query and run it one way to get one type of answer and another way to get a different type of answer, this is your solution: put the big query in a common table expression, and then you can do all sorts of manipulation on it, okay? Here's another one. Here is a select from pg_class, and I'm calling it class, and now I'm going to join that result to pg_attribute and then return five results, okay? This is the first case where I'm joining to a common table expression as though it were a real table, even though it isn't. Okay, another refresher: the CASE statement. I'm not sure how many of you have used CASE statements before, but a CASE statement is basically a conditional that you use in SQL. For example, this CASE statement right here is going to print positive, zero, or negative depending on the value of col. It's a contrived example, but again, it's like an if statement for you.
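A sketch of the CASE conditional being described, run over a few throwaway values:

    SELECT col,
           CASE WHEN col > 0 THEN 'positive'
                WHEN col = 0 THEN 'zero'
                ELSE 'negative'
           END AS sign
    FROM (VALUES (5), (0), (-5)) AS t(col);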
Yes, sir? Yeah, so this is an excellent question from this gentleman. He's saying: we have a LIMIT 5 down here, but we don't have any limit up there. So do you actually have to execute that entire query, instantiated in memory or spilled to disk, and then do the join down here? The answer is yes. The common table expression is what we call an optimization barrier, which is actually good in some cases and bad in others. There are reasons we need an optimization barrier for some of the later features. We've talked about removing the barrier, which would allow parts of the query to move around; we haven't implemented that yet, but I think one day we will. We've debated it a couple of times, and I think we're getting closer. But the optimization barrier, in this case, is exactly the problem. And in fact, the operational issue is: you do not know... well, here I do know the rows are all going to join, so effectively I could probably move the LIMIT up there as well. But if I didn't know they were all going to join, and I put LIMIT 5 up there and only four of the rows joined, it would only return four rows. So yeah, this is a problem; great question. Another thing people have asked about a lot: is there any overhead to these common table expressions? Not really. Once you realize they're an optimization barrier, there's not a whole lot of downside to them. Same with window functions: there's really not a lot of performance impact or a reason to avoid using them, except, again, for the optimization barrier issue. Just be aware of it. Okay, other questions? Yes, sir. So, what is the difference between a view and a common table expression? Great question, and we can illustrate it right here. If class were a view... well, let me back up. Queries in Postgres go through four stages: parser, rewriter, optimizer, executor. In the rewriter, the second stage, we take views and expand them into the places where they're referenced. So what would effectively happen here is that the view's query would get expanded, like a macro, right at that spot, and we would be able to optimize, and remove and move around WHERE clauses and things like that, in this particular case. So I would definitely argue that a view is going to be much, much better in terms of optimization here than trying to do it with a common table expression. Great question. In fact, people often use common table expressions precisely when they want the optimization barrier: they may want, operationally, that one query to run before the rest of the query. But if it's just a case of "I want to encapsulate this thing," you're better off using a view, because with a view we don't have the optimization barrier, we can move stuff around, and we get much better query plans. Great question. Other questions? So again, back to this conditional here, the CASE statement: it's basically like an if statement, and you're going to see some examples of it. Okay, any questions? I actually just added this slide; if anyone wants to explain during the break why this slide is about recursion... it's actually a perfect recursion example, yes. So the question is: would we conclude that views are better only because we haven't yet seen any complicated uses? The answer is no: in every case, a view gives the optimizer more opportunities to move things around. The reason you're going to use a common table expression, and you haven't seen yet why it would be useful, right, you're going to see it in the coming examples, is that it gives us imperative control. We're going to see, from here on, a whole bunch of cases where that imperative control is crucial to getting done what we need to get done. Up until this point, you're absolutely right: there has been absolutely no reason to use common table expressions in any example I've given you. Thank you, very good. Other questions? Great, okay. Okay, so: recursive. This is where we start going up the hill, as I said at the beginning. A recursive common table expression is one where you are potentially executing the SQL multiple times. You don't have to, but you potentially are. Now, in this first case, you don't: I've added the keyword RECURSIVE here, you might remember it from the syntax I showed you earlier, but effectively it's the same query, right? WITH RECURSIVE source AS SELECT 1, and we return one. Not very powerful, okay? Here is where we start to go way up. This is where we actually see a common table expression that effectively goes into an infinite loop. Now, I don't recommend you write common table expressions that go into infinite loops; there's probably little value in that, okay? But you are going to see a whole bunch of cases where the ability to do looping in SQL, in an imperative way, becomes very powerful. Is the mic working? Is everyone hearing back there? You're good?
Okay, the perfect mic is the one that I can't hear, right? I was just making sure everyone can hear clearly. Okay, so what I actually did here was say SET statement_timeout = '1s', so that the query would cancel itself. And I used WITH, and the word RECURSIVE, and this is the structure we're going to see over and over again with recursion. There's this sense that you have to prime the recursion, okay? In this case, I'm selecting one; that's my seed to prime it. Then I have a UNION ALL, which I talked about before, to paste on a second query, and that second query references my own common table expression's name. This is where the recursion comes in, right? So the one goes into source, the second query returns another one, that goes up, and all of a sudden we're in an infinite loop, okay? That's effectively what's happening: you select one, it goes into source, you select from source, you get another one, it goes up, down, up, down, and you can't get out, okay? This is the pattern we're going to see over and over again. Here is another example: I am now displaying the word hello over and over again. Remember the imperative example I showed you earlier that just said hello, hello, hello? Effectively I have hello, I select from source, source has hello again: hello, hello, hello, and it doesn't stop, all right? Not very useful, but it's that same structure, okay? Now, this is interesting. This is exactly the same as that query, except for what? The ALL, thank you. See, it disappears right here, and there it is, and it's gone. And hey, I labeled it in red, so everyone probably realized that's what was going on. What's really interesting is: the hello goes into source, the hello comes down, and when Postgres realizes there's already a hello in there, nothing gets added, so it returns just one hello, okay? That's an example of why, in almost every case, you're going to want UNION ALL in these recursive queries. Yes, sir? So, in the different stages I talked about, where does that termination happen? Your question is really: does it generate an infinite number of hellos before it terminates? No, and the parser does not catch it either. What actually happens is the executor produces the hello, it goes into source, it loops around again, it takes the new hello and tries to insert it into the existing batch it's holding, because remember, it's building an in-memory or disk-spilling copy of source, right? And because nothing gets added, it knows it doesn't need to come back again, and it just exits. Great question. Other question? Besides how to download the slides... yes, you're absolutely right, this gentleman is right on it: there is no WHERE clause in this second part, okay? And because there's no WHERE clause, it effectively tries to go forever. Frankly, how would you get out of it? The way you get out of it is a WHERE clause, okay? Right there in red, even.
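A sketch of the runaway hello query and its terminating UNION cousin, assuming the statement_timeout safety net he mentions; the ::text cast is added so the literal has a concrete type:

    SET statement_timeout = '1s';   -- safety net: cancels the runaway query

    WITH RECURSIVE source AS (
        SELECT 'hello'::text
        UNION ALL                   -- keeps duplicates, so this loops forever
        SELECT * FROM source
    )
    SELECT * FROM source;

    -- Swap UNION ALL for plain UNION and the duplicate 'hello' is discarded,
    -- nothing new is added to the working set, and the query stops after one row.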
That seeds the counter, primes it, okay? One goes up to the counter. Now I take the one and check to see if it's less than 10 — yes. I return one plus one, which is two. And then I go back up here: two, three, four, five, six. When nine comes down, nine's less than 10, I return 10, 10 goes up. Is 10 less than 10? No. So I don't return anything, I exit the common table expression, and I pop down here to return the result, okay? And that is the result — check. Now, we do have an easier way of doing this in Postgres: generate_series. It makes it very easy to go from one to 10, but you get the idea. And this is how you would do it in Perl — again, for those of you who know Perl, it's just an example. We have somebody who knows a lot about it here, maybe Randolph would know, but anyway, we have a little example here just so you can have a framework for what we're doing, okay? In fact, this example is actually not that accurate; this one is accurate, okay? In this case, you're going from one to 10 and you're not really queuing the results, you're printing them as you go, okay? In this example — and I hope it's correct — effectively what you're doing is creating a function f, and then you're priming it with one, right? And the one is going up here, it's going into r, it's printing the one, and then it's calling the function again with two, as long as it's less than 10, and then it goes — wait, what? Frankly, that one still is not right; this one is right, okay? Because even there I'm doing recursion, but I'm printing the results as I get them, right? And that's not the way common table expressions work. They don't feed the result down to the main query until they're finished, so this one is right: I basically create a function, I start it with one, I push it into an array, I do it up to 10, and then I output the array. So this is really the right example, because I am queuing up the values in memory in an array, and then at the end, in my outer query, I'm printing the array. Does everyone get that distinction, right? I went from a simple for loop, which is really not what we're doing in SQL, to this one, which prints them as I generate them, which is also not accurate, to this, the more correct example. Yes, sir. Well, I thought about putting a test in here, but it was kind of simpler to do it this way — this is a construct in modern Perl. Basically it says, if I haven't passed anything in, which I haven't, use one. And I would argue that that actually is more accurate, okay? We could probably spend quite a bit of time debating which bit of Perl is exactly most like common table expressions, but I think this one got really close. Any questions? In fact, just an aside: I was at a conference and somebody said, I looked at your presentations, and you write them to be studied — I guess because of the red — and he said, I'll take it, and I'll just go through it slide by slide. And I'm like, there's a whole story inside of a single slide sometimes, and I think this is an example of that. I used to be a high school teacher and then did some college work, so I guess I'm used to that, but I also love it. And I know we have a lot of teachers here as well. Trainers. Other questions? Okay, examples.
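Counting one to ten, sketched out — seed, recursive step, exit condition — plus the built-in that does the same job:

    WITH RECURSIVE source (counter) AS (
        SELECT 1                      -- seed value, primes the recursion
        UNION ALL
        SELECT counter + 1            -- recursive step
        FROM source
        WHERE counter < 10            -- exit condition
    )
    SELECT counter FROM source;

    SELECT * FROM generate_series(1, 10);   -- the easy way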
Again, I've shown you a lot of academic stuff, a lot of stuff that doesn't make a lot of sense. Now we're gonna actually do some cool output stuff with this data. Okay, so: factorial. I promised you — every time you learn a language, you gotta learn how to do a factorial, right? And we're gonna do that in SQL. I don't recommend you do it, but again, it's a great example of how you take an imperative problem and force SQL to do it. Now, Postgres has a factorial function; there's no reason to do this, but it's very illustrative of how to do something. And we see all of the common structures of common table expressions here, right? Starts with WITH, which we know. Uses the word RECURSIVE, which we know, right? Then it has a source, and it labels the two columns, counter and product. We start with one comma one as a prime, and the first column goes up by one each time. The second column takes the counter, adds one to it, and multiplies it by whatever the product was. All right? And we do it up to 10, and then at the bottom we just output what we have. And if we look at the output, that's exactly what we think: there's 10 factorial right there, right? And we also have the factorials of all the numbers up to 10 as well. You may not want that — you may only want to see 10. So what I effectively did here was add a WHERE clause in the outer query to say, let's trim off all the other stuff that came from the common table expression, just give me the 10 that I want. All right? And that's a very clean output right there. So yes, you're correct: if I try to do any filtering in the CTE, the effect of that filtering is to stop it from continuing, okay? Now, I did do filtering here. What I cannot do — and this might be where you're going — is have this WHERE clause and get rid of that one, because the problem would be I would never finish the common table expression to get to that WHERE clause. And because there's an optimization barrier, I can't take the WHERE and pop it up into the common table expression. Yeah. Well, yeah — in fact, because the counter would never equal 10, since it's starting out at one, it would just stop. Perfect, yeah, that's a perfect example. And again, keep asking questions. This is a tutorial, this is what it's for. We chose not to have those breaks, so let's do the Q and A, right? I mean, there's a wealth of stuff here. When we do window functions, it's gonna be a wealth of stuff — I guarantee you, like, whoa, what in the world was that? I'm used to that. So the question is, without recursion, is there value to doing CTEs instead of views? The answer is, there are performance reasons to use CTEs instead of views. Yeah, there actually are. And the reason is because there are some cases where you as an application developer know more about your data, or you have a particular query that uses a very expensive function, and you want that function to be run under your control, not wherever the optimizer happens to reorder your function call. Or certain joins you want done in a certain order, and then outside you wanna do something else. So there are a number of developers who will take a long query and actually break it up into parts because they want to control the order. Now, we should probably add an option to turn that off, but my guess is we'll probably keep the optimization barrier there by default, particularly for that reason, okay? Other questions?
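The factorial, sketched the same way: two columns, counter and product, primed with one comma one, filtered in the outer query — filtering inside the CTE would stop the recursion early:

    WITH RECURSIVE source (counter, product) AS (
        SELECT 1, 1
        UNION ALL
        SELECT counter + 1, (counter + 1) * product
        FROM source
        WHERE counter < 10
    )
    SELECT counter, product
    FROM source
    WHERE counter = 10;               -- trim to just 10! = 3628800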
Can you tune the optimizer so the CTEs are not an optimization barrier? I think we need to add that feature. It's on the to-do list, but we have not done it. So no, you cannot tune common table expressions not to act as an optimization barrier. I would like to see that happen, but it has not been done yet. Yeah, let's see. And this is actually the Perl example of what we've talked about — again, here's the priming, here's the recursion right here, and we're pushing onto the array, and then we're outputting. Now, you might have seen me do a lot of numbers. You might think, well, what can I do with strings? The answer is you can do strings too, okay? So here's an example. I'm priming it with an 'a', and then I'm concatenating an 'a'. So I primed 'a' into here, and then I say, as long as the string length is less than 10, append an 'a', append an 'a', append an 'a', and select the whole thing, and I get all these a's like that, okay? Not really exciting, but it works. So you might say, well, I want to have a 'b' next to an 'a', and then a 'c'. Well, if you parse this big thing here, it's effectively taking the last letter, getting the ASCII value, adding one to it, and then converting it back to a character again, and then you get this, okay? A, b, c, d, e, f, whatever. And there's an example — you can imagine some use cases. You might want to do this for populating a table, something like a code table or something like that. That's kind of cool. Yeah, this one always gets people laughing. This is actually a demo people always use for common table expressions: it gives you ASCII art. I start with minus 10, and I go to positive 10, and then I print a certain number of spaces and an X, and a certain number of spaces and an X, and I get that, okay? So I'm drawing a big X. There, you see it? There you go, exactly, big X. Here's another one; this should show you a little better what happens. We go from minus 10 to 10, and we print zero spaces for 10, one space for nine, two spaces, and again, the size gets smaller. And here we're printing, I think, eight or nine spaces, and then two letters next to each other, right here at zero, and then all the way down, right? You can do other things. I can make a diamond by changing the way I'm computing this. So now it's a diamond. I can make it more rounded if I want, using a little power function there. I got like a — I don't know what that is. It actually looks like a piece of bread. Everyone says it looks like an Italian bread. Khachapuri, if anybody knows Georgian food — that's what I said in Russia, and everyone laughed at that in Russia. But there are some breads that look like that shape. And then I can do kind of this; this is more like a khachapuri to me, a little wider, not really a great circle, but that's the best I could do. And I've seen ones where you do Mandelbrots, Christmas trees — I mean, there's all kinds of stuff you can do. I don't think it's very useful, but it's very illustrative, I think, of that imperative aspect, okay? Here's another one — again, usually a CS 101 example: prime factors. Every integer has a certain number of prime factors. For example, for 100 it's two, two, five, five; for 66 it's two, three, and 11. And you can actually do this prime factorization in SQL using case statements, okay, as I told you about before. Here's the prime factorization of 56.
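Sketches of the string tricks, plus the shape of the factor-of-56 query that's about to be walked through. These are not the exact slide SQL — the column names and the CASE arrangement are assumptions:

    -- a, aa, aaa, ... up to ten characters
    WITH RECURSIVE source (str) AS (
        SELECT 'a'::text
        UNION ALL
        SELECT str || 'a' FROM source WHERE length(str) < 10
    )
    SELECT str FROM source;

    -- walk the alphabet: ascii() of the last letter, plus one, back through chr()
    WITH RECURSIVE source (str) AS (
        SELECT 'a'::text
        UNION ALL
        SELECT str || chr(ascii(right(str, 1)) + 1)
        FROM source WHERE length(str) < 10
    )
    SELECT str FROM source;

    -- prime factors of 56: keep dividing out each divisor while it fits
    WITH RECURSIVE f (divisor, remainder) AS (
        SELECT 2, 56
        UNION ALL
        SELECT CASE WHEN remainder % divisor = 0 THEN divisor ELSE divisor + 1 END,
               CASE WHEN remainder % divisor = 0 THEN remainder / divisor ELSE remainder END
        FROM f
        WHERE remainder > 1
    )
    SELECT divisor, remainder, remainder % divisor = 0 AS is_factor
    FROM f;                           -- the true rows are 2, 2, 2, 7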
And I basically run through: I go from one — this is my counter, this is my remainder, and this is whether it's a factor or not. And again, I do it until the remainder's down to one, and then, basically, for 56 I go down, and all of the rows that are true are factors. So two, two, two, and seven are the prime factors of 56. In fact, if I change my output and say only show me the rows of the prime factors, this comes out a little better: two, two, two, and seven, all my prime factors right here. Okay? How'd it go? Yes, sir? Yes — the question is, does it execute this modulo three times? Indeed it does. Yes, okay. Now, here are the prime factors of a number I just picked, 322434, right? And if I run it, these are the prime factors of that number, right? Very imperative. This is actually the prime factorization of 66. And what you're noticing here is it's trying every number. Like, why is it doing nine if it already did three? Kind of weird. Why is it doing eight? Why is it bothering with numbers it already did? So we can actually optimize this — of course, you know, make it faster. Here, we only do the odd numbers after two, okay? And now that trims down quite a bit of the rows we have, and it has an exit case there, okay? And again, if you only wanna see the primes of 66, there's the answer, okay? And this is the example in Perl for that. Okay, questions. So now we're gonna get into an actual application of the kind you might build as a real person, right? Here, basically, what we have created is a part table, and what we're doing is inserting a whole bunch of parts. Now, the interesting thing about this part table is that it has parts and then it has subparts. So think of an airplane. What does an airplane have? It has a fuselage, wings, wheels, a tail section, right? But then inside the wing there's a whole bunch of parts. Inside the fuselage is a bunch of parts. The wheel has a bunch of parts — the landing gear, the fuselage, the tail section, right? So effectively we have a number of parts, like part one, but then we have subparts for part one, and part two and subparts for part two. And then part 11 has some subparts and part 13 has some subparts; 22 has some subparts and 23 has a subpart. So think of this as a tree, okay? And this sort of tree traversal in SQL is a very common use for common table expressions. Without common table expressions, if you need to traverse a tree, you effectively have to issue a query: give me the top-level parts. And then for each top-level part, you have to issue the query again: give me all the subparts of the fuselage. And then for all those subparts, give me all the sub-subparts of the fuselage, right? And you have to keep going down, down, down — you're issuing queries over and over again from the client. What common table expressions do is give you the ability to do all of that tree walking in SQL. Again, that imperative aspect, where you're telling it: walk the tree of the parts, walk the tree of the plane, and do certain behaviors. So this is how we prime it up. Here is a recursive common table expression. We're saying, for part two, which is this set right here, give me all of the subparts of that part. So if part two is the fuselage, 21, 22, and 23 are subparts, and then these are sub-subparts of the fuselage, okay?
Now, you might not like that, but this is a very simple example. All we're doing here is priming it with two. We're saying: join the source table on the part number, give me all the subparts, then return that up here, and keep going. So that's how it's walking down here, like that. Okay, walking down, yes. So that is a great point. This is one of the first cases where the subquery here is returning more than one row. Up until this point, we've always had it returning one row — one to 10 or whatever, right? Now our subquery here can return multiple rows. And in fact, that means that these three rows get added there, and then three rows get instantiated. Now it's a three-row table instead of a one-row table, and it's gonna roll through. Now, because of the way this structure is set up, I don't have any loops, so I'm never gonna loop around and end up back where I started from. Okay, because that would be bad, right? You can't say that the wheel is part of the fuselage, which is part of the wheel — it doesn't work that way. Nobody would set up a plane like that, because it would never fly, right? But here's one interesting use case that I normally don't cover, since I've been harping on the fact that we wanna use UNION ALL in almost every case. If you're worried that your data has loops in it — and maybe I'll add a little mention down here now that you picked that up; nobody picked that up before — if you're worried that your data has loops in it, and you're willing to have the overhead of the duplicate removal, you may wanna remove the ALL here. And what that will do is prevent it from going into a loop, just like back on slide 20-something, when we did the hello and we didn't use UNION ALL and it didn't loop. Do you remember that? Because it knew: I already have one hello, I'm not gonna add another one. So if you're worried your data has loops, you might wanna remove that ALL there. Right, yes. So all it's gonna do is remove the rows that are duplicates; any other unique rows are still gonna be passed in. Okay, it's not gonna stop at one. It's gonna say, oh, I have three rows, one of them is a duplicate, so now I'm only gonna pass two rows up and loop on those two additional rows. Yeah, yes, ma'am. Great, thank you. Yes sir — so you're asking if you can add a LIMIT 1 here? Out here? I'm sorry — here? The problem is, if you add a LIMIT 1 here, you're only gonna get the first of the subparts. Say you had a friends example — this would be a great example. If you added an ORDER BY here, and you said, I wanna see the closest friend of this person, and then the closest friend of that friend, you would eliminate all of the secondary friends and only get the primary friend, assuming you had a ranking of friendship. I guess there might be some cases where you would not wanna fully go through the whole hierarchy of sub-objects and just pick one, but you'd have to have an ORDER BY in there. And what I don't remember is whether the ORDER BY is executed for each iteration or only at the end. I think it's each iteration, but I'd have to look at that — nobody's ever asked me that question. But normally you're probably not gonna want LIMIT, unless you're sure, and you're probably always gonna wanna use UNION ALL — again, unless you have loops in there, and then you wanna eliminate the loops, and removing the ALL would be a good way of doing it.
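A sketch of the parts walk under part two. The table layout here is assumed (a parent column and a part column); the recursion joins each level back to find its children, and swapping UNION ALL for UNION is the loop guard just discussed:

    CREATE TABLE part (parent_part integer, part_no integer);
    INSERT INTO part VALUES
        (0, 1), (0, 2),
        (1, 11), (1, 12), (1, 13),
        (2, 21), (2, 22), (2, 23),
        (13, 131), (22, 221), (22, 222), (23, 231);

    WITH RECURSIVE subparts (part_no) AS (
        SELECT 2                              -- prime with the root part
        UNION ALL                             -- use UNION if the data may loop
        SELECT part.part_no
        FROM part JOIN subparts ON part.parent_part = subparts.part_no
    )
    SELECT part_no FROM subparts;
    -- 2, then 21, 22, 23, then 221, 222, 231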
Yeah, yes sir, in the back. Can you be a little more specific? You used some acronyms I didn't know. Yeah, so that's interesting, and I'm gonna have to go here to talk about that. So if I was returning the parent of these objects, then the UNION ALL would get me into an infinite loop. One of the reasons I'm just returning the subpart that goes with the part is that if I returned the parent, the parent would keep repeating and I'd never get out of the loop. So I think any time you're gonna get duplicates, unless you get rid of that ALL, you're never gonna get out of that query. But there might be cases where you'd wanna stop at a duplicate, and that's where you would get rid of the ALL, basically. So if you have a child node that has no parent — multiple parents — oh, oh. So if you have a child with multiple parents, that means you have multiple ways of getting in, and in that case you probably would remove the ALL. Yeah, that's another example I hadn't thought about, but you're right: your child may appear multiple times, and if you got rid of the ALL, it would only appear once, that's right. Now, there's another trick to this, and the trick is that I'm only returning the part number. I might want to return the parent to go with it, and I'm gonna show you some examples in a minute which really go a little bit into the steroids kind of range, because right now I'm just returning a number, but wait till I show you a little bit more here, okay? So here I'm just returning the part number, straight — and again, it doesn't handle the case where a child might be in two places. You might have the same part in the fuselage and the wing — the same bolt might be used in both places — so that's a good example. So in this case, I'm actually returning the level with the part, okay? Here I'm putting in a level, and I'm printing a different number of dashes depending on what level it's at, so now I'm displaying graphically how far down this subpart is, all right? Using this part in red here. And I can also order it. So now, instead of having the part and the subparts — notice this: these two parts are the parents of this one, and this one's the parent — they're all kind of bunched together because of the way I've ordered it, okay? So now I'm gonna order it by the tree, and the tree is effectively a concatenation of all of the part numbers, okay? So now I'm getting not just the part — and sir, you should take a look at this — I'm returning the hierarchy of the part. And that's where I stop having duplicates, because the same bolt appears in two places with different parents, different tree hierarchies, right? So effectively what I'm doing is giving you the tree hierarchy coming down. But this is kind of weird, because I'm doing it as text, which is awkward. So in this case, I'm creating a Postgres array, and I'm appending the parts to the array, and now I'm getting a nice array that I can order by. So instead of doing the ordering as text — which is really terrible, don't do that, right? Because it's just not gonna work the way you want it to: nine is gonna appear after 11, because '11' sorts ASCII-less-than '9', right? By doing it this way, I get everything in an array, and then I get a graphic.
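The same walk carrying a level and a tree path, sketched against the assumed schema above; the array append gives you something sane to ORDER BY instead of text:

    WITH RECURSIVE subparts (part_no, depth, path) AS (
        SELECT 2, 1, ARRAY[2]
        UNION ALL
        SELECT part.part_no, depth + 1, path || part.part_no
        FROM part JOIN subparts ON part.parent_part = subparts.part_no
    )
    SELECT repeat('-', depth) || part_no::text AS part, path
    FROM subparts
    ORDER BY path;      -- keeps every subtree bunched under its parent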
So in a lot of cases, you wouldn't want just the part, you'd wanna see the whole hierarchy like that. And now if we have a part that appears in two places, it's gonna appear two different times, even though we use UNION ALL, right? And we're gonna get a clear picture: this part appears here, and this part also appears in this hierarchy. And the full output is basically this. You can see the different levels, the different part numbers, and then the tree, and then the array that gives me the structure of all that. So again, this is taking that part query and building more and more until we're starting to do some really useful stuff. Again, instead of having to repeatedly send queries over and over to the server, we're sending one query, the server's doing all the work — much more efficiently than we could — and then returning the result back to us. Very, very powerful, I think. Any questions? Now, Postgres has a dependency table built in, so I'm gonna show you an example of displaying the Postgres dependency table. This is an internal table that's only used by Postgres; you'd probably never use it yourself. But here, if I create a table called deptest, I can actually look at the Postgres dependency table and output the dependencies that come from that new table I created. And if I don't want to see deptest itself, these are the dependencies without the major one. Now, if I add a primary key to the table, I have two new dependencies from my table. And if I add a serial column, I now have four new dependencies, okay? And again, you can see how internally Postgres has this dependency table — one table has, what, six dependencies. And I can walk that hierarchy with this complicated SQL query. This is very Postgres-specific — again, you would probably never use this — but the idea of creating a dependency graph in SQL is obviously very possible and very easy using common table expressions. And effectively, this is the tree right here, with arrays of object IDs. As you can see, some of them are primary objects, some are sub-objects, some are sub-sub-objects, and we could go down as far as we wanted. So, we're getting to the end, where I promised that you would be confused or at least surprised. We're gonna conclude with writeable common table expressions. Again, this is Postgres-specific as far as I know, but it does take common table expressions to a whole new level in terms of what they can do, okay? And once we're done with this section, we'll take a 20-minute break — go out, have something to drink, or just stand up and walk around. And when we come back, we'll do window functions, and hopefully your brain will have relaxed by then, because the window function talk is, again, very complicated, so you will need that break to let your brain unwind a little bit. At least I always do. Any questions before we get started? Okay, so: writeable common table expressions. These two bullets are not the same thing — I will try and highlight the difference. First, writeable common table expressions allow data modification commands in the WITH clause. You've normally seen only selects inside common table expressions, all right? What we're going to show you now are cases where you do an insert, update, or delete in the WITH query, with a select at the bottom.
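A rough sketch of walking pg_depend for a new table. This deliberately ignores the classid/refclassid distinction for brevity, so treat it as illustration only; UNION rather than UNION ALL keeps repeated dependency rows from looping:

    CREATE TABLE deptest (x integer PRIMARY KEY);

    WITH RECURSIVE deps (objid) AS (
        SELECT 'deptest'::regclass::oid
        UNION
        SELECT d.objid
        FROM pg_depend d JOIN deps ON d.refobjid = deps.objid
    )
    SELECT objid, objid::regclass FROM deps;
    -- non-relation objects display as their bare OID numbers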
So the main query is a select, but inside the common table expression is an insert, update, or delete with a RETURNING clause, which I'll tell you about in a minute, okay? That's bullet one. Bullet two: allow WITH clauses to be attached to insert, update, and delete statements. In this case, the query inside the common table expression is a select, and the main query at the bottom is an insert, update, or delete, okay? So, two examples. One has the insert, update, delete in the WITH query. The second example has the insert, update, delete in the outer query — that's the bottom query. Does that distinction make sense, everyone? Okay, first example: we're going to create a table called retdemo — I don't know why I called it that; return demo, I guess. And we're going to insert three random numbers. But we're also going to use a special Postgres syntax called RETURNING. This is not SQL standard, I think — David, can you confirm that's not standard? Yeah, it's a Postgres thing. It kind of takes this stuff to the next level. So, for example, if you're inserting three random numbers, you really don't know what those random numbers are. But if you put a RETURNING clause at the end, you can actually get the results back from the insert. All right, kind of cool. Here's our first example. Inside the common table expression, I'm doing an insert of three random values, and I am using the RETURNING clause. What does the RETURNING clause behave as? It really behaves like a select, right? Doesn't that look like select output to you? So it's an insert with a select kind of at the end, which returns only the rows that I inserted. Kind of cool. So here I can say: insert three random values, return them as a select, and then give me the average of those three inserted values. Kind of cool, okay? Three random values, give me the average, there we go. Here is another example: in the common table expression, delete all the rows in the table, return a select of all the deleted rows, and give me the maximum one. What is also very interesting about this is that it happens as a single snapshot, as a single transaction. This would be very hard to do any other way, because you would not be sure you got all the rows — somebody could have inserted a row between your delete and your select, and then you'd be doing some weird stuff. Okay? Here the delete happens in the same snapshot, I'm returning all the rows, and that result is coming right back and I'm doing the calculation. There's no way for somebody to get in the middle. There's literally no way to do this otherwise unless you lock the table and then, in a transaction block, select it and delete it — that kind of awkwardness, right? Whereas this is like, bam, one query. Yes. So, is this whole thing atomic if one of these were to die or fail? It's completely atomic. It is one block. Now, if you're doing this inside a transaction block, then again, it's a single query, and it's controlled by that — it could abort the whole transaction, you could have a save point in there, whatever. This is one single query. If you're not using a transaction block, then effectively it is a transaction. Yeah. Could you use that to move rows from one table to another? Absolutely, you can use it to move from one table to another in an atomic way, that's right. And what you could do is — yes, this is exactly where you're going — you could delete them and then insert them, right?
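Both of those, sketched with the retdemo table from the slides (the column name is assumed):

    CREATE TABLE retdemo (x float8);

    -- INSERT inside the WITH clause; RETURNING acts like a SELECT
    -- over the three new rows, and the outer query averages them:
    WITH inserted AS (
        INSERT INTO retdemo
        SELECT random() FROM generate_series(1, 3)
        RETURNING x
    )
    SELECT avg(x) FROM inserted;

    -- DELETE everything and report the max of what was deleted,
    -- all in one snapshot, so nothing can sneak in between:
    WITH deleted AS (
        DELETE FROM retdemo RETURNING x
    )
    SELECT max(x) FROM deleted;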
I have an example — I'm getting to it. I will show you an example. So this is the second bullet. This is the case where we do a select inside the common table expression, and then we do an insert, update, or delete outside. So here I'm gonna make another table called retdemo2, and I'm gonna insert three random numbers — there's the list of random numbers. And here I say: with source, give me the average of my rows in retdemo2, and then delete everything less than the average. Very, very hard to do in normal SQL unless you lock the table, right, and then do your operations. This is one query. So we're basically saying, give me the value, and then I can look and see how many rows are left. In fact, because I've deleted all the lowest values, I only have one row left. Yes sir — where's the transaction? Where am I committing the transaction? Postgres is in auto-commit mode. By default, every query is its own transaction that's committed when it finishes. It is not like Oracle, which is not in auto-commit mode — there you have to commit after every command unless you turn auto-commit on, right? In Postgres, the default is auto-commit mode. You can turn that off, but by default it's on. Oh — so the question is about multiple data manipulation commands. In Postgres, effectively, that entire query is one transaction. Whether you do the insert or the delete — I'm gonna show you an example in a minute that does an insert and then a delete and then another insert — that is one query, and that is one transaction if you're in auto-commit mode. You can even put DDL in a transaction block in Postgres, whereas Oracle, yeah, it just goes crazy. So again, we're getting onto steroids here, but we've had this for 20 years. We've always allowed DDL in transaction blocks, certainly DML, multiple DML — in fact, DML pasted together with common table expressions, which is what we're gonna show you in a minute. I haven't even gotten to the most complicated one yet. Okay, yes, sir. So, yes: this is auto-committed, this is auto-committed, this is auto-committed, and that is auto-committed. Thank you, very good point. Other questions? Okay, so here's another cool one. Basically, I'm saying: take part two, generate all the subparts, and then delete them all — a common table expression that does a recursive select, and then a delete of everything it returns. One thing this does not do — and I need to go back to this gentleman — is handle a subpart that's used in two places. I'm afraid it would delete it, which is not good. If you had a subpart that was used in two places and you only meant to remove it from one of them, the whole thing would go away. That's a problem, yeah. So maybe that's something we should work on. But again, this is a tutorial — it's a good thing. Frankly, I really hadn't thought of it; you started talking about having a part in two places, and then I started to think, oh yeah, that would be a problem. Other questions? So again, great example: walk a tree, delete a tree. We did walk a tree, return a tree; now we walk a tree, delete a tree. Really, really very convenient. Here's where we can start to go, woohoo. So here I have retdemo3. Again, as I told you, I have the SQL — if you want to run it, you can run the whole SQL command right off your screen. retdemo3: put three random numbers in, okay? Give me the average of the three numbers in the table.
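And the second bullet, sketched: the SELECT lives in the WITH clause and the DELETE is the main query (table and column names assumed):

    CREATE TABLE retdemo2 (x float8);
    INSERT INTO retdemo2 SELECT random() FROM generate_series(1, 3);

    WITH source AS (
        SELECT avg(x) AS avg_x FROM retdemo2
    )
    DELETE FROM retdemo2
    USING source
    WHERE x < source.avg_x;            -- drop everything below the average

    SELECT count(*) FROM retdemo2;     -- one or two rows survive, depending
                                       -- on where the middle value landed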
In the second common table expression, delete all the rows that are less than the average, and then return what's deleted. I could also have it return what's left if I wanted to, but I just wanted the deleted ones. So we're saying: give me the average, delete everything less than the average, show me what I just deleted. Yes sir — one slide, okay. So here we start to go — this is actually, I think, really useful, frankly. Here we have an orders table and an items table. Standard: you have orders, and each order has a bunch of items in it, right? Classic case. So I have an orders table, I have an items table. Now I can say insert into the orders table — and you might notice that this is a serial column, so it's the order number, okay? And I'm using DEFAULT, which says: insert into the orders table and assign the default order number during this insert. Normally, you'd then have to find the order number with nextval or currval and use it that way. But what you can actually do is say: return the serial column I just assigned, and then insert my items using that same order number. So what I've done is insert into the orders table and the items table in one statement. Instead of having to do the query and then say, give me currval, and insert with currval, I'm doing it together, okay? Here's another one: delete from the orders table, return the order ID, and then delete all the items for those orders. So again, just like that: delete the orders, and at the same time, delete the items for those orders. You can see how you start to combine stuff together — it becomes very, very powerful. Here's another one: delete from the orders table, then delete from the items table, and then take the order number and insert it into another table, like as a backup. So I've got an old orders table, and I wanna just record: this is when I deleted it. And I'd probably add like a current_timestamp here, so I know when I deleted it. But again, this is just an example, right? But that's how you would effectively do it if you wanted to move data, which is what the gentleman was asking about, right? You would delete and then you would insert into the other table — in one snapshot, one visibility, right? You'd be moving the rows: you'd be deleting them, they'd be returned, and then you'd be inserting them somewhere else, all committed together. You can't lose any data; there's no visibility problem. You don't even have to lock the thing — that's the beauty of it. Normally, to do this, you have to lock the table, select it, delete it, and insert it. And while you're doing that, the stuff's locked. Here you're locking each row as you're deleting it, but you're allowing other inserts to happen on the table at the same time, right? So it's kind of like, I'm moving stuff around, but at the same time people can be adding new rows that aren't part of my snapshot. I don't see them, I don't delete them, nothing happens. Tremendously, tremendously powerful. This is not only making your life simpler in terms of having one query that does a bunch of things, it's allowing you to do something concurrently that's very hard to do any other way. Why use common table expressions? Well, we're almost done. They allow imperative processing of SQL.
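Sketches of the orders/items patterns — the schema is assumed, and old_orders is a hypothetical archive table:

    CREATE TABLE orders (order_id serial PRIMARY KEY);
    CREATE TABLE items  (order_id integer, part_no integer);

    -- insert the order and its items in one statement, reusing
    -- the serial value via RETURNING instead of currval():
    WITH new_order AS (
        INSERT INTO orders DEFAULT VALUES
        RETURNING order_id
    )
    INSERT INTO items
    SELECT order_id, 100 FROM new_order;

    -- delete an order and its items together:
    WITH gone AS (
        DELETE FROM orders WHERE order_id = 1
        RETURNING order_id
    )
    DELETE FROM items
    USING gone
    WHERE items.order_id = gone.order_id;

    -- or move the deleted order numbers into an archive, atomically
    -- (assumes: CREATE TABLE old_orders (order_id integer, deleted_at timestamptz)):
    WITH gone AS (
        DELETE FROM orders WHERE order_id = 2
        RETURNING order_id
    )
    INSERT INTO old_orders
    SELECT order_id, current_timestamp FROM gone;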
They merge multiple SQL queries and their connected application logic together — unify them, okay? They improve performance by issuing fewer queries: less transmission overhead, less overhead for optimizing, for example. They use the same snapshot for the entire query — very hard to do outside of a common table expression. If you have multiple queries that need the same snapshot, you can use repeatable read mode, which kind of gets you around some of that, but then you still have to lock the table, which you probably don't want to do. But it does add an optimization barrier, okay? And that is all I wanted to cover today for common table expressions. So, it is exactly three o'clock — is that right? Yes. We will break until 3:20, and we will come back and talk about window functions. So, 20 minutes, have a good time. Just a reminder, we will start promptly at 3:20, so we'll get started in a minute. We'll get started in a minute. Okay, thank you, welcome back everyone. I won't need to go through introductions again, because we did that. Same story: presentations are on my website. Bruce Momjian, work for EnterpriseDB. One thing I will tell you about this talk is that Postgres implemented window functions six or seven years ago, and at that time I thought, hey, this would be a great talk — let me write a talk about window functions. And I started writing a talk and I got stuck. And I kept the SQL around — again, as I said before, I write the SQL and then I write the talk. And this summer I got a chance to go back to this topic, and for some reason it gelled in my head, and I was able to write the talk. So some parts of this are six years old. And the reason I got stuck is because there are some aspects of window functions that are really hard to get your head around. So the first part of this talk will be getting your head around some concepts and some terminology, so that as we get into some actual use cases, it will start to make sense. So again, this should be about an hour ten — take us to 4:30, finish up. I will be around, of course, to answer any questions or help anyone after that. I might go to 4:35, might need an extra five minutes. We shall see. Any questions? What are we gonna talk about? First we're gonna introduce window functions — what they are conceptually. Then we'll talk about the syntax of window functions, a very similar layout to what we had with common table expressions. Then we'll look at the window syntax with generic aggregates — these are things like sum, count, average. Then we'll talk about window-specific functions, things like lag, lead, rank. And then in section five it really gets interesting, where we take a real-world example and do analytics on a particular table with a certain number of employees. And finally, we'll finish by discussing considerations around window functions. So: introduction to window functions. What are window functions? Window functions are part of a larger set of features that Postgres supplies for data analytics — things like data partitioning, materialized views, common table expressions (right there, that's something you can use for data analytics), BRIN indexes, grouping sets, rollup, optimized aggregates, tablespaces — all things you would use for data analytics. Window functions are merely an additional feature that people use for data analytics. And again, it follows the same pattern as before: the concept that using window functions makes your life easier.
As application programmers, having window functions and not having to reinvent the wheel — giving the database server the problem and letting it solve it — is often the best and most efficient approach when you need something done. And that's the philosophy we've carried through this entire thing. So, what are window functions? This is a section from the Postgres documentation, slightly modified by me and committed probably six or nine months ago — so this is what the Postgres 10 docs look like. Just to highlight some words here: normally, an aggregate will reduce, or aggregate, a number of rows into a single row. However, window functions do not cause rows to become grouped into a single output row the way non-window aggregate functions would. Instead, the rows retain their separate identities, okay? Behind the scenes, window functions can access more than one row, okay? So when you think of running an aggregate, you think of taking a whole bunch of rows and getting a number out of it. In window functions, the rows retain their distinct characteristics. You do not eliminate rows when you're using window functions. And that is probably the most fundamental difference between a normal aggregate and a window function aggregate. I know it doesn't make a lot of sense right now. There's a whole bunch of stuff I'm gonna talk about in the next 10 or 15 minutes that doesn't make any sense, but it will make sense by the time we're done with the talk, because I'll be talking about concepts that you'd think would have nothing to do with window functions but are in fact fundamental to understanding them. As I said earlier, a lot of people who have seen this talk and used window functions before say, oh, I thought I understood them — now I really do understand them. In addition, there are a lot of people who, when they need a window function, just do a web search, find a window function that kind of does what they need, whack it around, and hope it works. That works some of the time, but there are pitfalls to it. So I believe if you're gonna use window functions — rather than putting it on autopilot — you should understand what you're doing with these words you're typing, and that's really the goal of this talk. Any questions? Again, keep your eye on the red, because we're gonna use the same pattern of looking for red text within the queries. Here is a query I'm gonna use repeatedly in the first part of this section: the function called generate_series, which effectively just generates the series of numbers from one to 10. In addition, just like the common table expression talk, we have a window.sql file, which is this entire talk as an SQL file that you can download and run in psql, okay? Yes, this here? Yeah. When you call a function inside a FROM clause in Postgres, you need to label the name of the virtual table and the name of the column in that table. You could kind of guess that the virtual table would be called generate_series, but it's not really clear what to call the column. So we kind of require you — and the SQL standard kind of requires you — to actually label the name of the virtual table and the names of the columns. You could have more than one column. Other questions? Here's our first window function. Wow, we got to a window function on slide eight, right? Why do I know it's a window function? Because of the word OVER. Any time you use the word OVER, you're taking an aggregate and making it act as a window function.
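The first window function from the slide, runnable as-is:

    SELECT x, sum(x) OVER ()
    FROM generate_series(1, 10) AS f(x);
    -- ten rows come back (nothing is collapsed), and every
    -- row carries the whole-set total, 55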
That's just what it is — that's what OVER does, okay? There's also an optional frame clause and other clauses you can put in the parentheses. In this case, we don't have any, so we're just gonna use the defaults. And when you actually look at the result, it's really confounding. Like, this is the best we could do? And indeed it's not the best we can do, but that's the default behavior. The default behavior for sum with nothing in the parentheses is basically to sum the entire set and place that on every single row. One of the things you need to realize here is that we have not reduced the number of rows: generate_series generated 10 rows, we have 10 rows. So what has really happened? Again, it was in that description before: from each row, we have reached out to all of the other rows and summed them. So for each individual row — and you can't see it here, you'll see it later — we're reaching around to the other rows and putting a value on that row that is representative of values not only on that row but potentially on any other row in the set. I'll say that again: the number a window function gives you represents not just that row, but can potentially be affected by any other row in the set. Does it add them up only once, or does it recompute for each row? Actually, I think it's doing it every time. I don't know if we optimize this case or not — I've never looked at the code. We could have in this case, but in other cases we couldn't have. So whether we do that or not, I don't remember. And in fact, I don't even think EXPLAIN would tell you; you'd have to look at the C code to see whether, when we realize that the frame clause is gonna affect all the rows the same way, we shortcut it. It would be pretty easy to find if you looked at the code, because you'd see a parameter that said "all the same" or something like that, but I've never looked. Here's our second example. We're doing two window functions: a count and a sum. And the count is counting all 10 rows, because the 10 rows are just counting one to 10. So each row can see 10 rows, and each row sees a total of 55, right? Very, very interesting. Again, both of these are the defaults. And I can do this: I can add a WINDOW clause. Here, I'm adding a window named w, which is the default spec, and now I'm referencing w in both of the window functions. This is a shorthand. If you're using the same window specification multiple times, you can give it a name and then reference that name any place you want in the query. I mentioned there's stuff that goes in the parentheses. Literally, this is the default meaning when you don't put anything in the parentheses. The default meaning is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. Your first trap here is that current row does not always mean the current row, right? This is probably what knocked me off the wagon when I started, okay? I'm gonna talk about that in a minute, but just red-flag that: current row is not always the current row. Any questions? Okay, so let's look at the window syntax. Here we have the full syntax for a window function. You have an optional PARTITION BY clause. You have an optional ORDER BY clause. Then you have what's called a frame clause, which can say the word RANGE or ROWS, and then a frame start, or a BETWEEN frame start AND frame end. And the frame start and frame end can be any of these things. Again, here's the syntax if you wanna look it up in the Postgres docs. All right?
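The named-window shorthand and the spelled-out default, sketched:

    SELECT x,
           count(x) OVER w,
           sum(x)   OVER w
    FROM generate_series(1, 10) AS f(x)
    WINDOW w AS ();                    -- define the spec once, reuse it

    -- empty parentheses literally mean this:
    SELECT x,
           sum(x) OVER (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    FROM generate_series(1, 10) AS f(x);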
This is quite convoluted. I'll be talking about all of these things in the next few slides. RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW — that's the default you get. The default has no PARTITION BY, no ORDER BY, and the word RANGE right here, okay? And again — this is just something I'm gonna have to talk about in a minute — RANGE means current row ends up representing all the rows, which I'll explain in a minute. Okay, so: current row. When you see current row, either as a default or because you're literally typing CURRENT ROW in your query, it can mean three different things. It can mean what you and I would think it normally means, which is the current row, okay? It can mean the first or last row with the same ORDER BY value — technically called peers, okay? Or it can mean the first or last row of the partition, or the set. And depending on what other words you use in the window clause, current row can mean any of these three things. I don't think this is a great design from the SQL standards committee, but this is the hand we've been dealt, and this is the hand we have to play. Current row will mean the literal current row only in something called rows mode. All right? If you're in rows mode, it's literally the current row. If you're in range mode with an ORDER BY, current row means the first or last row with the same ORDER BY value. I'll be showing examples of this. Dave looks like he's getting indigestion — I don't blame you. I remember going to my wife and saying, I think I finally got it, you know, like Einstein or Alexander Graham Bell — I think I finally got it, because I just kept digging and digging and digging to find out all this behavior, which is documented in Postgres, documented in the SQL standard. But because the names don't mean what you think they mean, it's a very hard process. It's like if I told you this digit means six, right? And then I type it with a two — well, now it's actually 62; it's a number that doesn't look like that number anymore, it's a different number. It's kind of the same here: current row isn't really the current row, it can mean something else depending on what day it is, right? Almost. That's where it starts to get confusing. And again, if you're in range mode and you don't have an ORDER BY, current row means the first or last row of the partition. Questions? Indigestion? Okay, so let's look at this graphically. This is the literal current row right here, okay? I'm doing an example with some duplicates so you can see what's going on. This is the literal current row, okay — literal current row, in rows mode. And here is the current row in range mode with an ORDER BY x: it's the two values that have the same ORDER BY value. Yes, ma'am. So the question is, do I have a 30-second explanation of what a partition is? I have a slide, I think back in the 50s somewhere, that actually shows you partitions. Until we get to that slide, just assume the partition is the whole result set, okay? By default, if you don't use the word PARTITION, the partition is the whole result set. So when I say here "first and last row of the partition," it's really the first and last row of the result set. That's why we got 55, actually. And I'm gonna walk through some examples showing exactly why, okay? Great question. Other questions?
The window frame in rows mode between unbounded preceding and the current row — this is unbounded preceding to the current row in range mode, are you with me? All right, I added an extra one here because I went from rows mode to range mode, right? And this is unbounded over the entire partition, or the entire set, exactly as you're saying, right? There are some cases in the docs where we talk about partitions but it's really the whole set, because we haven't defined one. And then there are other cases where, when you define a partition, it blocks the set off into separate partitions — and I'll show you some examples, probably around slide 50 or 60. Here's a better example. This is the actual SQL you would use for each case, just written out. Rows mode, current row to current row — bingo, makes sense, right? Start at the current row, end at the current row, in rows mode. Order by x, range between current row and current row: the current row's peers, from beginning to end, okay? This is rows from unbounded preceding to the current row, in rows mode with an order by x. Unbounded preceding is this, and then this is rows between unbounded preceding and unbounded following. Any questions? No one's ill? Good. Now we're gonna start to get into the weeds a little bit. This is not the pleasant part of the talk. It's sort of like getting into the ocean: there's all this sand and jellyfish around, and eventually you get out to an area where it's nice and clear. We're kind of getting out to that clear area, but there's a lot of jellyfish, and shells on the beach that are gonna jab you in the foot, and just yucky stuff you don't wanna know about on the sand — like, what is that, a dead animal or whatever? That's kind of where we are, right? We're gonna walk through specific examples of what I just talked about, and I think this is gonna make it very clear. Again, keep your eye on the red text, because that's where the differences are between the queries. So, this is the query we had before — the exact same query. And remember I said that the default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. The reason this looks this way is because we're in range mode and we have no ORDER BY, so the current row is the last row in the set — or the last row in the partition. Let me back up a little bit; I talked about this. Current row can mean the first or last row of the partition in range mode without an ORDER BY, right? Yes, sir. If you say ROWS BETWEEN CURRENT ROW AND CURRENT ROW, you get just your current row. That's right. If you say RANGE BETWEEN CURRENT ROW AND CURRENT ROW, then you get all of the rows that have the same ORDER BY value, if you have an ORDER BY — and if you don't, you get the entire set, right? I think I have an example coming. You're definitely gonna be ill with that one, but I do have an example. Okay, so this is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW — and unbounded preceding, that would make sense to us. Current row, because we have no ORDER BY and because we're in range mode, goes all the way to the end. If I change range mode to rows mode, this is the first time we've actually seen specific changes on individual rows, instead of the same number repeating over and over again. Yes, sir. And I challenge you to say that three times fast, but anyway, that is exactly it, yes. That is it, yes.
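Those frames, written out against the same ten rows so they can be compared side by side:

    SELECT x,
           sum(x) OVER (ORDER BY x ROWS  BETWEEN CURRENT ROW AND CURRENT ROW)         AS just_me,
           sum(x) OVER (ORDER BY x RANGE BETWEEN CURRENT ROW AND CURRENT ROW)         AS me_and_peers,
           sum(x) OVER (ORDER BY x ROWS  BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running,
           sum(x) OVER (ORDER BY x ROWS  BETWEEN UNBOUNDED PRECEDING
                                         AND UNBOUNDED FOLLOWING)                     AS whole_set
    FROM generate_series(1, 10) AS f(x);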
And yes — you had to weigh every word as you said that, to be accurate, yes. This is why I gave up six years ago, yes. And why I finally cracked it. Now, this one is interesting, because effectively what it's doing is going from the first row up until the literal current row. So one is just one. Three is two plus one, six is three plus two plus one, 10 is four plus six — right, all the way down. And the 55 is here because we know 10 down to one is 55, right? Nine is 45 — nine down to one is 45, right? So it's actually behaving in a predictable way here. Yeah, that would be bad, yeah. When I get to section four, there's a whole bunch of manipulation you can do where you can reach back two rows, or forward two rows — really cool stuff, yeah. I'm just starting, like, let's get to the water first, right? But we're gonna get to a point where you can do all sorts of cool manipulation. Right now, just like with common table expressions, probably no one's ever gonna run this actual query, right? There's no reason to. But illustration-wise, I think it works really well. In fact, if you look at the count, it's literally telling you — not the number in the first column — it's telling you how many rows it sees. That's what we're asking for: the second column is the count. So how many rows? I see one row here, I see two rows here, we see three rows, all the way down. It's not copying the first column; it just so happens to be the same number, that's it. So in this example, could we call the current row the current frame? No, no — because a frame is effectively a beginning and an end of a set. So it's really not the current frame; it's a row. The frame literally has to have a start and an end point, so you can't say the current row is a frame — it doesn't make any sense. You can say current row, in that case, is the frame end, because if you look at the syntax, that's exactly what the label says. Yeah, but I'm not sure it gets you anything. Somebody had a question here — yes, sir. So the question is, is it actually going through and calculating this for each row? The answer is yes. The good news is that nobody ever really complains about performance for window frames, because effectively what you do is you run the query and then you run this frame macro over it. And effectively all the data's in memory; it's not that big a deal. I've never heard anyone complain about the performance of this, and it's certainly always gonna be better than trying to do it in the client application. So I really would not be concerned about that. Whether we optimize certain cases, I can't remember — whether we do it or not. Yes, please. Yeah, it's pretty minor. Yeah, thanks — did I answer your question? I understand, yeah. So the question is, what if you do something iterative with long strings and stuff like that? I don't know — I've never seen people do that kind of thing, but it sounds interesting. Yes, sir. Okay, so if you don't say BETWEEN, then the frame end defaults to current row. Except we don't know what current row means — it defaults to the text "current row," whatever that means, yes. And I'll show you some examples in future slides where I'm actually using that default. I haven't gotten there yet, but I have a whole bunch of slides that show you one query spelled out and one with nothing there, and it'll default to current row always. Right — you can specify the beginning, and then the end is just the default, right. Other questions?
Actually, here's an example, right here. The default end frame is current row — boy, you must be looking at the slides, right? There we go. I didn't even remember that was the next slide. So if you don't specify an end frame, it defaults to current row, and effectively you're getting the same result here. I'm saying unbounded preceding, and what's not specified defaults to "and current row" — between unbounded preceding and current row. Okay, so here you use the word BETWEEN and "and current row," but again, the same result. What if I say this? ROWS — we're in rows mode, which means current row literally means current row, okay. So rows between current row and current row, and what do you know: every row sees one row, and the row is actually itself, because it's summing itself one time. Right — actually makes sense. And I can do this: if I say rows current row, what's the default for the end? Current row, right. So it's current row to current row — same result. Current row to current row, same aspect. Win one for the team. Here's the whole thing flipped over. Now I'm in rows mode, which means that current row literally means current row, and instead of saying unbounded preceding to current row, I'm gonna say current row to unbounded following. And I've just flipped the whole thing over. I start with 55, which is one down through 10, then two is 54 down, three is 52 down, and the last row is just itself. I just flipped it backward. Yes sir. If you did rows between current row and current row, would you effectively nullify the purpose of doing a window? Well, the thing is, I'm only showing you numeric aggregates, but there's a whole bunch of custom aggregates. So I think you probably would. You'd probably never want to do that, because the whole value of window functions is being able to reach around outside your current row. So if you did current row to current row and you're in rows mode, there's probably no reason to do that. I can see a reason for doing current row to current row in range mode. Or if you had a partition, yeah — that would be another example, yeah. Actually, it wouldn't matter whether you had a partition or not. Well, we haven't gotten to partitions, so let's skip that. Okay, here's our first interesting one, where we start to do something different. I'm now saying rows mode, between one preceding and current row. So I'm in rows mode, so current row literally means current row, and I'm doing one preceding. So now look at the count: the first row has no preceding row, so it only sees one. All the other rows see two, right? If I do the count with an x instead of a star, you'll notice these are the same. The difference between count star and count x is that count star will count nulls and count x will not. The point is that these missing rows are not null — they're actually just not there, and they are not counted, whether you use star or not. But what's interesting is the sum is the row plus the row before it, if there is one. So it's three — two plus one; five — three plus two; seven — four plus three; nine — five plus four; all the way down to 19, 10 plus nine. One following: same trick, the other way. All rows see two, except for the last one, okay? Three is one plus two, five is two plus three, all the way down; 10 is just 10, because there's nothing after it. Three preceding: now I'm going three back.
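The flipped frame and the one-preceding frame, sketched — no ORDER BY yet, so ROWS simply follows the order the rows arrive in:

    SELECT x,
           sum(x)   OVER (ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS flipped,
           count(*) OVER (ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)         AS rows_seen,
           sum(x)   OVER (ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)         AS prev_plus_me
    FROM generate_series(1, 10) AS f(x);
    -- flipped: 55, 54, 52, ... 10; rows_seen: 1 then 2s;
    -- prev_plus_me: 1, 3, 5, 7, ... 19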
Three preceding. Now I'm going three back. So you'll notice that up until row four, you don't see four rows, because there aren't three rows before those rows. The first row only sees itself, the second row sees itself plus one before, the third row sees itself plus two before, and the fourth row finally sees itself plus three before, and so on all the way down, okay? And you'll notice the numbers add up: six is three plus two plus one; 10 is four plus three plus two plus one; 14 is five plus four plus three plus two; and so forth, all the way down. Any questions about that? It's pretty straightforward, I think. Okay, and again, we have a whole section which walks through these constructs and does an analysis of employees, so just hold on for that. We have to finish this section, and then we've got another section to go, before we can actually get to the examples. When we get to the examples, all of this kind of comes together; you can start to see some actual analytics on some actual data, which returns some real results. But we aren't there yet; I have to get us there, right? So let's look at order by. Remember what I said: when an order by happens, all of a sudden, current row in range mode means all of the rows that have the same order by value. And I'll show you some examples of that. So, order by x. I've left everything as a default, and now you will see range between unbounded preceding and current row; and this current row is almost literally the current row here. Current row's peers are the rows with equal values for the order by columns, or the whole partition if the order by is not specified. Here we do have it, and we're getting a different result because we've added the order by. Yes, yeah, what's this right here? Exactly the same, right? So when I don't specify anything here, I'm really getting this. And when we now see the text, it makes sense: we're in range between unbounded preceding and current row, I have an order by, so current row means all of the rows that have the same order by value. We have unique values here, so for this data, current row ends up being just the current row. But just for this data; in a minute I'm gonna show you some duplicates, and then it's gonna start to make sense. So practically, for this data, it is. But I would almost tell you never to use the defaults. Just spell it out, because the defaults are so error-prone that it's really not worth relying on them. I realize it's a lot of text; you can use a window clause and give it a name, and then just use that name everywhere. But there's really no value in relying on these defaults, because the defaults will trip you up, and I'll show you some examples in our real-world data that will trip you up. So here's another one. Now this is order by x, range current row to current row. And because they're all unique, you basically get this output, and each row only sees itself, because there are no peers; each row has no peers.
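Two more of the frames just covered, sketched the same way: the sliding three-back window in rows mode, and current row to current row in range mode, where every unique value is its own peer group:

    SELECT x,
           sum(x) OVER (ORDER BY x
                        ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)  AS last_four,  -- 1, 3, 6, 10, 14, ...
           sum(x) OVER (ORDER BY x
                        RANGE BETWEEN CURRENT ROW AND CURRENT ROW) AS my_peers    -- just x: no duplicates, no peers
    FROM generate_series(1, 10) AS t(x);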
All right: you read the Postgres docs, or really any docs on this, and you're gonna see terms like peers and partitions and frames. And a lot of times when people see the word partition or frame or set, they think they're the same thing: the frame is the partition, it's kind of the same set. No; with window functions, frames, partitions, and peers are completely different. They have very technical meanings, and that's one of the problems. I have improved the Postgres docs a little bit in Postgres 10, but even then, when you start to look at the actual text they use, you can't just gloss over a term and assume it means the same as another term; they have specific meanings, and that's what I'm trying to bring out here. All right. So let's go to a better example, one that illustrates the case for peers. Instead of going one to 10, I'm gonna go one, one, two, two, three, three, four, four, five, five, okay? So I'm injecting duplicates here. If I now do my run just like before, with the empty parentheses, each row sees all 10 rows, and instead of totaling to 55, this set totals to 30. So 30 is our 55 now, okay? All the numbers, one, one, two, two, three, all the way to five, total to 30. So here's the query with order by x. Now, remember the defaults, right? So this is really this, okay? Order by x, range between unbounded preceding and current row; and current row, because we're in range mode with an order by, means the last peer. So let me back up: why do I get these numbers? Why do they jump by two like that? That's because each row can see all of the rows it is a peer of. The first row can see the second row. The third row can see the fourth row. The ninth row can see the 10th row. And when you look at the totals: one plus one is two, and you get two for both of them. Six, which is two plus two plus one plus one, for both of them. Okay? Again, 12, 20, and here's our 30 right here, for both rows. Because with unbounded preceding to current row, current row is the last peer. It's not my literal current row; it's the last row which has the same order by value as I do. So for this row, it sees that row down there as the end of the frame. We're good? Whoa, what happened? This was range mode. Same query; this is rows mode. What's happened? All of a sudden, I don't see the jumping by two anymore; I see literally one, two, three rows, and there's our 10. And my numbers are different, because in rows mode, current row means the literal current row. Yes, sir. So the question is, can you say order by x rows, and not put anything for the start and end? I'm gonna have to go back to the syntax. It looks like you cannot do that. It looks like you have to say rows unbounded preceding, or rows current row, or something; you can't just say rows and stop. I don't know why, but that looks like what the SQL standard says. I've grabbed this right from the Postgres docs, so I would be surprised: if this was optional, we'd have a bracket here, but because you have a brace, it's gotta be one or the other. I don't know why it is that way. So you can kind of see in rows mode how that's working. As soon as we switch into rows mode, current row is not my peers anymore, and I stop my frame at the current row, literally. I can do this: order by x, range current row. Now this is interesting. I'm in range mode here. So what this really is, is range between current row and current row. And in fact, each row sees two rows, right? That one sees two, one plus one. That sees four, two plus two; six, three plus three; eight, four plus four, right? It works. There's actually some logic here, right? That was range mode; if I switch into rows mode again, each row just sees itself, and each row is just summing itself. When I was in range mode, I saw my peers; as soon as I go into rows mode, there's nothing else there. Okay, any questions?
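A sketch of the comparison just made: the same default-style frame in range mode versus rows mode, with the duplicates injected via a hand-built array:

    SELECT x,
           sum(x) OVER (ORDER BY x
                        RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS range_sum, -- 2, 2, 6, 6, 12, 12, 20, 20, 30, 30
           sum(x) OVER (ORDER BY x
                        ROWS  BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_sum   -- 1, 2, 4, 6, 9, 12, ...
    FROM unnest(ARRAY[1,1,2,2,3,3,4,4,5,5]) AS t(x);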
Okay, partitions. We talked about partitions a little bit. Again, by default, the partition is the whole set, but you can partition your data. In this case, I'm partitioning by x, and effectively what I'm doing is making each value its own partition. So this is one partition, this is another partition, this is a third partition. And you can see each row can only see its own peers, because I've created partitions; I've got five partitions here, okay? And it's basically this same specification, okay? Now, that's kind of boring, so what I'm gonna do is break it up differently. I'm gonna say one partition is everything less than three, and the other partition is everything greater than or equal to three. So here's everything less than three, and here's everything greater than or equal to three, okay? Less than three is red, greater than or equal to three is blue. Now, remember, I haven't specified anything after this. So what is my default? Range between unbounded preceding and current row. In range mode with no order by, current row is effectively the end of the partition. So now, to answer the lady's question: effectively, when I do a sum here, I am summing all of the values in my partition. Okay, here I have four rows, and I've got four values; here I have six rows, and I can see all six. And in fact, the total of these is 24, right here, and the total of these is six. All right? If I add an order by, well, as soon as I add an order by, again, my default frame is range between unbounded preceding and current row, and I have an order by, therefore current row is the last row of my peer group. So effectively, the first row can see two rows, the second row can see two rows, the third row can see four rows, the fourth row can see four rows. And the same thing down here; it starts over from here again. Yes, sir? Yeah, so the question is, in one case here I'm just ordering by a column value, and in the other case I'm creating a Boolean. Effectively, what I'm really partitioning by is true or false; again, this is Postgres, we have a Boolean type. So when I partition by that expression, I'm effectively just partitioning by true or false. That's correct. That's correct, that would be just too much. Now, I could do modulus 10, and then I'd get 10 partitions, right? Because then I would be doing a modulus. So I could put an expression, a function, a column there; I could do some kind of arithmetic. In this case, I just used a column and a Boolean, but I could put anything in there. I could put a factorial, whatever I wanna put in there. I could do alphabetics; I don't even have to use numbers, I could use letters, I think. I'm sorry? So, can you do it inside of a range clause? No; you can do it in a partition by, and you can also do it in an order by. You could order by an expression if you wanted to. So, partition by x in: yes, you could use the in clause; I'm sorry, you could use anything, any SQL expression, a function call, yeah, it could be any expression. Well, it depends. Oh, that's right: because in would only return true or false, in that case you would only get two partitions.
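A sketch of partitioning by an expression, as just described: the Boolean x < 3 splits the same ten rows into two partitions, and the default frame (range mode, no order by) sums each whole partition:

    SELECT x,
           x < 3 AS small,
           sum(x) OVER (PARTITION BY x < 3) AS partition_total  -- 6 for the x < 3 rows, 24 for the rest
    FROM unnest(ARRAY[1,1,2,2,3,3,4,4,5,5]) AS t(x);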
Yeah, thank you. Yes, sir. Right. So the question is, if you had a mismatch between the partition and the order by, how is that gonna behave? The order by applies within the partition itself; it does not cross partitions. Peers have to be in the same partition. In fact, by definition, and I guess I didn't make this clear, you can never cross a partition with a window function. Just like optimization fences for common table expressions, partitions are boundaries which cannot be crossed. And I'll show a bunch of examples later. So effectively, every time you cross a partition boundary, you are starting with a brand-new set. And again, this specification is the same one as before, and now it makes a little more sense: current row, order by, its peers, and you can kind of see how it looks. Now, if I go from range mode to rows mode, right, see that? Rows mode: well, it's one, two, three, four, then one, two, three, four, five, six. Nothing exciting here at all, because the order by doesn't have an effect on current row; it's just the current row. So the order by is kind of immaterial here, because I'm not really using any peer groups. There are some other reasons to use it, for example for some of the window functions I'll show you later, but in this case there's no value for this particular query. Questions, yeah. So the question is, if the rows were already in order and you didn't have the order by clause, would you still get the same result? The answer is no. One of the things that I'm not doing here, which I technically should be doing, is adding an order by outside of the window function. The reason I'm not doing that is for clarity. Because I'm only using one order by, the output comes out in the order that I specified. The SQL standard doesn't actually require that, but that's how Postgres happens to operate. So effectively, when I only have one order by, I always get the results out in that order. But you can have multiple window functions with different order bys in the same query, and then you don't know what order you're getting the result back in. In all of those cases you're gonna want an outside order by, and I would always recommend that you have an outside order by to specify what order you want it to come out in.
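A sketch of that recommendation: two window functions with opposite orderings in one query, pinned down by an outer ORDER BY so the output order is deterministic:

    SELECT x,
           sum(x) OVER (ORDER BY x ASC  ROWS UNBOUNDED PRECEDING) AS running_up,
           sum(x) OVER (ORDER BY x DESC ROWS UNBOUNDED PRECEDING) AS running_down
    FROM generate_series(1, 10) AS t(x)
    ORDER BY x;  -- without this, the row order of the output is indeterminate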
No, because, okay, so the question is, if you didn't have the order by, would the sums appear the same? And you're kind of right. The reason that current row effectively means unbounded following by default, without an order by, in range mode, which is the default, is that a set without an order by has no definition of where the current row is. How would you really know where it's supposed to be? So they give you rows mode to override that and say: I don't care, I'm specifying rows mode; whatever order the rows come in, that's the order I want to process them in, and current row can mean current row. I've delivered this talk nine or 11 times since I wrote it, and once or twice people have pointed out that the reason those defaults are there is that the sets are not ordered unless you give an order by, and therefore the meaning of current row wouldn't make any sense for specifying the frame. And that's why you have to say rows mode when you really want to override the unordered nature of the set, or add an order by; then there is a real meaning for current row, because you have ordered it. So I'm kind of backing into the answer, but effectively that's the reason the defaults are the way they are. Even though they're practically inconvenient, that's the reason. In almost every case you're going to want to specify an order by, even if you're using rows mode, because you want to make sure you're getting the data in the order you need to process it. And when you start to mix multiple window frames in the same query, everything goes out the window, because you don't even know what order you're getting the results back in. So that's a lot more detail than I normally give, but you're absolutely correct: the reason the default is there is that the ordering is indeterminate without an order by. Window-specific functions. Up until this point, I've shown you only things like count and sum and average, things you've seen already. But there are a bunch of functions designed just to be used with windows, a bunch of aggregates designed just to be used with window functions. Very, very powerful. First one: row_number. All it does is number the results. So here's my table, and here's the row number for each one. Now, one thing you should notice here: row_number takes no arguments and operates on partitions, not window frames. I have a slide later that summarizes this, but every time you look up a window function in the manual, it will say it operates on partitions or operates on window frames. There's a whole bunch of functions that don't care what window frame you define; they operate on partitions. You can go range mode, rows mode, unbounded preceding until the cows come home, and you're gonna get the same result from that function. And that's what I was saying: when you're reading the manual, be very clear about what you're looking at, how that particular function operates, because they're gonna be different, all right? Another one: lag. Lag is the row before my current row, okay? So in this case, for one, there is no row before it, so it returns null. All the other rows just shift down by one, okay? This is, again, a window-specific function; it can only be used with an over clause or a window clause, okay? And it's always the row before, all right? And you can do lag with a two: give me the row two back. And again, it returns null for the rows that don't exist. I can do lag and lead. So here's the row two before, and here's the one two rows after. And the same qualification I gave you earlier: these operate on partitions, not window frames. Doesn't care about rows mode; does care about order by, okay? Doesn't care about between or unbounded or any of that stuff; just doesn't care, okay? And you can define the default for nonexistent rows.
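A sketch of the partition-based functions just described: row_number, lag (including the two-back form with a supplied default for missing rows), and lead:

    SELECT x,
           row_number() OVER w AS rn,
           lag(x)       OVER w AS prev,       -- NULL on the first row
           lag(x, 2, 0) OVER w AS two_back,   -- 0 supplied as the default for missing rows
           lead(x, 2)   OVER w AS two_ahead   -- NULL on the last two rows
    FROM generate_series(1, 10) AS t(x)
    WINDOW w AS (ORDER BY x);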
First value, last value: give me the first value within the, help me here. What? The frame, right, right here. This is our first window-specific function that deals with frames, not with partitions. All these other ones operate on partitions; this one operates on frames. I have a slide at the end, you don't have to remember this; a big slide with all of this stuff in one place. I know people pull up my presentation just to see that one slide, right? Because you forget. But this one operates on window frames, so in fact the order by and the frame I specify actually make a difference, okay? Now, why is that important? Because look: first value seems to work just fine. It knows the first value is one. Last value, not so much. Why is it doing this? Somebody want to explain this? What is the default end frame? Exactly, and that's why. Unbounded preceding is the default beginning, but the end of the frame defaults to current row, and current row, in range mode with an order by, means my peers. So literally, this last value is just the value at the end of the current frame, and that's why it's kind of hopping along: as current row moves down, the window frame gets bigger, and I can see more. Yes, sir. But if there's no order by, the whole thing, yeah: if there's no order by, then this is all gonna be fives. Or if I go into rows mode and I do unbounded preceding and unbounded following, I get fives, okay? So again, if you're using last value, keep in mind it's using frames; you may want to specify this kind of thing, because you probably don't want the default behavior for last value. You might, but just be aware of what you're asking for. This is why I'm saying don't use the defaults; write what you want, because the defaults can trip you up so much, okay? Here's another window function: nth value. It gives you the nth value in the set. These operate on frames, right? So the first two rows don't see a third row in their frame; only at the third row does the frame contain a third row. Same thing over here: it doesn't see a seventh row until row seven, because I'm using the default. If I spell out range between unbounded preceding and current row, I get this; it's the default, so I get the same thing, right? But now, if I go rows between unbounded preceding and unbounded following, bam! I see the third row right away, I see the seventh row right away, right? Be aware of what you're asking for. Every time you use a function: is it a window function, is it using frames or not, and what's my window frame definition? Rank and dense rank. These are cool; not so much in this slide. What it's trying to do is rank my values. But without an order by, there is no rank. Even though it's operating on a partition, it doesn't know what the order is, so it's kind of useless. And even if I specify the frame, it's operating on the partition, so it doesn't make any difference. If I go to rows mode, it still doesn't make any difference, okay? But when I add an order by, now this starts to make sense. What it's trying to do is take my values and rank them. Now, there are two types of rank. If two people cross the finish line first at the same time, they're both first. What's the rank of the next person who crosses the finish line? Three or two? Right, both answers are right. It's three if you consider that the two people who finished first really take the first and second spots, so the next guy literally is third, right, because there were two before him. Or I might not want a gap, and I want that person to be second. So: rank will leave a gap. These two people finish first, so they both get first, then the next two are third, then fifth and fifth. With the other one, I've got two first-place people, two second-place people, and two third-place people. That's dense rank; that's the difference between rank and dense rank.
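Two sketches of what was just shown. First, the last_value trap and its fix: the default frame ends at the current row's peers, so you extend it explicitly (nth_value included for the same reason):

    SELECT x,
           last_value(x) OVER (ORDER BY x)                          AS hops_along,  -- 1, 2, 3, 4, 5: frame ends at current row
           last_value(x) OVER (ORDER BY x
                               ROWS BETWEEN UNBOUNDED PRECEDING
                                        AND UNBOUNDED FOLLOWING)    AS really_last, -- always 5
           nth_value(x, 3) OVER (ORDER BY x
                                 ROWS BETWEEN UNBOUNDED PRECEDING
                                          AND UNBOUNDED FOLLOWING)  AS third_value  -- always 3
    FROM generate_series(1, 5) AS t(x);

And second, rank versus dense rank on duplicate values; rank leaves gaps, dense rank doesn't:

    SELECT x,
           rank()       OVER (ORDER BY x) AS rnk,       -- 1, 1, 3, 3, 5, 5, ...
           dense_rank() OVER (ORDER BY x) AS dense_rnk  -- 1, 1, 2, 2, 3, 3, ...
    FROM unnest(ARRAY[1,1,2,2,3,3,4,4,5,5]) AS t(x);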
Percent rank and cumulative distribution: these try to give me a sense of what percentage of people are less than me, or what percentage of people are equal to or less than me. The gory details are down here. Effectively, one of the odd things is that percent rank excludes yourself. And there are some reasons for that, particularly because if you have one person who's highest, there's really no one above him; he's greater than 100% of everybody, but if you count him or her, then it's not really 100%. So there are all these goofy things going on. Cumulative distribution is a little easier, because you include yourself: 100% of the people are equal to or less than me, 80% of the people are equal to or less than me, all the way down. I wrote a blog entry about this. In fact, I don't really like either of these. I'd love to have a function that considered everyone less than me plus half of the people equal to me. If I get an SAT score, I want to know how many are less than me, and then let's assume half of the ties are below me, right? That would be a better number, but there's no name for that and we don't support it. I guess somebody could write a version of it; again, I have a blog entry about it. Finally, ntile: I'm trying to break everything up into three parts. So here's my first part, here's my second part, and here's my third part, chunking my groups together, okay, again based on the order by. If I add a partition, again the same partition we used before, what I get is: here, these people are first, these people are third within that partition, and then the rank starts over again. This is a great illustration of the way partitions work. In the new partition, these people are first, these people are third, or second with dense rank, okay? You never cross partitions with these window functions. Here's another example with percent rank, cumulative distribution, and ntile, again using partitions; again, it starts over at zero for the second partition. Okay, any questions? So now we're getting to the meat, after going through all that sand and jellyfish and everything. Now we're gonna look at an actual example, and we're gonna apply window functions to it. I'm gonna show you analytics, and I'll show you why you do it this way. And again, the idea is that you're not just gonna be copying a window function from someone else; by the end of this, you're gonna be able to use and write your own window functions, knowing what all these things mean, knowing what all the defaults mean, knowing whether your functions operate on frames or partitions, and what order by does, and all of that stuff, so you're not gonna get surprised. We have an employee table, and we have one, two, three, four... seven employees, okay? They have a department and they have a salary. It's the very traditional employees example. We do a select on the employee table, and we have seven employees in different departments, yeah. Now we do an aggregate. This is not a window function; we're just gonna do a traditional aggregate. So count: how many employees do we have? Seven. Total salary: this. Average salary: that. No window functions here, this is just straight aggregates. I can do a group by, and I can get the number of people in each department, and the sum and average for each department.
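A sketch of the distribution and bucketing functions covered above, on the same duplicate-laden values:

    SELECT x,
           percent_rank() OVER (ORDER BY x) AS pct_rank, -- fraction ranked below you, excluding yourself
           cume_dist()    OVER (ORDER BY x) AS cume,     -- fraction at or below you, including yourself
           ntile(3)       OVER (ORDER BY x) AS bucket    -- splits the set into three chunks
    FROM unnest(ARRAY[1,1,2,2,3,3,4,4,5,5]) AS t(x);

And here is a hypothetical version of the employee example just introduced; the talk's actual figures aren't fully shown, so these salaries are made up (they total 40,200 and give Sandy and Andy a deliberate tie, to match details mentioned later, but they are otherwise illustrative):

    WITH employee(name, department, salary) AS (
        VALUES ('Mike',  'sales',   7000), ('James', 'ops',   6800),
               ('Betty', 'sales',   6200), ('Sandy', 'sales', 5600),
               ('Andy',  'ops',     5600), ('Tracy', 'ops',   4800),
               ('Carol', 'support', 4200)
    )
    SELECT department, count(*) AS staff, sum(salary) AS total, avg(salary) AS average
    FROM employee
    GROUP BY department;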
Wonderful; but again, not a window function. I've lost the distinctness of my employees: they're now grouped together, flattened into one line per department. And we talked about the idea of window functions giving us the ability to maintain the distinctness of these rows; that's what we're gonna show you now. Here's another one: rollup. This is still not a window function, but it does give me the option of getting totals by department, and then a final line which has the rollup of all of the rows together. Still not a window function. So here are our employees again, ordered by salary, and here's our first window function. We have the names, we have the salaries, and then we have the sum: the sum with the notorious empty parentheses as the default. And remember we used to have 55 all the way down over here; now we've got 40,200 all the way down over there, right? That doesn't look very useful, does it? Right: we've maintained the seven distinct employees, and now we've got the same total salary on all of them. What good is that? Well, we can actually make some use of it. Here's an example: we can take that window function, divide the person's salary by it, multiply by 100, and round to two decimal places, and now I can see each employee's percentage of all salary. Isn't that cool, right? Instead of showing the same total everywhere, now we can actually use that number and get a percentage. And here's an example where we do cumulative totals. Now, it's not very useful in this case, but what we're doing is using order by, rows between unbounded preceding and current row, and because we're in rows mode, we're effectively moving down and the total gets higher each time. This is probably useless here, but if you had cumulative totals for time-series data, like how much rain this month, where each day you want the running total for the month, that would be a good use, okay? Keep in mind, I am using rows mode here. When I originally wrote this query for this talk, I used the default. Why is that wrong? Bingo. Andy and Sandy have the same salary; therefore, in range mode, they would have the same sum. That is probably not what I wanted. Somebody pointed that out during a talk. Yes. Yes, I had to apologize, and I fixed it within the hour. I got bitten because I was so focused on showing you examples that I didn't do what I'd just told you to do. Very interesting. Lesson learned. Here's another one. This is a window function showing the overall average next to each employee. It's useful, but again, it's the same average on every row. What good is that? Well, as you might imagine, I can subtract the person's salary from the average, and everyone in red is above average in salary, and everyone in blue, unfortunately, is below average in salary. It's easier to do this here than trying to do it in my application. What about this? Instead of comparing the salary to the average, I want to compare it to the next salary, the next salary forward, okay, because I've ordered by salary descending. So how much more is Mike making than James? How much more is James than Betty? How much more Betty than Sandy? How much more Sandy than Andy? How much more Andy than Tracy? How much more Tracy than Carol? You get the idea. So instead of comparing to the average, I'm comparing each person to the one with the next-highest salary. Yes. So, shouldn't Sandy and Andy show the same thing? No: these have no concept of peers, so it doesn't matter; they would always be different, because with lag and lead you're literally always accessing the next row's value, not the next peer group. And you could maybe get peer behavior with a partition by or something? I'm not sure; maybe that would be a way of going at it, yeah.
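A sketch of the salary analytics just walked through, on the same hypothetical figures as before: percentage of total payroll, a rows-mode cumulative total, distance from the average, and the gap to the next salary down:

    WITH employee(name, department, salary) AS (
        VALUES ('Mike',  'sales',   7000), ('James', 'ops',   6800),
               ('Betty', 'sales',   6200), ('Sandy', 'sales', 5600),
               ('Andy',  'ops',     5600), ('Tracy', 'ops',   4800),
               ('Carol', 'support', 4200)
    )
    SELECT name, salary,
           round(salary * 100.0 / sum(salary) OVER (), 2)   AS pct_of_total,
           sum(salary) OVER (ORDER BY salary DESC
                             ROWS BETWEEN UNBOUNDED PRECEDING
                                      AND CURRENT ROW)       AS cumulative,    -- rows mode, so the Sandy/Andy tie still differs
           salary - round(avg(salary) OVER (), 0)            AS vs_average,
           salary - lead(salary) OVER (ORDER BY salary DESC) AS vs_next_lower  -- NULL for the last row
    FROM employee
    ORDER BY salary DESC;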
I don't know what these salary numbers are; I don't even know if they're monthly or yearly. I picked something that rounded well, because they come out to nice numbers. Now, this is interesting; this goes a little meta. We're computing how much more each person is making than the lowest-paid employee, and then by what percentage they make more than the lowest-paid employee. You sometimes see this type of analysis in corporate environments. So here, this person's making 25 more than the lowest-paid person, and 54% more than whatever, and so on; you get the idea. And I have used a window clause here, because I needed last value, which is a frame-based function, to see that bottom value. I can do rank and dense rank. So here's the rank of the employees. You'll notice that Andy and Sandy make the same amount of money, so Tracy is either sixth or fifth within that set, depending on whether you use rank or dense rank, right? Makes sense. I can do departmental averages: I can show the person, their salary, the average of their department, and how much more or less that person is making than the departmental average, all right? And I'm doing that by creating a partition by department here, computing the average within that department, and computing their salary minus that average, all right? Kind of interesting, yeah. I don't know what that question is, I'm sorry; I can't understand what you're asking. The question is, could you exclude somebody from the output and still have them in these calculations? Absolutely, you could; though not with a plain where clause, because the where clause filters rows out before the window functions see them. You'd compute the window function in a subquery and filter outside it. So here, instead of repeating partition by department in every over clause, I can create a window called d and then just refer to d in each case, right? Pull it together a little bit. I know; again, I was gonna need those extra five minutes, so just hold on here, we are very close to being done. So instead of comparing the person to the department average, we can compare the person to the next salary within that department. Not to the average of the department, but to the next salary in the department. So Mike is 800 more than Betty, Sandy 800 more than Carol. And again, James is 1200 more than Andy, Andy 800 more than Tracy. So you get an idea of how they break down within their departments, okay? Again: comparing to the department average, and comparing to the next-highest salary in the department, different numbers. This one is really interesting. I can find the person's rank within their department, using a named window, and also the rank within the company. So here is the rank within their department, and here is the rank within the company: a good example of using different window clauses in the same query. Okay, any questions?
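A sketch of the departmental versions with a named window, as described; note the department average uses a window without an order by, so its frame is the whole partition rather than a running frame:

    WITH employee(name, department, salary) AS (
        VALUES ('Mike',  'sales',   7000), ('James', 'ops',   6800),
               ('Betty', 'sales',   6200), ('Sandy', 'sales', 5600),
               ('Andy',  'ops',     5600), ('Tracy', 'ops',   4800),
               ('Carol', 'support', 4200)
    )
    SELECT name, department, salary,
           salary - round(avg(salary) OVER d, 0) AS vs_dept_avg,
           salary - lead(salary) OVER ds         AS vs_next_in_dept,
           rank() OVER ds                        AS dept_rank,
           rank() OVER (ORDER BY salary DESC)    AS company_rank
    FROM employee
    WINDOW d  AS (PARTITION BY department),
           ds AS (d ORDER BY salary DESC)
    ORDER BY department, salary DESC;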
So I just want to end with an overview of things you need to remember, to kind of gel it in your head, and then we can all head out. First, when you're setting up a window function: do you want to split the data? Then you're gonna need a partition. Do you want an order to the values in the partitions? Then use order by. How do you want to handle rows with the same order by value: range mode or rows mode, rank or dense rank, right? A lot of times, people writing window functions get caught up and don't test for duplicate values; then duplicate values show up and they get the wrong results. And you don't want somebody waving a sheet of paper in front of you saying this number is wrong, and then you have to figure out why, because you've forgotten something. That has happened to me. Do you need to define a window frame? Maybe you're gonna use it in multiple places. Window functions can define their own partitions, orderings, and window frames, so you can have multiple different window frames, orderings, and partitions in the same query; they're independent, okay? Multiple window names can be defined in the window clause: you define them once and use them multiple times. And pay attention to whether a window function operates on a frame or a partition; the documentation is gonna tell you that. So this is the slide I promised. It basically shows you that some functions operate on frames and some operate on partitions. On the frame side, some of them are generic aggregates like sum and average, and some do row access: first value, last value, and nth value, which again operate on frames. These other functions all operate on partitions; some of them are related to row access, some of them are related to ranking. And again, feel free to grab this slide next time you have to roll up your sleeves, because it's a nice handy thing to have. In terms of what we have implemented: we are adding some things in Postgres 11. Effectively, we are allowing a range frame to specify peer groups whose values are within plus or minus a specified preceding or following offset. What that effectively means is you can go backward or forward based on the values around your current row. If you looked at my examples, every time we did following and preceding, we were in rows mode. If you do following or preceding with an offset in range mode, that's the behavior you get: in Postgres 10 you'll get an error, and in Postgres 11 that'll work. Also, we now have a groups clause, which didn't exist before, in addition to range and rows, and that allows me to specify a number of peer groups plus or minus the current peer group. Range with an offset is plus or minus a literal value, like plus or minus 10; groups is how many peer groups before or after my current row. And finally, we have a frame exclusion clause, which you can use after your frame specification, and which allows you to exclude the current row, exclude your current peer group, or exclude your peers other than yourself, to filter different results, okay? So that is it. Again, I apologize for going a little over, but I wanted to give you a good 20-minute break, and I also didn't wanna cut off any of your questions. I think we've had a good discussion, and I hope it has worked well in terms of helping you understand this, and that you're gonna go home and make better use of your SQL engine: to do analytics, and to do imperative kinds of operations with common table expressions, with a lot more confidence. So again, these slides are on my website; feel free to download them anytime, or take a look at what else I have there. I hope you enjoy the rest of the conference. Thanks.
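For reference, a closing sketch of the Postgres 11 additions just described: a range frame with a value offset, groups mode, and the frame exclusion clause. These error out on Postgres 10 and need 11 or later:

    SELECT x,
           sum(x) OVER (ORDER BY x
                        RANGE BETWEEN 1 PRECEDING AND CURRENT ROW)  AS near_values,     -- peers with values within 1 below mine
           sum(x) OVER (ORDER BY x
                        GROUPS BETWEEN 1 PRECEDING AND CURRENT ROW) AS grp_sum,         -- my peer group plus the one before it
           sum(x) OVER (ORDER BY x
                        RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                        EXCLUDE CURRENT ROW)                        AS everyone_but_me  -- the total (12) minus my own value
    FROM unnest(ARRAY[1,1,2,2,3,3]) AS t(x);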