that marks those deleted rows as available space for future rows. It also does a whole bunch of other maintenance stuff in the background for Postgres as far as data visibility and things like that. But it's much faster for Postgres to vacuum a small table than it is a large table, easy as that. So it also just reduces the transaction time of the vacuum. If you end up having a multi-terabyte table, it's gonna take a long time to vacuum. If you have it broken up into daily tables, vacuuming is probably still gonna take longer in total than it would have on one table, but it's broken up into several thousand chunks and transactions instead of being done as one giant transaction. And the other thing is, by doing this, if the old data isn't being touched anymore in those old tables, vacuum can just completely skip over it. So you're not actually vacuuming all those millions and millions of rows every time it has to vacuum the partition set. It's only vacuuming the more recent tables that are actually getting updates to them. So that's another big reason for it.

And the last thing is query performance. As tables grow in size, read performance can be impacted, but it's really on really, really large tables where the query performance benefit actually shows up. It avoids larger index scans and table scans; you're scanning smaller chunks at one time. And we'll talk about partition pruning as well. It can help the query planning be more efficient; it's able to skip over some rows or skip over some tables if it doesn't need to touch them. So any questions about that? Sorry, it's a little chilly in this room.

So, the old way in Postgres. Back before PostgreSQL 10, if you go back and look at the documentation, you could partition before partitioning was built in. It was just a lot of manual work. You made use of three primary features in Postgres. All these features are still there in Postgres, you just don't need to use them for partitioning anymore. The first is table inheritance, where one table inherits from another table, and it inherits the column structure and defaults and all those kinds of things from that top table. Then triggers would route the data: there would be a trigger on the parent table that would basically intercept the write and say, don't write to this table, go write to this table instead. And then constraints, which would constrain the data inside the appropriate child table and not let it be in any other table as part of that partition set. And you had to write all that manually or write some custom automation for it. And that's where I ended up writing the PG Partman extension, because around that time 9.1 had come out and I was tired of writing the same thing for the third or fourth time for another customer. So that's why, and I'll talk more about that later.

There are still, there are very, very few, I haven't come across any more myself, but there may be some narrow use cases where you may need to do this old method of partitioning. Back in Postgres 10 when it first came out, there were a lot of places where you would probably still need to do the old trigger method. But as of Postgres 15, 16, and more recent versions, there probably isn't a reason for it out there anymore. I haven't seen one myself. So if anybody else has one, I'd be curious to hear about it.
And I'd mostly be curious to hear about it because that's probably another thing we could take to core and say, hey, can you fix this so we don't have to do it the old way anymore.

So, the new way, if you're looking through the documentation. I used to call this native partitioning a lot, but to be consistent with what the documentation is calling it, they call it declarative partitioning, so that's what I'm trying to call it myself as well. It makes use of SQL syntax commands; we'll see those here in a few seconds. There are three types, and I'll talk about these three types as well: range, list, and hash. And the built-in methods, which are called tuple routing and partition pruning, are far, far more efficient than triggers and constraint exclusion and things like that, to a vastly noticeable degree. I would encourage you, if you're really curious, go write a trigger. Go write a trigger to route your data from the parent to the child, and then do declarative partitioning. You'll see a drastic difference. So this was a huge win for Postgres to get this built in.

So the first type, range partitioning. You can partition by one or more columns, though usually you're just doing it by one, with no overlap of the data in those ranges. Just as an example here, we have a measurement table. You'll notice on the parent table, the measurement table, all we're doing there is telling it what the partition column is. We're not telling it how to partition, what the interval is or anything like that; that's not defined on the parent table. You actually define that on each individual child table, which is the next two statements down here. So this is partitioning by month. We see the first one is limited to February, the next one is limited to March. You'll notice the lower and upper boundaries: it's inclusive of the lower value and exclusive of the upper value. It's always like that; you can't change it for the built-in partitioning. That would be one of those narrow use cases: if you needed it to be inclusive of the upper range, you'd have to write it yourself. I don't know why you would need to do that, but that would be the one thing you can't change. And then at the bottom, you can see just doing a description of the table. If you do a \d+, it'll list all the child tables and show you the conditions. The name of the child table is completely arbitrary; it can be whatever you want. Most people follow some sort of naming pattern like that just so they can figure it out, but the name of the child table is completely arbitrary. It's the conditions on it that you have to watch for.

So, any questions about range there? Yes. Oh, the question is what data types you can use. Anything you can define a range for, but it's probably gonna be time, integer, numeric. Basically anything you can define an upper and lower boundary for. It's not really limited in what... what's that? GUID? As long as there's a comparison operator for one value being greater or less than the other, then yeah. People do partition by GUID; I'll actually talk about that in a little bit here. So, as long as there's a way to compare one value to another and one value can be greater or less than another, then you can do a range, as far as I'm aware. What's that? Yes, so for March 1st, that would go in the partition that's set for March, because it's inclusive on the beginning and exclusive on the end.
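As a rough sketch, that slide's range example looks something like this (the measurement table name is from the talk, but the column names and dates here are my assumptions):

```sql
CREATE TABLE measurement (
    city_id  int NOT NULL,
    logtime  timestamptz NOT NULL,
    peaktemp int
) PARTITION BY RANGE (logtime);

-- Lower bound is inclusive, upper bound is exclusive; that can't be changed.
CREATE TABLE measurement_2024_02 PARTITION OF measurement
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

CREATE TABLE measurement_2024_03 PARTITION OF measurement
    FOR VALUES FROM ('2024-03-01') TO ('2024-04-01');

-- \d+ measurement lists the child tables and their partition bounds.
```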
So, the exclusive boundary on the upper end, well, that lets you account for every single possible microsecond all the way up until that point. So yeah, same again. Okay, yeah.

Yep, so the other method is list. This is where you explicitly state what values can exist in each individual table. I think this is the example that's actually in the documentation, or it used to be. You can actually use a function there. So this is partitioning by city names, and we're actually using the initcap function, which just capitalizes the first letter of each word of whatever the name is. We just have this one child here for cities in the West, Los Angeles and San Francisco, then we have one check constraint in there just to make sure the city ID is not equal to zero, and just the one child table in this example. The slides will be available for everybody as well; I'll have a link at the end and I think I can upload them to the schedule later. So yeah, list is very simple. You're telling it explicitly what goes in each child table.

And the last one is hash partitioning. This is for when you want to partition on data that's randomized or growing evenly. That's important, growing evenly. Or you don't know what your data distribution is gonna be in advance. I very rarely come across people that don't know what their data distribution is going to be in advance, and if you know it, please don't do this. You'll see why. So the way it's defined, you actually have to define it using a modulus. There's one hash function that's built in, and that's the only thing that's supported as far as I'm aware right now. You tell it the modulus, which is how many child tables there are gonna be, and then the remainder is usually just gonna be zero up to one less than the modulus. It could be different, but generally that's what it's gonna be. You can see what the parent table looks like, and on the bottom there you can see the partition constraint. It's just a function. It's a black box function; well, it's not really a black box, you can go look at the source code and see what it's doing, but for most people it's like a black box hash function. And then, just as an example, this is my hash partitioned table. I'm just gonna throw a bunch of random usernames in there, and then you can do a group by and see what the contents of all the children are.

So like I said, if you can identify the column to partition by, range or list are gonna be much better long term. Because with hash partitioning, you cannot add or remove a child table without recreating the entire thing. And the data often, you can even see in this example here, the data's already unbalanced in partition one. It's already got way more values just in this tiny little data set. So it often becomes very, very unbalanced unless your data actually is random. And people will say, well, what about UUIDs? Go look at UUIDv7 or ULID. Those are UUIDs with a timestamp built into them. You can try to do that; I would recommend looking into that instead. Or just don't use UUIDs, just use regular composite primary key columns or something like that. So any questions about range, list, hash? Yes, a list, yes.
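Hedged sketches of what the list and hash examples being described might look like; the table names and values here are illustrative, not the actual slides:

```sql
-- List: explicitly state which values live in each child table.
CREATE TABLE cities (
    city_id int NOT NULL CHECK (city_id <> 0),
    name    text NOT NULL
) PARTITION BY LIST (initcap(name));

CREATE TABLE cities_west PARTITION OF cities
    FOR VALUES IN ('Los Angeles', 'San Francisco');

-- Hash: the modulus is the number of children; remainders run 0 .. modulus - 1.
CREATE TABLE hash_users (
    username text NOT NULL
) PARTITION BY HASH (username);

CREATE TABLE hash_users_p0 PARTITION OF hash_users FOR VALUES WITH (MODULUS 3, REMAINDER 0);
CREATE TABLE hash_users_p1 PARTITION OF hash_users FOR VALUES WITH (MODULUS 3, REMAINDER 1);
CREATE TABLE hash_users_p2 PARTITION OF hash_users FOR VALUES WITH (MODULUS 3, REMAINDER 2);

-- Group by child table to see how evenly (or unevenly) the rows were spread.
SELECT tableoid::regclass AS child, count(*) FROM hash_users GROUP BY 1;
```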
No, no, that was just, yeah. Yes, no, no. You have to make a new table. Yeah, there's no way to add or remove one without making a brand new table.

Yeah, so just a note about a feature that's relatively new now, the identity feature in Postgres. It's basically the SQL standard for managing sequences inside the database. If you've ever used sequences in Postgres before, there are still sequences in Postgres, and you can attach a sequence to a column, make it the default value, all that kind of stuff. But then the sequence itself is independent of the table. If you've ever added a sequence to a table after you've made the table, you'll know that the permissions for that sequence are independent of the table. So you go to write to the table and you get an error: you're not allowed to use the sequence. That's a common thing around sequences in Postgres. Identity ties the sequence to the table. If you change permissions on the table, the sequence follows all the permission changes and all that kind of stuff. It also gives you better enforcement when you don't want to let people manually set the value of the column. You can do that with a sequence and then go write a constraint to not let people do that, or you can just use an identity and it automatically does all that for you. And that's really the main reason. It's also easier to remove: when you drop the table, the sequence is dropped along with it. If you need to use a sequence across multiple tables, this isn't gonna work, because this ties the sequence to this one table, but it's very, very handy.

As far as partitioning goes, if you do the old trigger-based method, identity really didn't work with that. And the identity will only work if you're inserting the data through the parent table. If you're inserting data directly into the child tables, that will bypass the identity. But I would highly recommend looking into this if you make a lot of use of sequences, like individual sequences for individual tables. Highly recommend looking into the identity feature.

So you can update data in partitions. That's, oh, I'm sorry, there was a question. Sure. I would question why. It's kind of a decision. He's asking if you have to insert something by going around the identity. The entire reason for doing this, if you have to do something outside of the identity, is that you probably shouldn't have been using the identity in the first place. This is an enforcement; you're not supposed to be able to do that. You can set the identity to allow manual values, like if somebody manually sets one, but then you may get an error later when the identity tries to insert that value, because you're entering a value that the sequence is eventually going to get to. So it all depends on what you're trying to do. I would say this is encouraging you to not do that. The only thing that should be setting the value is the identity. If you're trying to go around that, I would kind of question why in the first place. That would be my question, yeah.

Yes. So he's asking what's the difference from bigserial. So bigserial is not really a data type in Postgres, it's a convenience. When you set the data type to bigserial in Postgres, it just automatically makes a sequence for you. The column just ends up being, in this case, a bigint. It makes a bigint column, automatically makes a sequence for you, and ties it in as the default for that column.
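A minimal sketch of that contrast, with an illustrative table: the identity column rejects manual values unless you explicitly override, while the serial shorthand just wires up an ordinary sequence as the default:

```sql
-- Identity: the sequence is owned by and managed through the column.
CREATE TABLE orders (
    order_id  bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    placed_at timestamptz NOT NULL DEFAULT now()
);

INSERT INTO orders (order_id) VALUES (42);                          -- rejected with an error
INSERT INTO orders (order_id) OVERRIDING SYSTEM VALUE VALUES (42);  -- allowed, but may collide later

-- The bigserial shorthand, by contrast, creates a separate sequence and
-- simply sets it as the column default.
CREATE TABLE orders_old (
    order_id bigserial PRIMARY KEY
);
```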
But that is an independent sequence. Something else can go and modify that sequence. The permissions on that sequence are not managed with the table: when you do bigserial and then go try to let somebody else use the table, you have to grant the permissions on the table and on the sequence. So yeah. What's that? Yes, for updates you need to grant the permissions on the sequence; if you're just reading the data out, you don't. Yes. Yes. Right, right. Yep, thank you.

So for updating data across partitions, as of Postgres 11 this is just built in. If you update a row and the new partition value means that data has to move from one child table to another, it just automatically does that for you. If you go to do an update and the target child table doesn't exist, you'll get an error back because it has no place to put that data. There's a little bit more overhead in this, obviously. If you're doing an update within the same table, Postgres has some optimizations it can do as far as managing the space within that table, so this will not be as efficient as an update within a single table, but updates are still supported, and it's transparent to the user; they won't even know it's happening. Just be aware there's a little bit of a performance penalty: it's the same as if you manually deleted the row from one table and inserted it into the other table, because that's all it's doing on the back end.

There is some upsert support. There's no upsert command in Postgres; the command's actually INSERT ... ON CONFLICT, which is fancier than an upsert. There's some limited support; I'm not gonna get too deep into that. The Postgres documentation actually gives you the situations where you can do an upsert with a partitioned table. But it's mostly around INSERT ... ON CONFLICT DO UPDATE: if there's a matching unique constraint on the partition key, it just automatically updates appropriately. But there are a lot of other things you can do around INSERT ... ON CONFLICT, and they're not all supported with partitioning. So just take a look and test it out with whatever your use case is.

There is a default partition. This was also added in Postgres 11. And this is kind of how Postgres's iterative development works: very basic partitioning was added in Postgres 10, almost to the point where I didn't even recommend people use it, but it was there, the syntax and everything was there. It was actually easier to still do trigger-based partitioning in Postgres 10. But Postgres 11 and 12 both vastly improved built-in declarative partitioning. That's where update row movement was added in 11, the default was added in 11, index inheritance was added in 11, a whole bunch of stuff. So the default is for the situation where you don't want to lose data. Normally, if there's no default child and somebody tries to insert data into the partition set that doesn't have a matching child, it throws an error; it just doesn't do anything. If there is a default, it puts that data into the default. If you actually go and look at the default table, it basically has an anti-constraint: it looks at all the existing child tables to see what their constraints are, and it has the opposite constraint on the default. It's not that it's allowing anything to go into the default; it's allowing the opposite of all the existing children to go into the default.
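A minimal sketch of a default partition on the measurement example from earlier (the inserted values are just illustrative):

```sql
CREATE TABLE measurement_default PARTITION OF measurement DEFAULT;

-- With no child covering 2030, this row would normally fail with an error;
-- with a default partition, it quietly lands in measurement_default instead.
INSERT INTO measurement (city_id, logtime, peaktemp)
    VALUES (5, '2030-01-01 00:00:00', 20);
```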
So then when a new child table is added, it automatically updates that anti-constraint, and if you remove a child table, it automatically updates it too. You can look at it: if you do a \d on the default, you'll see it written out. It's the opposite of all the child tables. The thing along with that, though, is if there is data in the default and you go and add a new child table with a new constraint, and there's data in the default that matches that constraint, you will get an error, because that data can't exist in two places at the same time. The default has that constraint covered already, and you're trying to add a new child table with the same constraint. So to fix that, you have to move the data out of the default first, add your new child table, and then you can move the data back into that new child table. That's how that works.

I do not recommend relying on the default to keep data. You wanna keep it monitored and keep data out of the default. It's just there to catch accidents, and you need to go in there and fix those accidents, or just not have a default. If you don't want data that doesn't match a child to exist in a partition set, don't have a default; the user will get an error back.

So I actually, I didn't learn this the hard way, another user of PG Partman found this out the hard way when he reported an issue to me. He was running the maintenance to add new child tables to the partition set, and every time he added a new child table, it locked the entire table for five to ten minutes. Completely locked it out for reads and writes. And that was because every time you add a new child table, Postgres has to scan all of the data in the default. He had two billion rows in the default. Yeah. So basically it was doing an index scan, he had an index on it, so thankfully it wasn't doing a sequential scan, but it still takes a while to scan two billion rows. So he cleaned the default table up, and then new child tables were added and removed fine. So if you have a default, you definitely want some sort of monitoring in place to keep an eye on it and make sure data's not building up in it. Because it's not going to throw errors, it's just going to keep building up data.

Yes. He took all the data out of the default and put it in the appropriate child tables. PG Partman, I'll show you, there's ways to do it later. Yeah, so you can just migrate the data out. You basically move it to a temporary table, or another real table. In this case, he just did it with a real table: he moved the data out to another table, made the new child tables, and then moved the data back in. So yeah.

So what does partitioning gain you as far as query performance and stuff like that? Postgres has had a thing called constraint exclusion in it for a while. From the user's perspective it looks the same as partition pruning, but partition pruning is far more efficient. So this is my measurement table from the range example before. I'm doing a condition where city equals five. That doesn't match my partition column; I'm partitioning by time. So it actually has to go through, where's my mouse pointer here, you can see this is February 1st, then February 2nd, and then it goes all the way down, there's a little dot, dot, dot there.
And then all the way up to February 7th, it was index scanning every single child table, because it doesn't know which of those child tables have that city ID. It had to check all of them, because you didn't put a condition in there that matched the partition column. Now, what happens if you do do that? Here we're filtering by log time, and you can see it's only scanning tables that match. We're looking for everything less than February 4th, so it only looked in the tables that could have data less than February 4th; it skipped all the other tables. It's still a sequential scan, but it's doing a sequential scan because you're asking for all of the data in those tables in this case. You'll notice the query actually comes out slower in the second plan; that's only because this is a very, very tiny table. If you did this with real data, you'd see a vast improvement, especially in the planning time, in most cases.

So this is what constraint exclusion has always done and still does in Postgres. What constraint exclusion does is, instead of scanning the data in every single one of the child tables, it goes and looks at the constraint, sees what the rules of the constraint are, and then skips over the table if it knows that data can't exist in there. But in that case, it still has to check the constraint of every single child table. Partition pruning is aware of the entire partition set's rules, so it just goes straight to the tables that match the rules, and that's it. It's a lot faster. So any questions on that? Okay, okay, no problem.

So I said I'd talk about what's coming soon in partitioning. There's not a lot left, really, as far as base partitioning features. You'll see some things I consider more bugs than coming-soon features; I'll talk about those more later, the things I kind of work around. But as far as the base features inside Postgres, the only one that's really missing is global indexes, and I'll talk about those a little bit more, because I also kind of consider that a bug a little bit myself, too. I looked through the commitfest, which is the list of everything that's currently being worked on, and looked through the mailing list for the last few months. The only thing that's really mentioned is query performance improvements on the backend, and those are always gonna be there forever in Postgres. This one in particular is actually a pretty huge performance increase for partition sets with a large number of child tables. It's currently in the commitfest; that doesn't mean it's gonna be in Postgres 17, it's just actively being worked on right now as a patch. I'll talk about global indexes a little bit more later, but that's really the only big feature left; most of them have already been added. Like I said, index inheritance, all that kind of stuff has already been handled.

So, let's talk about PG Partman here a little bit. Like I said, I originally made this back in Postgres 9.1, when the only way to do partitioning was the inheritance, triggers, and constraints. And like I said, I got tired of writing the same thing over and over, and the extension feature had just been added in 9.1, so it was a nice happy coincidence. So I decided to write an extension for it, and it's gotten pretty popular now, as I've learned.
So, the declarative partitioning now manages the triggers, constraints, and inheritance and all that kind of stuff for you. So why do you still need Partman? Doesn't that handle everything? There are still some other things to manage and consider with partitioning. Like I said, it's an extension, so it's very easy to install. The one thing Postgres does not do for you is automatically create child tables. You have to do that yourself. And time and integer or ID based partitioning needs ongoing maintenance: it always needs new tables made, and it always needs old tables cleaned up. So that's primarily what PG Partman's function still is: to automatically manage all of that for time and integer based partitioning.

It pre-creates the child tables. You'll see there are a lot of other solutions out there that people talk about, and a lot of them will make the child tables on demand. As far as Postgres is concerned, if it's not a very busy table, you're probably not gonna notice much difference between this and doing it on demand. For a very, very busy table, you'll have maybe 100 transactions all trying to insert a new row into a child table that doesn't exist. Well, only one of those is gonna be allowed to make the new table, and all 99 others are gonna wait for that one to finish making that table. So you can see, for a very, very busy table, making tables on demand can cause a huge backlog. Everything will be running fine, and then your maintenance will go and run, and you'll watch thousands of transactions back up until that finishes. So Partman makes them ahead of time, before they're needed. It does still take a small lock on the table, but it's not waiting on anything; it's just making a new table and moving on, because nothing needs it yet.

Yes. Sorry about that. His question is, is it possible to make them ahead of time with declarative partitioning? PG Partman is doing declarative partitioning. It's not doing anything special; it's just running the declarative commands for you on the backend, automatically. Yeah, thank you. And when she's asking what makes it shorter: there's no additional transaction writing any data. If your transaction was, say, writing 1000 rows, that transaction has to wait for all 1000 rows to be written, because the DDL that makes the table, along with the write, is all in one transaction. A new table, yeah. Yes, you tell it. Yeah, so she's asking, how does it know? Sure, there's a value in the configuration called the premake, and it's set to four by default. So whatever your interval is, it always tries to stay at least four child tables ahead. Okay, yeah, there's no magic, yeah.

Yes, I've seen discussion. I don't know what the plan is. I see Magnus peeking his head around; I don't know if Magnus knows anything. You made yourself obvious looking back there. Not really, yeah. So, yeah, I haven't seen any extensive discussion on it, and if there is any discussion, it hasn't been recent. Yeah, yeah, I agree, yeah.

So, like I said, Partman only does time and integer-based partitioning right now, mostly because, in that case, you're needing ongoing maintenance. Most other cases, I'd say like 90 to 95% of the time, it's a one-time setup; you're not always needing ongoing maintenance.
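As a rough sketch of that basic setup, assuming pg_partman's documented create_parent() and run_maintenance() interfaces; parameter names can differ slightly between versions, so check the docs for the one you install:

```sql
CREATE SCHEMA partman;
CREATE EXTENSION pg_partman SCHEMA partman;

-- The parent must already exist as a declaratively partitioned table
-- (the measurement table from earlier, partitioned by range on logtime).
SELECT partman.create_parent(
    p_parent_table := 'public.measurement',
    p_control      := 'logtime',
    p_interval     := '1 month',
    p_premake      := 4            -- always keep at least four future children created
);

-- Then run maintenance on a schedule (cron, pg_cron, or the background worker)
-- to create new children and apply retention.
SELECT partman.run_maintenance();
```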
There are other cases where partitioning can need ongoing maintenance, but they're a lot more complicated than Partman can handle right now. That's why, when people ask me why don't you support alphabetical partitioning and all that kind of stuff, it's like, well, there are 26 letters in the alphabet: make your 26 partitions and you're done. I don't need to automate that for you; it's not really that complicated. And for a long time, the only thing Partman supported was range partitioning. Now it does support list partitioning for single ID values. I'm hoping to have version 5.1 out at the end of this month, and that will have list partitioning for single ID values. It's only one value per child table, but it is ongoing: all the ones in one table, all the twos in one table, all the threes in one table, and that goes on forever. So that's a valid use case for list, in a case where things still need ongoing maintenance.

The other big thing it does for you is retention management. This is the thing I talked about before: you don't need to keep all the data in the database all the time. You just wanna either drop and get rid of all that old data, or detach it from the partition set and then move it out somewhere. With all of Partman's features that involve data removal, I try to be as safe as possible by default. So the default retention behavior is to detach the child table, not drop it, so you won't lose your data by default. You have to tell it you want to drop the table. Anything around losing data, I try to not lose data by default; you have to explicitly tell it to lose data. Also, there's a convenience script, a Python script right now, for when you don't wanna lose your old data but you don't wanna keep it in the database: it will dump the data out to a dump file and make a hash of it just to make sure it doesn't change. And there's a retention option that will move the data out to another schema: it'll detach the table and move it to another schema, and then all the script is really doing is asking, what schema do you want me to dump the tables out of? So it's not even specifically for retention; it's just a script to dump all the tables out of a given schema.

It also automatically makes a default table for you. In plain Postgres you have to manually make a default if you want one; when you make a new parent table with Partman, it will automatically make a default table for you. And it has a maintenance function to monitor whether the default table actually gets any data in it. You have to run that yourself; it's not something that's part of maintenance. Whatever your monitoring solution is, you can just call this function and it'll return a value: if any default tables managed by Partman have data in them, it will tell you they do.

There are a lot of options in PG Partman. You will probably not use most of them beyond the defaults, but they are there because I've come across these edge cases, and I'll talk about a couple of them here. It does have a background worker, so instead of having to use cron to call the maintenance function, it runs a process within the database. You just give it the interval to run on, and usually my recommendation for the maintenance interval is two times whatever your smallest partition interval is.
So if your smallest interval is daily, run it twice a day. That just catches the times it might be a little bit off for that day for whatever reason, because it's running on an interval, so it's not gonna run at the same time every day. It runs every, say, 10 minutes, but the run itself takes time, so that 10-minute window kinda wanders as the day goes on. But you don't need a third-party scheduler; it's just built into the database. You tell it how often you want it to run and it'll run that often. Yes, yes, yep, yeah.

So one person was asking how you partition an existing table. I didn't put a lot of detail in here due to time, because I could spend a lot of time talking about this, but I have a full how-to document on both online and offline partitioning with PG Partman. If a table is in use and you can't take it down, in most cases you can do it online; it all depends. The biggest thing that prevents online partitioning is if you have foreign keys to that table. You cannot do that online. You have to take the table offline, because you can't turn an existing table into a partitioned table; you have to make a new table and move the data to it. So if you have a foreign key pointing at it, you have to make a new foreign key to the new table, and you can't do that online. That's the one big blocker for moving data, and there are also some caveats with foreign keys in there as well that I'll talk about.

One thing people often don't realize until it's too late is that Postgres has a name length limit on its objects. It's 63 bytes, not characters, bytes. So if you're using multi-byte UTF-8 characters, it's still 63 bytes, so fewer characters. And Postgres will let you create the object; it'll just truncate the name and give you a warning back that it truncated the name. The thing is, with partitioning, most people put a suffix on the end of their table names. So if the end of the name gets truncated off, you're truncating off the suffix and you're probably colliding with multiple tables of the same name. So Partman handles that for you: it truncates the table name first and then adds the suffix onto the end. Even still, you wanna try to keep your names as short as possible, especially with ID-based partitioning, because with integer and ID based partitioning your suffix gets longer and longer as time goes on, so your usable table name gets shorter and shorter. So I just recommend keeping your table names as short as possible.

The other thing I added in there, too: I explained how the constraint exclusion feature works, and that works well for the partition column. But what if you're doing WHERE conditions on other columns in your partition set? It'd be nice to be able to take advantage of constraint exclusion there too. If that data is static, like the old data is not changing very much, Partman can automatically add constraints on additional columns for you so you can take advantage of that. It's not partition pruning, it's constraint exclusion on other columns. But I've seen people with very, very large tables turn this feature on and it dramatically increases the performance of their queries, because they're able to skip hundreds of child tables that they don't even need to look at anymore.
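A hedged sketch of how those knobs get set; the part_config columns and background worker settings shown here follow pg_partman's documentation, but the exact names and units should be verified against your installed version:

```sql
-- Retention: detach (don't drop) children older than 30 days, and park the
-- detached tables in an archive schema.
UPDATE partman.part_config
   SET retention            = '30 days',
       retention_keep_table = true,      -- detach only; dropping data must be opted into
       retention_schema     = 'archive'
 WHERE parent_table = 'public.measurement';

-- Optional background worker, set in postgresql.conf (restart required):
--   shared_preload_libraries = 'pg_partman_bgw'
--   pg_partman_bgw.interval  = 43200     # run twice a day for a daily partition interval
--   pg_partman_bgw.dbname    = 'mydb'

-- The additional-column constraints just described, for columns whose older
-- data is static:
UPDATE partman.part_config
   SET constraint_cols = ARRAY['city_id']
 WHERE parent_table = 'public.measurement';
```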
But it does mean the data needs to be static, because it's making the constraint based on the minimum and maximum values of that column and saying these are the values that exist in this child table at this time.

It does have sub-partitioning support. I would not recommend doing sub-partitioning unless you actually really, really need to, like you've done a benchmark and you can actually show that it benefits you. But there are some people that have business logic reasons, like they have to partition the data by a certain ID in addition to doing time. You can do that with sub-partitioning, but the data always lives at the lowest level, and you'll often actually get worse performance because you're increasing the planning time; it has to do more planning to figure out where the data is.

And I mentioned the PG Jobmon extension. If there's an error in the maintenance and that kind of stuff, you can set up a monitoring solution to query the jobmon extension, and it will tell you that there was an error in your function somewhere, and it will have a log of what the error is. Basically, if you think of how transactions work in Postgres, if you're trying to log an error that a function produces within the database, well, you produced an error, so the transaction rolls back. So you rolled back your log and you don't have the logging information. Jobmon does a trick with dblink to connect back to the database with a new session to make sure that error stays logged. It does add some overhead, but it allows you to log things. And you can use that extension for anything; I just make use of it in PG Partman to do logging. Another note, too: version 5 of PG Partman completely dropped trigger-based partitioning support. If you need it, it is still there in version 4, but I would highly recommend moving away from it as soon as possible.

So, some current issues in core Postgres. Like I said, no global indexes. What this means is, if you have a primary or unique key and that key does not include the partition column, you cannot make it. Postgres will not allow you to make an index that covers multiple tables if it doesn't include the partition key. If it includes the partition key, the partition key is already constraining what data exists in every child table, so it's kind of already a global index. If your key doesn't include the partition key, it's not supported. So what Partman does, and I'll talk about this more in a minute, is use a template table. You can still make the primary or unique key on each individual child table; it's just not enforced across the entire partition set, only on each individual child table. I know a global index has been a long-wanted feature inside Postgres, since well before declarative partitioning, and I think it's still going to take a while to support. It's not gonna be there for a little while.

This was an odd bug that I've run into in Postgres with unlogged tables. An unlogged table skips the WAL; it just writes directly to the data files. But that means the table isn't crash safe: your database crashes, you lose all that data, it's just gone. But it's much, much faster, so if you're able to reload the data, an unlogged table can be very helpful. However, if it's the parent of a partition set, the unlogged flag is not actually set in the catalog. It just completely ignores it. You say ALTER TABLE ... SET UNLOGGED, and it doesn't set it, it doesn't do anything.
So I'll show you how I handle that in Partman later. The other thing, too, is dropping child tables. If you have a foreign key to the partition set and you try to do a DROP ... CASCADE, you'll watch the entire foreign key disappear for the entire partition set, not just the child table. Yeah. If you don't do the cascade, you'll get an error back that it can't drop it because data referencing it exists. Usually what the cascade would do is delete the data and let you drop it, but here it also drops the entire foreign key. I don't know why. So the way around it is to clean out all of the data that the foreign key references, and then drop the child table; that will let you do it. Other than that, I don't know a way around it right now.

Relation options are not inherited from the parent. So if you try to turn autovacuum on or off on a per-table basis, if you set it on the parent, it's not set on the child tables automatically. And I also learned recently that if you're doing logical replication, the replica identity of the parent is not automatically set on all the child tables either.

So, the last thing for Partman: what does Partman do to work around all this? I have a thing called a template table, for things that you can't apply to the parent, like the unique key. You can't even make the unique key on the parent if it doesn't include the partition key. So how would you handle that? There's a template table: just make the key on the template table, and Partman will apply it to the children from the template table. I do the same thing because not all the relation options can be set on the parent either, so I set those on the template table. And again, the unlogged status: set it on the template table and it'll set it on all the child tables for you. Privileges are not automatically inherited from the parent either. I think that's probably intentional. What that means is you can't directly query the child tables by default; you have to go through the parent table. I think that's intentional and I don't think that's gonna be fixed. But there's an advantage to being able to access the child tables directly: you're basically bypassing partition pruning and tuple routing, so it's a lot faster. If you know the naming pattern of your child tables, you can just query them directly and not even worry about going through the parent. So I have an option in Partman that just says inherit the privileges from the parent. Same thing for replica identity: the upcoming version 5.1 will just automatically inherit the replica identity from the parent table.

So, just to kind of wrap up here. Partitioning is a pretty first-class feature in PostgreSQL right now. It's mostly feature complete; there are just some, not minor, but not gigantic issues other than the global index. That seems to still be the giant thing. I'll say it again: the primary reason for partitioning in Postgres is data retention. I would highly recommend, if you have a single query or a single table and you're having query performance problems, try to tune the table and its options and your query first before going to partitioning, because this is a lot of overhead that you may not need. And like I keep saying, I would prefer that Partman be completely obsolete.
That you could just do all this stuff in core Postgres and not need it. So I am not trying to take away from Postgres's features and say this should only be in my extension, don't do this in core. No, please do it in core. Please make my extension obsolete; I'd be ecstatic. So that's it. Any other questions? Yes.

I have opened an issue about the unlogged table thing. I don't know if I did it more than once; I know I did it once and the thread kind of fell off, so I don't know where that's going. I'm still trying to put together a proper test case for all the foreign key issues, because it's weird, so I can report it. I think it has been reported before as well, but there's a lot of stuff to work on. So yeah.

No, this is all written in PL/pgSQL. The background worker is the only thing that's in C, and that's actually completely optional. So you'll see this extension is up on RDS and that kind of stuff; the background worker is not in RDS as far as I know, but all of this is written in PL/pgSQL. I don't think something like this will make it into core, because it's very kludgy in my opinion. It's useful, but I don't think it's as streamlined and well thought out as something going into core would need to be. This has just been something that I've been working on, with a few people helping me where needed, but it's kind of just been a little one-man project. I honestly didn't realize how popular it was until I started coming back to conferences after COVID, because during COVID is when Postgres 10, 11, 12, 13 and so on came out, so more people were getting into partitioning, more people were coming to this, and then I come to a conference after COVID and a whole bunch of people are using it. I had no idea. So yeah. Any other questions? Yes.

He's asking about MERGE? I actually don't know. I haven't seen what MERGE support is like with partitioning; I actually completely forgot about that. Yeah, I don't know.

Yes, that's what the global index would do. He was asking whether global indexes would solve not being able to have a unique key across the entire partition set. Yes. A covering index, I don't think that's quite the same thing, but it's an index that covers more than one table, and it's apparently a very hard problem in Postgres right now just because of the way it's engineered. But I looked it up, and there are still active threads talking about this, as of last week, on the hackers list. I've seen this conversation going on since long before 9.1 and Partman came about; it's been a long-wanted feature. Yes. So those are the main problems; I don't know enough about the backend to be able to give you a good answer to that. Yeah.

Yes. What's, I'm sorry. Oh, I'm sorry, yes. Yep. Sure. So, I don't think he used the built-in method in Partman at the time we found that issue, because it wasn't very efficient and it was two billion rows. I think what he ended up doing was completely detaching the default, making a new one, and then slowly moving his data back into the partition set. What's that? Yeah, an INSERT INTO ... SELECT FROM, yeah. Yeah.

So there is a function in PG Partman for this. It's called partition_data_proc, and there's also partition_data_time and partition_data_id.
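As a hedged sketch of calling those (the exact signatures vary between pg_partman versions, so check the documentation for yours):

```sql
-- Procedure form: moves data out of the default in batches,
-- committing between batches.
CALL partman.partition_data_proc('public.measurement');

-- Function forms, one batch at a time inside your own transaction:
SELECT partman.partition_data_time('public.measurement');
-- or, for integer-based partition sets:
-- SELECT partman.partition_data_id('public.some_id_table');
```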
So what it automatically does, for the data in the default, is make a temporary table, and it works in batches of whatever the interval of the table is. So if it's daily, it'll grab a day's worth of data out of the default, move it out into a temporary table, make the child table, move the data back into the partition set, and it'll do that all automatically, just keep moving one batch after the other. The thing with that right now is that the smallest chunk you can do online is the size of the child table, because you have to move all of the data out of the default before you can make the child table. So you can't do a smaller chunk than the interval, because you can't move the data back in until the child table is there. It does all that in one transaction. I am looking at a way to be able to do the transactions in smaller chunks, but that means the data is gonna be missing from the table for a little while. So if it needs to be fully online, you can't do it fully online and also do it in smaller transaction chunks than the interval size. Does that make sense? Yeah, okay. Basically, move the data out, move the data back in, is the way to fix it.

And the monitoring function I have doesn't actually do a full scan of the default table. If it sees even one row, it just immediately stops and says, I found a row in the default. If you want a fully accurate count of how many rows are in the default, there's an option to that monitoring function and it'll do a full count of the default. But the default check is just: is there even one row in the default? If it finds one, it quits out and reports back that it found one; you can go investigate it further. So it's pretty efficient monitoring. Anybody else? And what time is it? Sorry, I think we've still got five minutes until the next session starts, or... oh, it doesn't start for another half hour, okay, all right. I'm happy to hang out and talk to people. Yep, all right, thanks very much.

All right, good morning. This is Graph Walking in Postgres for Fun and Possible Profit. That's what we call foreshadowing. It's the story of how I built a system to sift and correlate a large volume of data, using Postgres at the center of everything. About me, who I am: as I said, I'm Mark Breitscher. You'll find me as maker 17 in various places. My degree is in theoretical math, but the only thing anyone has ever wanted to pay me to do is play with computers, and so that's what I do. I'm a software engineer and data nerd at a succession of startups in the area, and this is the story of how we built a system at one of them.

So, how we got here. I was at a startup about 10 years ago, and our existing business was data discovery and analytics for small and medium sized companies who couldn't or wouldn't do it for themselves. Think Corner Bakery, a coffee shop, a contractor. We would crawl the web and find data about the business and identify things that were wrong and help them fix it. And we would also crawl reviews from everywhere and do some analysis on them, things like: the food was good, the omelette was nice, but the waiter was rude, and help them deal with that. Anyways, that was our existing business, but what our customers told us over and over again is that what they really wanted to know is who their customers were. These are small and medium sized businesses again, mom and pop shops; CRM is not in their wheelhouse even if they had the time to try.
At best, think of the contractor we're talking about. They come back from a day in the field and they've got a stack of voicemails, and all they know is the number. What they really want to know is whether you own your home or apartment or condo or whatever. If you own, then you can authorize them to do work and they want to talk to you. If you don't, then to get work done they have to deal with your landlord, which they don't want to do; they don't even want to return your phone call, if we're being completely honest. But they don't know that; all they know is the phone number. Or in the case of, say, the bakery or the coffee shop, there might be a clipboard on the counter where you can put your name and your phone number and your email address if you wanted them to send you a coupon. But that's pretty much the extent of what these people have.

So we set out to build a system to profile individuals based upon the small bits of data that our client businesses actually did have. The phone number, the email address, whatever they had, we would try. One of the founders hacked together a proof of concept, and he used that to drum up some interest and some money, and I was tasked with commercializing the idea and running it at scale. After trying a few different things that were false starts, we decided we were gonna treat the problem as a graph problem: you cut the problem up into small pieces and solve it that way. In this case, each node in the graph is going to be a piece of data: a phone number, a physical address, an email address, a name, a Twitter handle, how much money you make in a year, whether you own your home. The edges are going to connect nodes that we think, for whatever reason, are related, with the probability that we think they are related to each other. How sure are we that this name and this phone number go together? And what we're looking for is what we came to call a graph neighborhood: the enumeration of all nodes that are reachable from a starting point by a path that doesn't exceed some threshold in its weight.

I built a test data set of about 10 million nodes and 50 million edges and started trying it on every system I could on my work laptop, which was not a slouch. We were looking at much bigger than this eventually, but to test, 10 million nodes, 50 million edges. And every graph system I tried would OOM. They're really good at path optimization: from node A to node B, find me the optimum path given weights on the edges. But this path enumeration problem they are not good at, or weren't good at, again, 10 years ago; maybe they've gotten better. At the time, it was not something they were tuned to do. The test bed had about 32 gigs of RAM and they would just blow through it, trying to, I think, prematurely read too much of the graph into memory, and not doing enough pruning on poor paths. In any case, not good.

After weeks, maybe a month, of trying various graph systems, my boss, the VP, comes by my desk and quietly asks, is this thing actually gonna work? Like, do we keep funneling money into this? And I said, yes, I think... I mean, my gut tells me it's gonna work, but my gut also tells me that these dedicated graph engines are not the path forward. And I started playing around in Postgres, because one, I like Postgres, and two, if you squint hard enough, every problem is a database problem, and this one fits. It's a pretty simple schema, kind of what you would expect: you've got nodes, and you've got edges with foreign keys back to the nodes.
Our nodes have types, so that we know which nodes are email addresses and which nodes are physical addresses and which nodes are what college you went to. And we also have the documentation behind all this. We have the sources where we find data, and then for a given source we have an edge source that tells us, specifically in that dataset or that website or that API, what was the call or what was the key that gave us information to suggest that this name and this phone number, or this name and this email address, or any two nodes, go together. That documentation's all stored in the edge sources and the sources.

Now's a good time to point out that all the SQL, DDL and DML you're gonna see today is functional, but it is not precisely what was in the system I built, primarily because I haven't had access to that Git repo in years. And, well, that's really the big one. Functionally, it works; I've actually got it running on my laptop, and if we have time at the end I'll play with it on the screen. We'll see. Anyways: the types, pretty simple, an ID and a name. The nodes: we've got the type referencing back to the types table, and we store a normalized version and a pretty version; we'll talk about that in a second. Then we've got a unique index on the text of the normalized version, and we also have a GIN index on the normalized version so we can find things. Yes. I think that's enough there. Looking at the edges, they're pretty much what we'd expect: you've got an ID, the node it's coming from, the node it's going to, and the probability that we think these two are connected. There's a check that the probability is bigger than zero and less than one. If it's zero, we don't care; there's really no reason to create the edge. And we live in an uncertain world, so nothing is going to get 100% probability of connection; everything's going to be somewhere below that. There's a unique index on the from and the to, so we can't duplicate edges, and then an index on the from and then descending on the probability, so it's easy to find edges out from some node that stay above some probability threshold.

And we do some adjustments on the probability. If you're given a phone number, say 213-555-1234, it's pretty clear what the name for that number is going to be, say John Smith. But if you're trying to build a profile for some John Smith somewhere, the probability that any one of the phone numbers attached to that name is the one you're looking for, that is his phone number, is vanishingly small. So what we do is take the base probability for the relationship and start to weight it down based upon the cardinality out from, say, this name node to nodes of the same type. If you've got 20 phone numbers attached to the same name, they all get weighted down so that we don't walk out to any one of them when trying to build a profile. So you can walk really easily from the phone to the name, and not so easily from the name to the phone.

And this is just the sources and the edge sources; again, pretty much what you'd expect. The source has a base probability, a name, and some URI that identifies it. The base probability for most sources was very low, in the 25 to 30% range, and the idea here is that until we've done some research on a source of data, we assume it's crap and we give the relationships it stores a low probability. Then, if we decide after looking at it that it's better, we start boosting things up. We also take reinforcing evidence from multiple sources and boost edge probabilities up.
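Before getting to how agreeing sources boost edges, here is a rough sketch of that schema; the table and column names are my assumptions, not the talk's actual DDL:

```sql
CREATE TABLE node_types (
    id   int  PRIMARY KEY,
    name text NOT NULL UNIQUE
);

CREATE TABLE nodes (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    type_id    int    NOT NULL REFERENCES node_types (id),
    normalized jsonb  NOT NULL,
    pretty     jsonb  NOT NULL
);
CREATE UNIQUE INDEX nodes_normalized_uq  ON nodes ((normalized::text));
CREATE INDEX        nodes_normalized_gin ON nodes USING gin (normalized);

CREATE TABLE edges (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    from_node   bigint NOT NULL REFERENCES nodes (id),
    to_node     bigint NOT NULL REFERENCES nodes (id),
    probability real   NOT NULL CHECK (probability > 0 AND probability < 1)
);
CREATE UNIQUE INDEX edges_from_to_uq   ON edges (from_node, to_node);
CREATE INDEX        edges_from_prob_ix ON edges (from_node, probability DESC);

CREATE TABLE sources (
    id               int  PRIMARY KEY,
    name             text NOT NULL,
    uri              text NOT NULL,
    base_probability real NOT NULL
);

CREATE TABLE edge_sources (
    id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    edge_id   bigint NOT NULL REFERENCES edges (id),
    source_id int    NOT NULL REFERENCES sources (id),
    uri       text   NOT NULL,
    UNIQUE (edge_id, source_id)
);
```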
So if five sites tell us that John Smith's phone number is that 1234 number, then that edge gets boosted up, because five independent sources telling you something is better than one. Even if they're all generally bad sources, if they all agree on something, we hope it's true. And then the edge source gets the probability from the source it's coming from, and a URI that identifies where specifically in that source we found it: what page we found it on, what key into a data set, what key in a request to an API, whatever it is, that's where that gets stored. Again, the edge and the source are unique together; we don't wanna duplicate data, and we're in a database, so we can do things like that.

Yes, ma'am. Some of the people I worked with at the time were big fans of surrogate keys. Oh yes, she was asking why the surrogate key if we have a unique key on the edge and the source, and the answer really is that some of the people I worked with at the time were fanatical about surrogate keys, and so here we are. Local standards is, I guess, the answer to that question.

Taking a look at a node specifically: for any given data type, we started looking at how we could chop it up to represent it in JSON, which we store as the payload in the node, and then we would normalize it to try to make it unique within the set and easy to find. Take this one, the Pasadena Convention Center, 300 East Green Street, Pasadena, California 91101: we've got the pretty version here and we've got the normalized version here. For addresses specifically, we took the normalized address structure from PostGIS, because we were already using it for other things. We added the plus-four because we wanted it, I think they have that now but at the time they didn't, and we added latitude and longitude. I had this unrealized dream that we were gonna do lots of things with latitude and longitude, but really it just got used to put points on a map. You'll note the normalized version is generally uppercase and abbreviated and all the fluff has been wrung out of it; this is what's unique within the graph, and the pretty version is what we're gonna return at the end.

The end result, the goal here, is: given, say, a phone number, return a JSON document with all the information we could find to profile that individual. Each node stores some piece, some fragment, of that JSON document, for instance an address, and at the end we're gonna take all the nodes that form the neighborhood around our starting point, merge all the JSON together, and return that document. The whole hope here is to return a JSON document as the result of an API call. And so here we have the resulting document, or, well, the query that generates the resulting document. It's a recursive CTE, a common table expression, which is where SQL says WITH and then a name and possibly some structure like this. I've highlighted the bits that are key here. One, this is a recursive query; we'll talk about that in a second. Well, actually, we'll talk about it now. Recursion in Postgres: you have to declare the CTE RECURSIVE, and then you have the two sides of a UNION. On the top here we're finding our starting point; in this case, 626-555-1234 is a phone number, and we're doing JSON containment to find that. Then you have the UNION with a join back to the CTE itself. In this case, we're finding edges out from nodes that are already in the neighborhood to other nodes, where the probability of the edge keeps us above 0.45, 45%.
Other things to note: we're keeping an array of the IDs we've seen as we work our way through the graph. We use this down here to avoid creating any cycles; you don't wanna loop back to yourself. And then at the end, once we've recursed through and we've found all the nodes we're going to find above 45% on a path, we take the node ID, the pretty version, and then the maximum probability we saw for a path to that node from our starting point. There are likely multiple paths in here for a given node, and we're just interested in the best one we found. I played around a lot with trying to do better early pruning than this, and everything I did was clunky and this got the job done. So some duplication, and then we do some grouping at the end to sort of wring the duplication out and give us the unique set of nodes. And that's really it, or most of it. Oh, other things to note, sorry, let's see. We have our node starting point, but after that it's just the edges and then the neighborhood CTE, which is in memory. Unless it flushes to disk. But anyways, we'll say it's in memory. So we don't need to deal with the node data specifically at all, which is good. It's about as tight as we can get it.

So, the hardware this thing ran on. My father once told me that one sign of a good job is that they buy you nice shiny toys to play with, and by that metric, this job didn't disappoint. When they asked me what I was gonna need for hardware, I said something along the lines of as much RAM as you can cram into a box, a lot of CPU, and a whole lot of storage, and some of that is gonna have to be on SSDs. And I didn't think I was gonna get it all, but what ended up getting delivered on a pallet was two servers, each with 512 gigabytes of RAM, which led to some uncomfortably large settings for work_mem, but good uncomfortably large settings for work_mem. 28 Xeon cores, several terabytes of SSD, and about four times that in spinning disks. We formatted everything with ZFS. We put the indexes and all the small table data on the SSDs, and then the big tables, nodes, edges, edge sources, had their data on the spinning disks, but their indexes again were on the SSDs. We used ZFS primarily because we were used to it from our existing database servers and we were comfortable with it, and we kind of built our backup strategy around that. We would do physical replication to a hot standby, and we would quiesce that periodically, take a ZFS snapshot, and then ship that offsite, where we would apply it to another smaller server just to make sure that we actually got a nice working server out of the result. That's not really relevant to this specifically, but always test your backups; if you can't get a good working server out of them, they're not really good. Anyways, a nice beefy box, two nice beefy boxes, and nice shiny toys.

And that's the MVP. At this point we're working just with my synthetic dataset, 10 million nodes, 50 million edges, and it mostly works. We get rid of the synthetic data and we start soliciting real data from our clients, which usually goes something like this. We get a request at the beginning, that's the data from that clipboard, or later on it's integrations into things like phone systems, email systems, watching their Facebook profile to see if people DM them or talk to them, Twitter, whatever. Any connection, any communication from an individual to the company, we try to grab onto it and build a profile based upon it.
So a request comes in, and it goes into an initial discovery, which first looks at some local datasets. We had primarily two: we had one that was every residential phone number in the United States with the name and physical address for it and some demographic information, and another local dataset was every cell phone in the United States, again with a name and some information attached to it. We hit those first because we have them stored locally and it's quick and easy. And there's a couple of APIs we reach out to, and all that is sort of our initial first pass at what we can get quickly for the data we've been given. We throw all that into a crawling queue, and we also throw it into a document queue down here to go into the graph. The graph populator takes that document, which is actually in the format of our result, the JSON document result we want in the end, chunks it into pieces, and starts creating nodes, or finding nodes if they already exist, adding or boosting edges, doing all that probability leveling we were talking about before, and throwing the data into the graph. The crawler comes over here to the web crawler, and that's just going out into the web and finding everything it can based upon the information that fell out of initial discovery. So maybe we started with a phone number and we were able to get a name and a physical address, and now we've got a little bit more to latch on to as we search the web for data. And all that ultimately goes into the document queue again, gets chunked into pieces, and flows into the graph. And so we start shoveling data in pretty much as fast as we can.

And about the time we hit 500 million nodes, the sharp edges start to show. So at this point we're at 50 times the size of the initial dataset we were testing with, and it's real data, and real data has fun real problems with it. Quickly, the problems that showed up, and I'll go into each of these in detail shortly: that worker that maintains the graph, that takes the document data, chunks it up, and starts to manipulate the graph to put it in, started to slow way down, and we ultimately figured out why. There were some sporadic reports of cross-profile contamination; you're building a profile for George and you get some data about Sally as well, which is not ideal. And then someone came along and said that they wanted to change some of those data normalization rules that govern how we represent the data in the graph.

So, the graph maintenance performance. It all came back to this, which is the query to figure out how many phone numbers we already have for John if we're adding another one. If we have a thousand known phone numbers for John and we're adding the thousand and first, the query to figure out how many of them are already there turned out to be very expensive. And it's all about joining the edges to the nodes, looking for nodes of a particular type, and doing a count. In most cases it was fine, but for large cardinality sets it was not fine. We tried boosting work_mem for the worker, and it helped, but not enough. And so ultimately what I ended up doing to fix this was I created a function that I called node_to_type that just goes out to the nodes table and grabs the type. And if we lie and say that it's immutable, and we'll come back to that in a second, then we can create an index on the from node ID and the type of the to node, the node that we're walking the edge to.
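Roughly, that fix looks like this. The function and index names are my reconstruction, and the slow version is just the join-and-count I described:

    -- The expensive version: join edges to nodes just to count, for example,
    -- how many phone nodes this name node already points at.
    SELECT count(*)
    FROM edges e
    JOIN nodes n ON n.id = e.to_node_id
    WHERE e.from_node_id = 12345              -- the "John Smith" name node
      AND n.type_id = (SELECT id FROM types WHERE name = 'phone');

    -- The fix: a tiny helper that looks up a node's type, declared IMMUTABLE
    -- "by fiat" so it can be used in an index expression.
    CREATE FUNCTION node_to_type(p_node_id bigint) RETURNS bigint
    LANGUAGE sql IMMUTABLE AS
    $$ SELECT type_id FROM nodes WHERE id = p_node_id $$;

    CREATE INDEX edges_from_node_totype
        ON edges (from_node_id, node_to_type(to_node_id));

    -- Now the cardinality check never has to touch the nodes table at all.
    SELECT count(*)
    FROM edges e
    WHERE e.from_node_id = 12345
      AND node_to_type(e.to_node_id) = (SELECT id FROM types WHERE name = 'phone');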
And the point here now is that I can quickly, with an index scan, get the count of edges out to phone nodes from wherever I'm at, rather than joining to nodes and going through the rigmarole with the large set. It's just a partial index scan. You do have to tell Postgres that the function is immutable, because you can't build an index on it if it can change. Which is smart, I mean it's good. It's just that in this case, well, let's just say it's immutable by fiat. I say it's immutable. I say that if you change the type of a node I'll be very unhappy with you, and then I don't give anyone access to do it. Technically you could go in as the postgres user and change the type of a node. But realistically, if a node is created as a phone number it will always be a phone number. Or it might fall out of use entirely, but it's never gonna turn into a Twitter handle. It will always be the thing it was created as. So we can safely tell the little white lie that it's immutable, build the index, and all of a sudden our performance problem goes away. It's almost like magic. I'll take your word for that. And again, this is me recreating this after the fact, trying to get something to work now. I couldn't tell you precisely what the function looked like at the time. I do know this works, fair enough.

All right, cross-contaminated profiles. Originally this looked self-inflicted, and I just said that's the salespeople, they can fix it. What brought it to light was that the salespeople would load their own data into the graph, so that when they went out to a client site they could run a profile on themselves and say, look at this wonderfully detailed profile we have. And they loaded up all the data they could think of about themselves with insanely high probabilities, like 99.999999999%, and they included everything they could think of, including income data. They had their salaries at a highly detailed level, like down to pennies for the year. And it turned out that two of the salespeople made exactly the same amount in a year, to the penny. And so it was very easy to walk into that shared node. You have to understand that most of the income nodes in the graph were rounded, to the thousands or ten thousands of dollars, right? We're looking for buckets to place people in and find how much they make, not that they make $100,593 and 17 cents for the year. So it was very easy to walk into that shared income node, back out to data about the other salesperson, and merge the two profiles.

So like I said, it initially looked self-inflicted, but it did turn up a larger issue, which is to say, nodes that have edges out to a low cardinality set of nodes, but still more than one. A great example of this would be me. My name, as I said at the beginning, is Mark Breacher. There are, as far as I know, two of us in the United States. There's me and there's a professor in Ohio, and that's really it. So if you're building a profile on me from my phone number, it's very easy to walk into the Mark Breacher name node and then walk back out and start pulling in data about the professor. Assuming all the edges are high enough probability, that's not that hard a thing to do. So okay, it started out as self-inflicted, but now it looks more real. It's a legitimate problem. We ended up solving it by giving types an identifying attribute, which is to say, is this type of data identifying of an individual or not?
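In table terms that's just a flag on the types table plus one extra predicate in the recursive walk. A sketch, with the column and type names assumed, of what I'm about to describe:

    -- Flag which types identify a person (names here are illustrative).
    ALTER TABLE types ADD COLUMN identifying boolean NOT NULL DEFAULT false;
    UPDATE types SET identifying = true
    WHERE name IN ('phone', 'email', 'address');

    -- Then, in the recursive member of the walk, carry the type along
    -- (node_to_type() is already in the index) and only step OUT of
    -- identifying nodes; anything can still be walked INTO:
    --
    --     SELECT e.to_node_id, node_to_type(e.to_node_id), ...
    --     ...
    --     AND nb.type_id IN (SELECT id FROM types WHERE identifying)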
And we're gonna say that things like emails, phone numbers, Facebook handles, Twitter handles, physical addresses, these are identifying-ish. Which is to say, given one of them, we're identifying a person, or maybe a household in the case of a physical address. But everything else, how much money you make, what your name is, what college you went to, these things are not identifying. And then we're going to modify that recursive query so we'll only walk out from nodes that are identifying. You can walk into any type of node and pull it into the set, but to walk out, the source has to be an identifying node. And that query then looks like this. I've highlighted the changes. We're picking up the type ID as we start, and we're picking up the type of our to node as we progress our way through the graph, because we already have that in the index. And then we're restricting ourselves to type IDs that are in the set of identifying types on the types table, and that's enough. That keeps us from walking, yes. Oh yeah, sorry, positional: we're grouping by the first and second columns and then our aggregate is the third, yeah. So these changes are enough to keep the graph walker from walking into shared non-identifying nodes and polluting the resulting document, sorry, the resulting profile.

And then finally, remember, somebody wanted to change some of the normalization rules. This one is tricky because we use the normalized version to enforce uniqueness in the graph, and we've now got about 500 million nodes, half a billion nodes. And there's one of two possibilities. You're either going to make the normalization more restrictive, and now two nodes that are currently unique might no longer be unique and they're gonna collide in the graph, or you're going to make the normalization more permissive, and data from multiple sources that currently collapses into a single node may really represent multiple distinct nodes. Either way, I couldn't find a way to do it that wasn't gonna be a stop-the-world event. Pretty much I have to dump out everything of a type, renormalize it, and load it back up again. And if I try to do that while the plane is flying, well, I won't be able to find any of those nodes to start a crawl and I won't be able to walk into or out of them. It's just a bad idea. So I talked them out of it.

Yes, this one. I don't think so. If I'm being honest, I think the original version of the walker query grabbed the set of node types that were identifying as it started up and just spliced them statically into the query. But this is easier to actually figure out what's going on, instead of just saying like one, three, seven and 12. I actually haven't profiled it with the IN versus the join. Yeah, on my laptop at the moment there's a very limited set of data, so everything is really fast. In this case, I would have profiled it on real data, but again, yeah. So as I said, I ended up talking them out of it, loudly and at length. There may have been some threats, but the problem went away. They decided they didn't really need to change the normalization rules. And so it works and it keeps on going. It keeps on growing.
At peak, it had about one and a half billion nodes, with about eight billion edges and 20 billion sources for those eight billion edges, and thousands of sources below that we were pulling data from, correlating, sifting to produce profiles. Yeah. Okay, it works, but should I have built the thing in the first place? Yes, same hardware. Now, a couple of things here. ZFS compression helps us because we're able to trade a little bit of CPU, which we have a lot of, for a decrease in IO and in storage. But yeah, the same servers that we started running it on, it grew to this size on them. Once it was moving smoothly, I started spending most of my time trying to think about what we were gonna do when we couldn't fit it all on one server. Because, well, because I don't like having a ceiling on my growth. The hiccup there was really gonna be that once you shard the data, say to Citus or something similar, that recursive CTE is no longer an option. You're gonna have to do more work client side and have data going back and forth, and things are gonna slow way down. But that was actually the focus of my work once things were moving smoothly.

But, and this is the foreshadowing from earlier, about this time the company ran out of money and closed the doors. So it stopped being a problem I had to worry about. I had other problems to worry about. Like, I built this system to take everyone's private information, well, public information, but identifying information, and sift it and correlate it together and build these nice detailed profiles, and I no longer have control over it. We are talking, at the core, about mass harvesting of personally identifiable information and pulling it together to profile people. GDPR was not yet a thing, and we weren't in the EU anyway, but still. CCPA definitely wasn't a thing yet. Although I would say that trying to do this today would be a very different story. Mass harvesting of PII and storing it is problematic, and because of the way we're representing the data, removal requests get problematic as well. So I wouldn't wanna try this today, if we're being honest.

But my big problem when they shut the doors and laid everybody off was what's gonna happen to this tech, right? Like, I built this thing and it works and it works really well, and now somebody's gonna be able to use it for whatever. Thankfully, there's a bit of a silver lining there. The founder managed to hold on to some of the IP and eventually rolled it into a company to help individuals find and deal with their PII that's scattered around the internet. So that's the silver lining here, the moderately happy ending, but still, yeah. While I was working on it, I was myopically focused on the technical challenge. I've got a problem to solve, I can do this, I'm smart enough, blah, blah, blah. And it was only when things started to fall apart that I started thinking about the longer-reaching philosophical questions. Anyways, wow. I guess the caffeine is having its effect. I got through that in about 30, 35 minutes instead of like the 40, 45 I was counting on. So this is where we take questions. All the SQL from the slides is in Git, at that address; the QR code will take you there. An older version of the slides is in Git, and this version of the slides will be in Git once I get a chance to export it and push it up there. But all the data is there for your interest. Anyways, questions, comments, anything? Yes, as with many things in life, it depends.
It really depends on how big that graph neighborhood ends up being around a given node; generally sub-second. Yeah, generally. There were some profiles that went out to like 10, 15, 20 steps on the graph. Yeah, well, even then it's not so much the depth as how much data you're pulling together at each step that you're then joining back out from at the next one. I mean, generally, we didn't hit any diabolical edge cases, so I didn't dig deeply into how much it was recursing. Yeah, no. The performance problems were all on the graph construction side. Querying the graph generally just worked once it worked.

Yes, sir. Probability is calculated and maintained separately, and then the recursive CTE is compounding the probability of the edges as it walks through. So here we go: the new probability is the probability of the path thus far in the neighborhood times the probability of the edge we're about to walk out on. And then we're restricting such that the probability of the edge we're about to walk will keep us above 0.45 on the aggregate path probability. So we don't consider nodes that would take us below our threshold. We don't consider paths, sorry, we don't consider adding edges to nodes that would take the path below our threshold. Does that make sense? So, the first time we add an edge between two nodes it's the base probability from its source. As we add more reinforcing evidence to that edge, then we're boosting it based upon the probability it had and the probability of the source we're boosting it from, and then we're doing some of the weighting down based on the cardinality out to nodes of the same type. So there's kind of a balancing going on there: boosted as we find reinforcing evidence, weighted down as the cardinality goes up for a type, and things kind of settle out, so that, again, it's easy to walk from, say, a phone number to a name, not so easy from a name to a phone number. That sort of thing is happening as we load up edges and we load up nodes. And then the boosting from reinforcing evidence happens again as we're adding an edge. So if we came along and we said, oh, the edge already exists, then we would say, oh, it exists at 40%, so we're gonna boost it based upon this new source that we just found. Did that help? Okay, yes sir.

No, I mean, it's more limiting. It actually improves things, because it's limiting the paths that the graph walker's willing to consider. It's not willing to walk out from anything that isn't identifying, so there are fewer possibilities. I was afraid that the profiles were gonna get leaner and less full-featured, but I was willing to accept that so long as the profile didn't merge five people together. I mean, a lean profile that is accurate is better than a nice full-fledged, full-featured profile that is worthless, I guess. If that, yeah, all right. Well, this has been graph walking for fun and, it turns out, no profit, but I had hopes. Anyways, thank you.

Check, check. How is this? Do we have to do anything? Is that on the speakers? I can't tell. Can you hear anything? I don't think it is. Uh-oh. Oh, it is? Okay. Can you hear me, Frank? I can. Do we have to do something magical to make the microphone work? It's just not turned up very much. Well, if it's not, actually, that might be it. Is my voice carrying in the room okay? No, not really. I can always switch over to the other one if this doesn't seem like it's loud enough.
Just try it and see what it sounds like. Sure. Okay, well, all right. So this is the wireless microphone right here, talking on this one. And this is, oh, that's a lot different. That is a lot different. Okay. So we'll see if I can get this thing to stay on. Then, ooh. It's also not staying on my ear now very well, not like it was earlier. Yeah, it's turned on right now. And it's not very loud. Yeah, yeah. It is turned on right now. I'm speaking into the microphone. It is in there, it's in the speakers, it's just not very loud. Is this gonna mess up your audio recording, though? Do you have to? Got it. Got it, great. That's better. Now I can hear it bouncing off the walls and coming back at me, so it must be turned on. Yeah, so it would be too high. Yeah, yeah. Let me keep talking, that way you can decide. They're setting the volume on the wireless mic, y'all. We're just testing here. Testing. I have to keep talking. I'm not a good person for this; I don't have jokes I can just pull out of my pocket and tell. All right, yeah, sounds good to me. We'll give this a try. Thank you. Thank you, AV people.

All right, let's get going. It's 12:30 and you have not eaten lunch yet. And I didn't really think about that when I put a giant picture of pie on the front slide of my presentation. So for those of you who celebrate Pi Day, this is for you. The rest of you probably hate us right now, but I did get something for the rest of you too. I got you this. Somebody posted that on the Slack channel at work this morning; I couldn't resist grabbing it.

All right, but we're not here to talk about Pi Day. We're here to talk about Postgres, which needs no introduction. However, you need an introduction, because this is my first time at SCaLE. So I'm kinda curious a little bit about all of you. Real quickly, show of hands: how many of you would consider yourself new to Postgres? I'm curious. Okay, a few people new to Postgres. How many of you work with Postgres in your job, show of hands? That's lots of people, cool. I'm kind of curious, how many of you would consider yourself responsible for a Postgres database? I mean, as in, if you kind of write some SQL queries, you build an app, but somebody else kind of takes care of the Postgres database, that's not what I'm asking. I'm asking if you're kind of the person, whether it's making an API call or whether it's racking a server, whatever it is, for getting things wired up. If you're responsible for a Postgres database, raise your hand in some way. Cool, that's kind of interesting to see. And then two other things I was sort of curious about. I'm curious, like I said, since SCaLE is a new event to me, is there anybody else here that this is your first time at SCaLE? Cool, wow, that's really neat to see. Lots of people. And one other thing I'm curious about: I saw how this is set up with almost like all these mini events inside of it, which is really neat. I'm curious how many of you think you're mainly or only going to attend sessions in these two Postgres tracks? So raise your hand if it's that, versus going and mixing all around. I'm just kind of curious how this works out. Okay, so like a mix, yeah. Actually, I'm seeing a lot of people who didn't raise a hand, so I think that means a lot of you are going to multiple tracks and checking into all kinds of things. Yeah, that's really great.
I love the way they've set this conference up. This is really cool. Thanks for humoring me there. All right, so anyway, like I said, we're here to talk about Postgres, which I don't think needs a lot of introduction, especially based on this quick survey. One thing that Postgres is pretty well known for is standards compliance. It's one of the reasons that it's a popular migration target from a lot of other databases. But there's one thing that the SQL standard doesn't really talk about very much, and that is performance. That's not to say that there isn't this whole field of database performance, and that there isn't actually decades and decades of thinking, of research, and an awful lot of work that's been done around database performance specifically. This goes back pretty far. A paper that I've heard referenced a few times is this one from Robert Miller, Response Time in Man-Computer Conversational Transactions. This is 1968, y'all. Now this is kind of more systems performance kind of stuff, but it very much lays the groundwork for a lot of the thinking around database performance as well. And then I also grabbed a printout of this 1969 page from the IBM System Management Facilities manual, which also digs into this a bit. Now that's kind of systems performance, but one other thing that I like to point out that I think is kind of neat: Brendan Gregg is fairly well known for a lot of his work around flame graphs and perf and eBPF and Linux systems, things like this, but Brendan has this page on his website where he talks about the USE method, which is a method that's been kind of more server centric: utilization, saturation and error rate, some things to watch for. But on his page here, you'll see that he makes reference to something. He says the USE method might find like 80% of your server issues, but latency-based methodologies, for example Method R, can approach 100%. If you're not familiar with Method R, what he's referencing is something that came out of the database performance field. It's not something that came from systems. Like, again, this is Brendan Gregg here referencing something that came out of the database performance professional field. There's a lot of really interesting stuff here.

Now, I sort of threw this in this morning. I actually kind of sketched this out on the airplane on the way here, but it was just some thoughts that have been percolating around in my head. I feel like over the past few years, I've actually been around database performance stuff quite a bit more than I had previously in my career. And I almost feel like a lot of database performance could be split into two different areas. There's, on the left side, this idea of evaluating software, and on the right side, responding to problems. Now, super interesting stuff; I'm very interested in all of this. For today's talk, I'm actually going to be focusing more on this right-hand side, responding to problems. But there is a lot of overlap. On the software and hardware evaluation side, you could read Jim Gray if you wanted, for some really old stuff. A lot of times you're focused on things like price-performance, and performance there is usually talked about in the canonical way, in terms of throughput. But throughput, and I've looked at a lot of charts over the past few years, throughput is directly a function of your response time, or your latency.
And response time is the thing that we wind up focusing on over on the other side of the chart, when we're trying to respond to problems, when users have bad experiences, right? When Jim in accounting goes to the CEO and says, you know what, the payroll process every single week has taken five minutes longer than it took the week before, and on the track that we're on, within a few weeks we might not be able to get payroll done by Tuesday morning, and we're gonna have trouble, and if we can't get paychecks out, then everybody's paying attention, right? User experience focuses on response time. The challenges there, which I jotted down at the bottom, are very different from the challenges on the left. On the left, your challenges are usually around engineering software and hardware, trying to figure out how to get the most scalability out of a system and out of an architecture, and then efficiency to keep the price reasonable. But on the right, a lot of times the challenges are actually more like coordinating across multiple departments. It's things like, okay, well, the storage guys are telling me all of their health metrics look good, and the network guys are telling me their health metrics look great, and the operating system metrics too. And then not only that, you have multiple experts, right? Who each have a list, and somehow each of their lists is in a slightly different order, and you've got to figure out whose list to use and which way to go. So this is kind of some evolving thinking. For today's talk I'm gonna be thinking primarily about responding to problems when I talk about database performance.

Now let's take it back a little bit. I realize I come to a Postgres conference like this and there's a whole bunch of people that I feel privileged to hang out with who have been doing databases decades longer than I have. At the same time, I go to work every day and I realize I get to have the pleasure of hanging out with some amazingly talented and smart people who are recently out of school and have tons of energy, and I'm realizing I'm not always the young guy in the room anymore either. And I do remember, I sort of started around software in the 90s and I started with databases in the early 2000s, I very much remember the era when the main thing that we had was counters. We often focused on ratios, we focused on percentages. This is what we had to work with, especially in the 90s and early 2000s, for operating systems and also for databases. And the golden metric in the database world was the buffer cache hit ratio. That was the one that I feel got the most attention, and there was a strong rule of thumb: your buffer cache hit ratio is the thing, and if that's not over 90%, start adding memory or figure it out. Now, us DBAs back in the 2000s, we weren't a dumb bunch of people, we're pretty clever, and we came up with some great solutions to this. I'm pretty sure this is still out on Connor McDonald's website: he published the "choose any hit ratio" script, which you can run on your database and it will bring the hit ratio to whatever you need it to be to make your consultants happy.
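For what it's worth, that golden metric is still one query away in Postgres today. This is just the standard counters from pg_stat_database, nothing specific to the talk:

    -- The classic buffer cache hit ratio, per database.
    SELECT datname,
           round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) AS cache_hit_pct
    FROM pg_stat_database
    WHERE datname IS NOT NULL;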
Okay, so more seriously though, there's nothing wrong with ratios and counters and using this kind of an approach, but it's not enough. And while out in the industry we were doing this, there was a group of people off in Redwood Shores, kind of in a back room, hacking on database kernels, who were starting to get very frustrated, banging their heads against a wall, and really running into the limits of what you could do with just counters and metrics. And I circled some of the years just so you can get a sense of the timeframe here. This is like 1991 to 92, when this team is starting to run into these barriers. The way I've heard the story told, this was an act of sheer desperation. It wasn't something that anybody set out to do, but with no metrics that would lead them to figure out where they were hitting bottlenecks, as they were trying to push the system for more throughput, they finally just went in there and traced everything. They're just like, just trace everything, start adding all this plumbing in there. And for much of the 90s, there actually ended up being a good deal of effort in this particular database kernel to add a whole bunch of instrumentation and plumbing around tracing at a very low level. This started to lay the groundwork for something that really changed the entire field of database performance over the coming decades.

So the plumbing was getting in place in the 1990s, and then the 2000s were kind of a decade of dissemination. I pulled these quotes out of this one book that I like because it's just a bunch of stories, but there's a whole bunch of things you can pull up that were written over the 2000s. And there's a lot that happened, but there are sort of two key things that I think are worth highlighting that everybody should really know about. The first is the application of queuing theory to database performance, and especially a focus on response time. And I think one of the things to realize is that part of why this took so long is that there was a very persistent belief for a long time that there wasn't any objective way to actually measure response time. There was a lot of concern, maybe not unfounded, about measurement overhead and measurement intrusion, right? So it took some time, and it took Oracle kind of proving that it could be done, really in the 90s and 2000s, that if you did it right, it was possible to do very low level tracing and measurement of response time information without having any discernible overhead on the system. So that's step number one: the focus on response time, the focus on tracing, and the application of queuing theory. And the hallmark of this is anytime you see R equals S plus W, that's kind of a hallmark that somebody's talking about queuing theory, and the R in there is for response time. The second big shift that happened was sampling, and this came right on the heels of the first. In particular, there's this idea of active session sampling. In Postgres terminology, you could think of this as a connection to your database; an active connection is one that's currently running a SQL statement or doing some kind of a command, right? So there was this, it sounds really simple.
It turns out to be very transformational to take this approach of taking, at a regular interval, samples of what's going on, just what's currently active on the system right now. Let's take a snapshot of it, on regular intervals. So this was the second thing. I stuck a couple of books up here. If you wanted to go read something, I wouldn't necessarily go read these, because there's actually newer stuff that you could go read instead. But I put these up because these were two of the ones that I remember getting a lot of circulation, really changing people's perspectives about how they approach database performance. And the thing I want to point out is, again, the timeframe. This is 2003 and 2004. That's 20 years ago now. It's been a long time.

So what about Postgres? At this point, I have now set the stage and I am ready to introduce the star of the show today, the main character of our story, which is: wait events. Okay. Round of applause for wait events. So this has actually been all over the slides I've been talking about. This is the IBM system management framework, sorry, System Management Facilities documentation I was talking about, and this already had the concept of tracking system wait time and CPU time, which was sort of your foundational response time and foundational queuing theory stuff. But not only that, these pages that I had pulled out of Cary Millsap's book had wait events mentioned over here. And when you talk about that formula, I didn't spell out the S and the W, but the basic formula of queuing theory is response time equals service time plus wait time. And that wait time, again, correlates directly to what wait events are. So there's a whole bunch of mathematics and a whole bunch of theory that you can build up on top of these basic formulas and basic ideas if you do it right.

On active session sampling, I had this slide up, and oh, something I should have mentioned earlier: another sort of key piece was the visualizations here. There's a picture over on the left, and that visualization, in fact, John Bresnewitz sent me this picture of his notebook and he said he's pretty sure these were his notes from the meeting when they basically came up with the formula for average active sessions, a bunch of people sitting around a conference table somewhere just hashing out ideas. And in those visualizations, the different colors again represent different wait events. So I'll get into some more details as we go, but wait events have been kind of all over all of these slides I've been talking about, and wait events were in fact the thing that was added by those frustrated kernel engineers in Oracle version 7.0.12 back in 1992. And they're exposed in Oracle's equivalent of pg_stat_activity, what you would have in Postgres. And I think this is really important context, because when Postgres introduced wait events, it didn't happen in a vacuum. We were building on concepts that were much larger across the industry, and a lot of theory and a lot of thinking that existed across the industry already as well. So what exactly is a wait event? Why do we call them wait events? I have a quote from Gopalakrishnan's book, which is a good book for a theoretical framework for all this, but I'll simplify it a little bit.
The high-level idea behind a wait event is that you're separating things out: there's CPU and then there's waits, and a wait is any time that you're not on the CPU. That's the core idea behind it. There are some caveats; it's not perfect. It doesn't necessarily need to be perfect to be able to use it to do a lot of really powerful things. But a couple of general rules of thumb. You're generally thinking here in a single dimension, so this isn't about stacks; you don't have waits embedded inside of other waits. If you were thinking in OpenTelemetry terminology, this would be a span, not a trace, if that helps. It does fit into a larger telemetry or application performance framework, but wait events themselves best fit within the paradigm of one process operating at one tier, right? Just with this particular model. It does get interesting when you start looking at distributed systems or multi-tier systems and trying to figure out how to visualize wait events. It's generally not something you use for CPU profiling; typically, CPU is left as just CPU. Things like perf and eBPF turn out to make it pretty easy to troubleshoot CPU things. When you're off the CPU tends to be when it gets a bit trickier, and also when the system starts exhibiting different kinds of properties, which is why this kind of theory and this kind of approach works really well.

So it's not as though Postgres has been oblivious to all of this. The earliest email I found was 2009, and I'm sure there are emails even older than this, talking about sampling approaches and profiling approaches that really demonstrate an awareness of what was going on in the larger database performance industry at the time. And you can see lots of emails like that. The Postgres community has tracked with this and has been very aware of what's been going on in the larger industry. All the way back in 2007, and this is kind of a fun one, Jonah Harris was one of the key people in implementing a wait interface for EnterpriseDB's product that's a derivative of Postgres. Strangely, the EDB marketing team did not take Jonah's suggestion to name it Margarita. I like it. They wanted something a little more boring, so they named it DRITA. But that goes back pretty far.

All of this being said, as of the year 2016, this was still the state of the world in the open source community, upstream Postgres. When you looked at a single connection coming into the database, the only thing that you knew was this one boolean field that said waiting, and it would tell you if it's waiting on a lock. And that's not a wait event. That doesn't tell you if you're off the CPU versus on the CPU, or give you any kind of information about what you're waiting for, which is the fundamental concept of a wait event. If you're waiting, what are you waiting for? Are you waiting for a lock? Are you waiting for an I/O? Are you waiting for something else? There's a whole bunch of things it could be. So this kind of sets the stage. Now, I have taken Robert Haas's email to the hackers list, and I think I've presented this in an appropriate way, because the Postgres community finally did get fed up with always using debuggers and tracing tools on production databases to diagnose their wedged systems. So it took a little while.
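Just for reference before the history: on any 9.6-or-later system, that same view now answers the "what are you waiting for" question directly. This is plain pg_stat_activity, nothing exotic:

    -- What each active backend is doing or waiting on right now.
    SELECT pid, state, wait_event_type, wait_event, query
    FROM pg_stat_activity
    WHERE state = 'active';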
So, June 22nd of 2015, Robert Haas put this email out, which was his RFC to replace that waiting column I just showed you with something more descriptive. And eight days later, on June 30th, Amit Kapila dropped the first patch. In sort of typical Postgres fashion, there were nine months of testing and iterating and documenting and collaborating. There were developers from multiple companies and multiple continents involved in this. The patch went through about 20 iterations before it finally got committed, and that first patch went in March 10th of 2016 and landed in Postgres 9.6. Now, I sort of consider Postgres version 10 to be really the first complete implementation of wait events, because there was still a lot of stuff missing. In particular the I/O events, which are pretty key, and those came in Postgres 10. But back to 9.6: one of the things that's beautiful and that I love about this is that the first patch was 200 lines of code, and the second patch was, it's in my notes here, 800 lines of code. And to me, that really speaks to how elegantly this was done. For what this is doing, adding instrumentation of every single lightweight and heavyweight lock and all the buffer pins, and auto-generating the names, to do all of that for a code base the size of the Postgres code base in only 1,000 lines of code is pretty stunning. It was a very nice patch. This is pretty classic Postgres, too: iterate, put the first core bit out there, let it bake, and then add the next layer in the next major version. You see a similar pattern with things like partitioning and logical replication and all of these things.

It didn't stop in version 11. It has continued on, and at this point wait events are pretty much a core part of the database that is continually developed along with all the new features that come in. I highlighted two things just to point them out. In version 13, there was a rename of a couple of wait events. So if you interact with wait events, you might notice if you're looking at a version 12 database, you'll see the snake case, and some of them they just converted to camel case. Not a major rename, a consistency type of thing, but it can still look funny on a console or a dashboard if you're not used to it. Then in version 17, which will be coming out, what's coming is the ability to do custom wait event types. So I expect that we'll start seeing a whole bunch of new wait event types appearing that we didn't have before. There are some gaps that you can bump into after migrating to Postgres if you're used to working with a wait interface or with wait events on another platform. I will say, Postgres 17 is getting to the end of the development cycle; I think the last commitfest just wrapped up, if I understand that correctly, or it's just about to wrap up. So we're getting real close to knowing what's gonna be in PG 17. So I'm just gonna think ahead to PG 18, and I'll say, on my personal wish list, if there were two things we could work on, it would be great to start thinking about how we can address, okay, this is just kind of my opinion, and other people will prioritize differently. Number five, though, is one that I feel like I've seen frustrate a lot of people, and it's the commit statement.
The issue is that the commit is the point where you often have to wait, and so it's easy to imagine a lot of different scenarios where the commit ends up being a bottleneck or the thing that you end up waiting for. The trouble is that when you go to see what the SQL statement is, it just says COMMIT, and you're like, well, that's not helpful, which commit is it? My application could be doing any of a million things here. It would be really helpful. With most other SQL statements, you can look at the SQL statement and you're like, oh, I know, or I can pretty easily figure out where that's coming from in my application. Some people even do clever tricks like putting a comment at the front of their SQL statement that tips you off toward where in the source code the SQL is coming from. But commits, you can maybe do comments on commits, I've never seen people do that, and if you're using the autocommit protocol you can't even do that. That's one thing I'd love to try to brainstorm a good solution for: some kind of a way to track commits back to the application transaction. And the second one that I'd love to see, number two, was just the counters and cumulative times, which we don't have yet in community Postgres. A lot of the code that we need is there. It's probably just a matter of adding some data structures, and if I have some time I'll try to work on it too, and brainstorm along with all of us.

Okay. So this is a bit of an introduction and a bit of background, but the real reason I think everybody's here today, well maybe not everybody, but I think for a lot of people, understanding the background and the philosophy is nice, but what you really wanna know is, okay, I've got a problem, how do I fix my problem? How do I actually use this stuff on my database, whether you're a developer or whether you're an administrator? So the gold standard here is Method R. I'm not gonna launch into the full thing here; that's my perspective on it at least. And the reason is, I actually think everybody should learn Method R. I'm just gonna say that. I think everybody should. And my first reason for saying that is that it is the best thing I have ever seen to teach you how to ask the right questions at the very beginning of the process. A lot of the Method R material teaches you how to do that at the very beginning. Just a simple example of this: I've had lots of interactions with customers or with Postgres users where somebody comes in and says, I've got a big problem, and okay, what's the problem? They're like, well, this one metric doesn't look right. And just pick your metric. And okay, well, we could dig into that, but sometimes you can spend two weeks digging into this one metric, whether it's some kind of a hit ratio, whether it's a latency metric, and then you can make an improvement, and after you put two weeks into this thing, maybe it doesn't actually have any impact on your actual application or your end users. And what Method R really teaches you to do is to take a step back and start at the very beginning: okay, we need to really start with user activities. Start with user actions. Define things in that way and then work backwards from there. That way you know that the work you're doing is gonna have an impact on the business, an impact on your application and on the end users, in a measurable way. It's very, very good for that.
We're not able yet to do tracing in Postgres, but what we have is the ability to do the sampling-based approach that I talked about at the beginning. And so here's a basic methodology that I'm gonna walk you through, which is based on a sampling approach. I didn't really invent most of this; this is what I have seen and a lot of people have seen. The reason to follow a method like this is that it is just the fastest way to get to understanding a problem and finding a way to resolve it and move forward. It's faster than anything else I've seen. Scrolling the wall of metrics and looking for the one that jumps out at you, or using your intuition, like, well, I've seen this one problem six other times, we should just first look at this problem because it's such a common problem. You really can find some great wins with those approaches. You can also sometimes end up, maybe not that anything bad happens, but you lose two weeks and then you're kind of starting over. This methodology is the most reliable thing I've ever seen to get you to a problem and an answer fast that actually makes an impact. So let's walk through it.

The starting point is having some kind of repository of historical performance data, and what this actually means is just that pg_stat_activity view in Postgres, and I've got a snapshot of it here. What you want to be doing is just taking snapshots of it on some kind of a regular interval. If you're doing a batch processing system, you might only need to grab a snapshot once a minute or something; if your job runs for 12 hours, that might be enough data. On a lot of transactional systems I've seen people do every five seconds or every one second. Those are pretty common sampling frequencies and that works fine; I think that's been proven to be pretty safe. But the key thing is you use the same interval. So you set a regular interval. There are some tools out there that can do this for you, or you can do it yourself. It can be as simple as this. This was from a blog post that I did a few weeks ago about UUIDs, and all I did was just a little bash script that you can see up here, and it's on the blog too. I sleep for 15 seconds and then I run a SQL query to just dump everything from pg_stat_activity in CSV format. I took that file, I opened it in Excel, I did a pivot table and then did a stacked line chart, and I got this. And this right here is a visualization of the average active sessions on the system. It can be as simple as that; you can do this with Excel and a CSV file.

There are also much more sophisticated tools out there. In the RDS world, Performance Insights does this for you. It takes those samples and stores them and then allows you to slice it and dice it. Go to the Postgres monitoring wiki page; there's a whole bunch of tools now. It doesn't tell you which ones do wait events and which ones don't, and I don't know everything. I do know that PoWA, the PostgreSQL Workload Analyzer, does have the ability to do a wait report by using the pg_wait_sampling extension. I believe Datadog DBM also has some support, pganalyze has wait event instrumentation, and there's probably more too that I don't know about. So there's a lot of sophisticated tooling available. PoWA is open source as well, so you know that's an open source option if that's what you're looking for. So that's the first step: have that repository. The next step is scope.
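Just to make the snapshot loop concrete before we get to scoping, here's a minimal version of the same idea expressed in SQL rather than bash; psql's \watch drives the interval, and where you store the output is up to you:

    -- One sample of what's active right now.
    SELECT now() AS sample_time, pid, usename, state,
           wait_event_type, wait_event, query
    FROM pg_stat_activity
    WHERE state <> 'idle';
    -- in psql:  \watch 15
    -- capture the output somewhere durable (a file, a table) so you can scope it later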
First, scoping, and scoping is really important. And then the second thing, after you scope, is to grab your top SQL and to grab your top wait events for that top SQL. Now, I snapped a photo from Cary Millsap's really, really old book, but it's a good picture about scoping, on the left here. The idea behind scoping is, in this picture there are seven people who are doing work in your database, all right? So we've got Preston, Alexander, Wallace, Nicholas in there. And what we're interested in is Wallace's problem. Wallace had a problem. Not Nicholas, but Wallace. And his performance problem only happened between time T1 and T2. So what scoping means is: I've got all this data, all these snapshots that were taken of pg_stat_activity. Imagine that each one of those vertical lines is pg_stat_activity. I've got seven connections to the database, one for each person, and they're each doing some work. And what we wanna do is solve Wallace's performance problem between T1 and T2. This is actually a counterexample, and what Cary is showing in this illustration, and it says it in the description, is that in this example we have correctly scoped down to the right user, but we haven't correctly scoped the time frame. And so when you just look at all of that data in aggregate, you're still not gonna quite get the right answer. So what scoping means is that you wanna get as narrow as possible, down to the specific user, the specific action, the specific time frame that you're trying to problem-solve for. And this follows on from, when you're doing a Method R approach, the first question you asked, which was, let's just get specific about the user impact and the user story that we're trying to solve here.

So once you've narrowed it down, then you're going to basically summarize the data that you're looking at there, and that's grabbing your top SQL. You can just count samples at that point. So aggregate and sort. I have an example just from a chat I had with somebody on the Postgres Community Slack a while back, it was a long time ago now. But in this example, you can see what they're doing is counting the number of sessions and then the state and the wait event. And it's kind of as simple as that. You can grab your top wait event and then, sorry, you grab your top SQL statement from those periods, because you can see the SQL statements in pg_stat_activity, and then you'll be able to filter on that SQL statement and see what the wait events are that are showing up. I'll show a small example of that summarization in a moment. Yes. In his example I think D was disk, C was CPU, S was synchronization, like a lock or something. The specifics aren't so important here. Yeah, I mean, the idea is that they're different: either it could be on the CPU or it could be waiting for something. And the profile will show you, as you start to count, you can see 80% of the time we were sitting on the disk, which we'll kind of come to.

So then the next thing, top SQL and top wait events: you have a top SQL statement, and what you're gonna do is break down your top SQL. You're gonna break it down in two dimensions. First you're gonna look at the waits for that SQL. The second thing you're gonna do is look at the plan for that SQL. And this is kind of like almost every time that I'm interacting with a Postgres user or a customer of AWS and it's a performance question.
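Here's roughly what that scoped summarization can look like in SQL. It assumes the snapshots were loaded into a table, which I'm calling activity_samples here (a made-up name), with the pg_stat_activity columns plus a sample_time; the time window and statement text are placeholders:

    -- Top SQL in the scoped window.
    SELECT query, count(*) AS samples
    FROM activity_samples
    WHERE sample_time BETWEEN '2024-03-14 09:00' AND '2024-03-14 09:30'
      AND state = 'active'
    GROUP BY query
    ORDER BY samples DESC
    LIMIT 10;

    -- Wait-event profile for the top statement; active samples with no wait
    -- event are conventionally counted as CPU.
    SELECT coalesce(wait_event_type, 'CPU') AS wait_type,
           coalesce(wait_event, 'CPU')      AS wait_event,
           count(*)                         AS samples
    FROM activity_samples
    WHERE sample_time BETWEEN '2024-03-14 09:00' AND '2024-03-14 09:30'
      AND query = '<the top SQL statement from the previous step>'
    GROUP BY 1, 2
    ORDER BY samples DESC;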
The first two questions are always the same: do you have a Performance Insights screenshot, and do you have your explain plans? It's always that. Much of the time you ask those questions and somebody goes and looks at Performance Insights, if they didn't already, and they're like, oh, I see the problem, and they fix it. Or they look at the explain plan, and if somebody's familiar with reading an explain plan, or if they're able to figure it out, a lot of times that takes you right to the answer. So EXPLAIN, if you're not familiar with it, is a SQL command that you can use. You give it your SQL statement and it tells you how it's going to execute your query. Most people in this room said you were familiar with Postgres, so I assume you're pretty familiar with it; I just like to run through it. I bet most of you have already seen these, but there are some tools out there to help with explain. I'm a text-oriented kind of person, so I'm a big fan of Depesz's website, because it's very text oriented. I do like to point out, if you didn't know this, that you can obfuscate queries if you use this tool, and another thing, FYI, is that this is available under a BSD license as well, so it is possible, if you work in a regulated industry or if you're in a large company and there are data sensitivity concerns, to stand up a version behind the firewall. That way you're not posting; you may not be able to post SQL plans to a site on the internet, but if you can, that's even easier, because his website is great and it'll give you something like this. So you put your SQL plan in there and it formats it very nicely. I tend to click the button that says exclusive color mode. The reason is, back to Method R, focusing on time and response time: this draws my eyes to the line. What it does is it looks at the time that each step of the plan is taking and it takes your eyes straight to the one that's taking the most time, which is where you wanna start. I've looked at plans that were hundreds and hundreds of lines long. If you're imagining a plan that's only five or six lines, it's really simple, it's probably not as big of a deal. When you start to deal with plans that have multiple subqueries, maybe correlated subqueries, doing a whole bunch of joins, you can get these plans that are very, very large and take a lot of work to read. So having a tool like this can be really helpful.

Now, if you're more of a visual person instead of a textual person, there are visual tools out there as well. pgAdmin is on the right here; Dalibo's PEV tool I have on the left. Both of these are under the Postgres license and available as well. And then one final plug I like to stick in here is for a tool that I've put out on GitHub called Decef. It's just kind of a wrapper around explain. What it does is, instead of just giving you the explain, it creates a whole report. The explain is in the middle, but then on the top it gives you the timestamp of exactly when the explain was run and exactly what version of Postgres it's running against. And because it knows the version, it's smart enough to make sure to include all of the explain options for your version. So if you're on Postgres 15, it knows you can add a couple of other options to get more data. It automatically puts the options on and dumps the plan.
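For reference, the kind of invocation that amounts to on a recent version is roughly this; the query and table names are just illustrative, and remember that ANALYZE actually runs the statement, so wrap anything that writes in a transaction you roll back:

    -- Options like SETTINGS need a newer major version (12+); BUFFERS and VERBOSE are old standbys.
    EXPLAIN (ANALYZE, BUFFERS, VERBOSE, SETTINGS, FORMAT TEXT)
    SELECT o.id, o.total
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    WHERE c.region = 'EMEA';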
The other clever little trick it has, which I really like, is that it's able to detect all of the tables that were accessed by your query. And then in this third section here, what it's actually doing is dumping out all of the optimizer statistics for just the objects accessed by the query. So it's got things like the number of distinct values, the cardinality of the tables, the number of blocks for each of the indexes; it'll tell you the selectivity and cardinality of those indexes, all of those stats. And to give a very concrete example of where this is helpful: literally just last week I was working with a Postgres user based in France. In this particular case there's medical data involved, and because there's medical data involved, all of my interactions with this customer, who happens to be an AWS customer, have a constraint: I'm not allowed anywhere near that system. So all of the interactions go through the support case. Can you run this explain and post the output into the case? Can you send this information? Tools like this are really helpful just for reducing the number of round trips. And if you're in a consultancy type of line of work, or even internally if you're on a central DBA team and you work with dev teams, I like tools like this because they get you a whole bunch of information in one shot, so you don't have to go back and forth a bunch of times. It pairs well, and I have a little note in the corner, with a thing called PG Collector; a quick plug for that too. It's a neat little tool that a coworker of mine, Mohammed, maintains. It's just a little script that you can run that generates a report for the instance, with things like the number of tables, the largest tables. Again, it's one of those things that reduces the back and forth with somebody when you don't have direct access to the system and you're trying to help them out. So those are a couple of tools around doing the explain. And the very last step, once you have this information, is you finally do your investigation. You've got your repository, you've scoped down your problem, so you know exactly which data you're looking at. For that data you've figured out your top SQL statement. For your top SQL statement you've gotten the wait events and the plan, and now you start to look. And what you look at is: what is the top wait event, and in the plan, what is the top step? So in your plan, for example, if there are three tables, maybe it starts with table A and uses an index; maybe then it joins that data in with table B, maybe using a hash join. What you're going to do is focus on the one step that takes the most time. A quick word on this: I mentioned Depesz's website before. There's a clever little trick with this website, which is that all of the plan nodes are clickable. So this is another thing: you don't need to try to learn everything about the Postgres optimizer today to become an expert. You can learn a lot of this stuff as you go. If you don't know what an index scan is, you just hover over the words "index scan" and click on it, and it takes you to a page like this where he explains right on his website what an index scan is, gives some examples, how it works. I find that to be a really helpful tool, and it helps you learn as you go when you're working with Postgres.
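If you want to grab that same kind of planner information by hand, the data mostly lives in pg_class and pg_stats; here's a rough sketch, with 'accounts' and its primary key standing in for whatever objects your query actually touches:

-- Table and index sizes the planner sees:
SELECT relname, reltuples::bigint AS est_rows, relpages
FROM   pg_class
WHERE  relname IN ('accounts', 'accounts_pkey');

-- Per-column statistics: distinct values, null fraction, physical correlation:
SELECT attname, n_distinct, null_frac, correlation
FROM   pg_stats
WHERE  tablename = 'accounts';

Having those numbers alongside the plan is usually what you need to judge whether the optimizer's row estimates are reasonable.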
And a lot of you in the room said you have quite a bit of experience with Postgres; these are great tools to pass along to the developers you work with and support as well. As far as wait events go, the best reference that I'm aware of right now, and I'm not just saying this because they pay me, is the RDS documentation. This was a really fantastic effort and project that I've talked about a little bit. In their docs, they've set up a section where they go through a whole bunch of the most common wait events, and for each one of those wait events there's a pretty long page, kind of like this, that gives you a little bit of context and background, some likely causes, and some suggested actions. So if you're seeing this wait event, what might you want to consider to improve the performance of whatever SQL statement is causing the wait? And that's a good question, because you asked about Aurora. I do want to mention there is separate documentation for Aurora versus the RDS open-source Postgres documentation, so you could accidentally land on the wrong doc page if you searched on Google. The Aurora docs are going to have information that is specific to Aurora. If you are working on community open-source Postgres, the RDS open-source documentation applies the same to open-source Postgres. All of this wait event material is the same; RDS Postgres does not change the architecture of Postgres. It's Postgres running on an EC2 instance with an EBS volume underneath it. So all of the advice here applies. Now, the one thing you will bump into is that it might mention a specific CloudWatch metric, like looking at your EBS IOPS. If you're running on your own server, on your own rack, you'll have to translate that: okay, what are the specs of my hard drive, or the specs of my storage array? But those translations aren't hard to do, and they don't change the basic architecture of how the advice applies. You don't want to land on the Aurora docs, though. A lot of it's the same, but since they're both there, just go to the RDS docs; those apply correctly if you're on community Postgres. So I have one quick example to shoot through, and then I don't want to keep you too long because I bet you're hungry. Let's walk through one scenario of putting some of this into practice, because I think seeing the visuals is what really helps make it click. For this scenario, we're going to pretend that we're a small bank, but we're very, very busy, all right. And when I say we're a small bank, I mean we're pretty small: we've only got 10 branches around the Los Angeles area. At each of our branches we have 10 tellers who work at that branch, and each of those branches around Los Angeles has about 100,000 bank accounts. And that's all we really need to track in our database. We have our branches, we have our tellers, we have our accounts, oh, and one more thing, very important: the auditors, the regulators. So of course we have to keep an audit history. And we are a bank, so what we do is keep track of money. The main thing we're going to be doing is that people come into our branches and they deposit or withdraw money.
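If you want to picture the tables, this is roughly the shape of them; it's a simplified sketch (filler columns left out) of what pgbench initializes, and the sizes described above happen to line up with pgbench's scale factor 10, ten branches with ten tellers and 100,000 accounts each:

CREATE TABLE branches (bid int PRIMARY KEY, bbalance int);
CREATE TABLE tellers  (tid int PRIMARY KEY, bid int, tbalance int);
CREATE TABLE accounts (aid int PRIMARY KEY, bid int, abalance int);
CREATE TABLE history  (tid int, bid int, aid int, delta int, mtime timestamp);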
So our basic transaction as a bank is pretty simple; I have it on the left side of the screen. The first thing we do when somebody comes into the bank and walks up to a teller is update their account, setting the balance to whatever the new balance is. Then we pull the new balance back out so we can put it on their receipt, okay. And then just a few more things. We are a bank, and we do want to keep track of how much money changes hands through each teller working at that branch, and we also want to keep track of how much money is in the branch in total. So we're also going to update the running totals for the teller and for the branch. And the very last thing, of course, is to insert our audit record into our history table. So this is pretty simple. Now, most of you have probably caught on by now that I did not come up with this idea. It was in fact invented by Jim Gray and first published on April Fools' Day in 1985, although it was no April Fools' joke; in fact, this paper from Jim Gray changed the database performance field forever. Something you may not have caught, but is really fun if you didn't know the story, is that this was originally called the debit-credit benchmark. Jim Gray came up with this bank idea, but he didn't just invent it: he actually worked with a real bank in the 1980s that was specking out and buying a new computer system, and they wanted to figure out where they could get the best price per transaction. And I think it's a great story. They ended up getting two bids; he mentions it in the article he published. One of the bids was for $5 million from a startup, the other was $25 million from a very well-established tech company, and they went with the $5 million bid. Their target, for that $5 million, was a computer system that could do at least 100 transactions per second. So that's what they were shooting for. So what happens if we run this? Now, of course, this is an old benchmark, and you all probably realize by now that it's packaged with Postgres out of the box. It's called pgbench, built right in, very easy to run and test. So what happens if we run it? Well, we are smoking fast. We get a whopping 6,000 transactions per second, and you did not have to spend $5 million for that, thankfully. So this is pretty cool. What I've done here is I just ran pgbench; I've got the commands on the slide, so you could run this yourself if you want to try it out. This was on a Xeon Ice Lake box with half a terabyte of memory and a 64-thread processor. And yeah, we topped out around 6,000 transactions per second. Now, the way I structured this, it's really more of a throughput test, right? But here's a question. Okay, we know we can get 6,000 transactions per second. So let's fire this thing up: I'm going to connect 100 clients and just run my 100 clients full bore. You see from the bottom here, at 100 clients we're right at our peak, running a little over 6,000 TPS. So when I fire up my 100 connections, what would you expect to see on the server? What would you expect to see? Let's try the old method first. The old method was where we pulled up all of our counters, our metrics.
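Before we look at the old method's counters, for reference, the five steps described above are essentially the built-in tpcb-like script that pgbench runs for each transaction, roughly:

BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers  SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime)
  VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;

The :aid, :tid, :bid, and :delta values are random variables pgbench picks for every transaction.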
Now, we are pushing this system to its saturation point. This system cannot push any more transactions per second through than what we're doing. We're at the limit, 6,000. If we wanted more, I don't know, we'd probably need some very fancy distributed system or something, right? So here we've got our 6,000 transactions per second, our super high throughput, without spending $5 million. But if you look at this, the CPU utilization is about 5% on this box. The disk metrics: the disk isn't even breaking a sweat. The latency is down under a millisecond. The throughput barely registers compared to what our limits are. The buffer cache hit ratio is 99.94%, which is very good. And you can scroll through pages and pages and pages of CloudWatch metrics and system metrics, and let me tell you something: you're not going to find anything that looks suspicious. By every single system metric that you look at, this system looks idle. It looks idle. So why are we topping out? Well, let's try the new method. The new method, the right method, is to start with wait events, to look at a profile built on response time, built on wait events. So we're going to go to that visualization I was talking about at the beginning, where we take average active sessions and stack them vertically. So I'm going to walk through a few things in this screenshot. First of all, right off the bat, it's a very different picture. All of a sudden, the system looks very loaded. I have a black arrow on the left pointing at this light gray line; that's the number of CPU cores I have on the box, the 64 threads I talked about. And we're running 100 connections, so 100 processes on this thing are busy all the time. So this is a load, which lines up: we had 100 connections, right? So 100 processes are running. Suddenly it looks very busy. Now let's work from the bottom up; I realize it's probably going to be hard to see in the back. On the very bottom is one of the things that Performance Insights does that is nice. This was a four-minute window of time, okay? For those four minutes, every one second we grabbed a snapshot of pg_stat_activity, and this summarizes all of those snapshots. Four minutes, four times sixty, so we're talking about 240 data points here, all summarized. What we see is that there are basically two SQL statements dominating that data, and the numbers mean something. The numbers mean that on average, during those four minutes, if you grabbed any point in time, about 48.80 connections were typically trying to update the tellers table, and 37.81 connections were probably trying to update the branches table. And then the second thing is the color bars, and the color bars tell you, for that specific SQL statement, what the wait events are. The ratios are actually pretty similar, roughly 60/40, something like that: about 60% is that light brown color, about 40% is the more reddish looking color, and it all correlates. Those two wait events are a tuple lock and a transaction ID lock.
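If you wanted to confirm the same picture straight from the database while the benchmark is running, a sketch like this against pg_stat_activity shows who is stuck on a lock and which session is blocking them; pg_blocking_pids() is a built-in function:

SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       wait_event,
       left(query, 60) AS query
FROM   pg_stat_activity
WHERE  wait_event_type = 'Lock';

With this workload you'd expect to see mostly tuple and transactionid wait events on the tellers and branches updates.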
The real power of this, and what I really want you to take away, isn't necessarily the specific events; it's how, when you start with a dashboard like this, it just immediately draws your eyes to the problem. And this doesn't just work for tuple locks and transaction locks. It still works for all of the classic old problems, like CPU being saturated or disk being slow. It works on all of these: it will immediately draw you to whatever the bottleneck in the system is right now. So if you're not familiar with the tuple lock and the transaction lock, you just Google it, go to the RDS docs, and basically it's going to tell you, really simply: you just have a hot record. Which actually isn't that surprising when you think about it, because this is a database. If everybody's trying to update the same branch at the same time, the updates can't all happen simultaneously or you would lose data, right? So if you have three updates, the later ones have to wait for the first transaction to commit or roll back, because you have to know whether it committed or rolled back before the next one knows what balance it's starting from for its math. So there are a lot of ways you could address this, but we're going to do something real simple. Instead of having every transaction update tellers and branches, we'll schedule a job, run it every few seconds, and update them in batch. This is such a common thing that it's built right into pgbench to skip steps four and five: you just tell pgbench to run in simple-update mode. And here's what we get if we do this. I've put the scripts on the screen again so you can run them and try it out yourself. This simple optimization to our application has taken us from 6,000 transactions per second to over 30,000 transactions per second, so roughly a 5x speedup. Now, again, the numbers aren't really the point. The really important takeaway is a couple of things. One is this: if you have a hot record, is a distributed system going to make you any faster? Is a bigger server with more CPUs going to make you any faster? No, right? Not in this case, because everybody's trying to update the same row; it doesn't really matter how many CPUs or how much scale-out you have. So understanding what your bottlenecks are is really crucial to understanding how to scale your load. This was a bit of a made-up example, but this does work in real life. A nice example of that, if you weren't tracking it, is that Midjourney had an incident in May of last year that Kyle Hailey blogged about at a high level. It's a great example of how powerful wait events are. In the Midjourney incident, they had an eight terabyte non-partitioned table in Postgres, this is all in Postgres, with outbound logical replication on that table, and they were running eight to ten thousand queries per second against it. And in May they had two incidents, four days apart, where all of a sudden the error rates at the application level just shot through the roof, and it came back to database performance. And these are the worst ones, right? Because there are no errors; it's like a brownout, things just start slowing down. But the amazing thing about it, and this was because they had all their dashboards set up and ready to go, is what you can see in the screenshots over here on the right.
The application error rate started spiking up to 1,000 errors a second, around 7:37. I don't know if that's UTC or local time, but around 7:37. And directly below it is the average active sessions that they had. This was the first place their main database operator looked, and you could immediately see that at the same time as the application error spike, there was this massive spike of 500 connections to the database sitting on something that says LWLock: lock manager. That's what they bumped into. Now, Kyle did some looking around, and that gave him the starting point he needed to very quickly trace this right back to a change four weeks prior. And again, these are the worst kinds of issues, because if the migration had just failed, you could roll back, right? If the migration had hit an issue even three hours later, you could kind of deal with it. But four weeks had gone by. Four weeks. They're in a much more difficult position; you can't just roll it back, it's not that simple anymore. They had done a migration to partitioning four weeks prior, and they didn't roll back the partitioning, but it turned out to be one of the factors contributing to increased pressure on the lock manager. If you're curious about the details, I've got a blog post you can go read. But I'm very aware everybody's hungry, so I'm going to keep going. I mentioned before that this doesn't just work on these. Now, the lock manager is not a common wait event; neither is buffer content. The things that are common, that you tend to see the most, are the transaction locks, CPU, data file reads, and WAL writes, for the read and write paths. And there's good documentation that will walk you through them. Again, one last time, I'm going to put this slide up, because I want this to be the last thing that you see and remember: your methodology for solving performance problems on Postgres. Think in terms of time and walk through it this way. Scope it down, get your top SQL, get your top wait events, get your explain plan. And I always end every presentation I do with what I call my happiness hints. You can find these on my website. They're a collection of things that I think every production Postgres database environment should have. None of these are controversial, and most of them pretty much everybody I know in the field agrees on. There are a couple where my opinions sneak in, like the scaling area, but it's not controversial stuff. And just to mention it, there are a couple of these that directly relate to this whole topic of wait events. In particular, you'll see on the right-hand side it says active session monitoring. It's basically saying, as a best practice, do active session monitoring. Have something in place that's grabbing snapshots of your active sessions, of pg_stat_activity, so that you have the data, and have it always on. If there's some kind of incident, you're able to go back, look at it, and do analysis. There are lots of ways you can do that. The one other thing on here is under the alarming section: the number one alarm. I have it in bold, because if you only have one alarm, this is the alarm you should have, and it is the alarm on average active sessions. On AWS it's also sometimes called database load. I used this alarm for years as an operational DBA. It's an amazing metric and an amazing alarm; there are very few false positives and it works very well.
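The underlying number is simple to sample yourself if you're not on a managed service; one common form of it, as a rough sketch, is just counting active backends and comparing that to your CPU count over time:

-- The metric behind the alarm: how many sessions are doing work right now.
SELECT count(*) AS active_sessions
FROM   pg_stat_activity
WHERE  state = 'active'
  AND  pid <> pg_backend_pid();

Sample that every few seconds, average it over a window, and alarm when it stays well above the number of cores.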
You tend to be the first person to know whenever there's a database performance issue. Your pager will go off, or you'll get the email, and you'll be the first to know. It's a very, very good metric to have. Basically, you're just looking at the number of active sessions, figuring out the right threshold, and setting up your paging based on that. All right, so that's it. Thank you for coming by. If you have questions, I will hang around and I'm happy to chat and discuss, but I want to make sure everybody's dismissed and free to go, in case folks are hungry, and then we can just ask questions up here.

[Q&A, partly inaudible] Yes, instead of the hot record... so on that one slide, when you look at the top SQL, there are two colored bars and they're both updates. What that tells you immediately is that your hot records are inside those two tables. And so then what I did, and I think this really makes the point: can you just skip over that part of it? Well, you're a bank. You really do need to know the balance of the branch, and you need to know how much money changed hands per teller, so you can't just drop those updates. So what I did was pull them out of the regular transaction and put them into a batch job that runs every few seconds. It pulls from the history table and applies the accumulated deltas to the tellers and the branches, rather than doing it inside every single transaction. So you're getting the same data; you're just not paying for those updates inside every transaction. Because those two updates were the contention, pulling them out of the regular transaction removed the hot spot. Now, that's one way you can solve it. There's actually a lot of fancier stuff you can do inside the database for this kind of thing, a lot of options, but that was the significant change here. [inaudible exchange about monitoring tools] They're very basic tools that are there for a lot of people, and the nice thing is they keep getting better. Yeah, thank you very much. Thanks, I'm glad to be here; it was great to talk to you. [inaudible question about graphing active sessions] Yeah, you can graph that and watch how it runs.
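As a rough sketch of the batching approach described in that answer, and not the exact script used in the demo: a job run every few seconds could roll up the history rows written since its last run and apply them to the branch totals (tellers work the same way); :last_run is a placeholder for wherever you track the previous run's cutoff:

-- Apply the deltas accumulated in pgbench_history since the last run.
UPDATE pgbench_branches b
SET    bbalance = b.bbalance + d.delta_sum
FROM  (SELECT bid, sum(delta) AS delta_sum
       FROM   pgbench_history
       WHERE  mtime > :last_run
       GROUP  BY bid) d
WHERE  b.bid = d.bid;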
And it'll do that stack-up; it helps you get back to that view and gives you a great look at it. [long inaudible exchange while people pack up]

One more question: when you're looking at these big plans, we were trying to track down the final statement that returns the data, and the index scan icons were different. It looks like one is just a straightforward index scan and the other one has two brackets on it, where it might be doing a recheck, and then we ran it again, because the issue was the sort order of the data underneath, how it's laid out. Do you have anything to add to that? No? Okay, I should have taken a picture of it. It's just odd; because it's only a drop-down in the UI, we had to use the filter and sort by what's in it. Okay, and what was your first question? Oh, the first question. The first question is about our publishing app, where we're trying to figure out the deltas. We scan activity and history, somebody edited a field and things like that, so we want to pick up the changes we're trying to publish. So we scan our regular table and then check whether there was a deletion that was reinserted, something like that, or whether a row was inserted and then deleted again before the next publish, so it's a reverted deletion. Those are the edge cases we're trying to check. So we end up scanning the whole tree of tables and it takes forever. I'm trying to optimize that, and there are a lot of cases where all we really do is update stats on some types of rows, and it varies database by database; it's not consistent. I think it has to do with the way the data is physically stored in the table. I just want to figure out where it is locking up. Sometimes when we run it, it works fine, but at certain times, when the database is under some other load, it blocks; it's very random. -- This is classic, and the trick is getting a trace and scoping it down. We'll talk about catching it in a little bit. So, okay, I have a suggestion. Are you on RDS? -- We're on one of those, yeah.
I'm not in a rush to skip this, but there is a trick you could use. There's a thing called application_name in Postgres; I've used it a lot in the past. You can set it the same way you set other settings. When you set it, it shows up in pg_stat_activity as a column. Normally this is just the name of your application, but what you can do, if you can go into your application and you have one process that's slow... Do you use connection pooling? -- Yes. -- Okay. So one of the questions is: the process that you're having trouble with, does it pull one connection? Is it all one unit of work on the same SQL connection? -- It's all inside one transaction, so it should stay on the same connection. -- It stays on the same connection, good, because if it hopped between connections this wouldn't work as well. So what you can do is, when you pull the connection out from the pool, set the application_name to a module name, something that makes sense, like history_scan. Set application_name, then go do all your work. How long does this thing run, minutes or seconds? -- It's supposed to be around three to five minutes, but for a large database it sometimes runs half an hour, and when it gets toward an hour we kill it, or something like that. -- Okay, this is actually exactly the kind of case where this helps. You can set the application_name at the start, and at the end of your process reset it. It's just a session-level setting, so it should be safe; it only overrides it for that particular connection. That way you can separate this workload from the vacuum load and everything else on the system, and you can see exactly what it's doing. And what would be interesting is, once you've got it tagged, you can filter your active-session view on it, like we talked about earlier. [inaudible exchange] You can put whatever you want in application_name; the point is just that it's something recognizable, so you can separate that one job's activity from everything else.
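Concretely, the trick looks like this; 'history_scan' is just a made-up label, and application_name is a real session setting and a real pg_stat_activity column:

-- Run on the pooled connection right before the slow job starts:
SET application_name = 'history_scan';

-- Then, while it runs, that workload is easy to isolate:
SELECT pid, state, wait_event_type, wait_event, left(query, 60) AS query
FROM   pg_stat_activity
WHERE  application_name = 'history_scan';

Resetting it when the job finishes (or letting the pooler reset the session) keeps the label from leaking onto unrelated work.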
-- It's catalog-based publishing, so multiple catalogs, and each catalog is a database; there are multiple catalogs across multiple things. -- Sure, I'll stay on the topic. [partly inaudible exchange about the attendee's consulting work and the monitoring tools they use] -- Our DBAs have their own monitoring, SolarWinds, I think, that they use. So we weren't using application_name, so I'm not gaining as much from that, but maybe I'll ask them to turn it on for this. -- Well, if you're on RDS, you have Performance Insights available. You should ask whether you can turn it on; it's probably been available on your instances for more than a year. They may not have enabled it, but you want to be able to use the monitoring tool, and some of that data is also what support uses when you open a case. So you should actually turn it on. -- Who do you work for, Amazon? -- I work for AWS. -- Cool, perfect. Thank you, I appreciate your time, and you're missing lunch. -- Yeah, but this is a very tricky problem to find. The customers are asking why the publish failed, or things like that, and it's a large load of data, and it only happens sometimes. I'm really curious to see what is going on. [end of session]

Okay. Do you want to try this one? Is it going to go off? I have no idea. Test. Okay. So I checked earlier and I think my font might be a bit small, so if you want to read the slides, maybe come a bit toward the front. My eyesight is getting bad; I couldn't read it from the middle, but maybe your eyes are better than mine. Okay. So, welcome back after the lunch break. I've been told there might be a sound check by a band in the other room in half an hour, so we'll see how that goes, whether we have to interrupt this talk for a bit. Welcome, everybody. My name is Michael Banck. I'm from Germany; I came all the way from Germany. I'm working for NetApp these days. That's a storage company that builds huge storage appliances, but they also have their storage in the cloud, and they have a cloud operations division now, and Instaclustr, where I'm working, got acquired by them, so we are now part of their cloud operations division. Before that I used to work for credativ, which was a German open-source consultancy and support company that got sold to Instaclustr. So without changing companies,
I've now been through three different companies. And today I'm going to talk about Patroni and Patroni deployment patterns. I have been working with Patroni for the last couple of years, among other things, and I'm also the Debian developer, or Debian maintainer, of the Patroni packages. Apart from that, you could say I'm a Patroni contributor. Postgres contributor is a bit easier to claim, because there's a dedicated team that figures out who's a contributor, who's a major contributor, and so on. For Patroni that doesn't exist, but we had a Patroni contributor meeting last year at the German Postgres conference and I was invited, and I'm organizing one this year, so I can probably call myself a Patroni contributor these days. All right. So what is Patroni? Patroni is, they say, a template, but in general it's a high-availability solution for Postgres. If you can't read it, just come a bit closer. I uploaded the slides to the... Thank you. Now it's in here, probably. This one's slow, so it won't switch quickly; I just have to switch the batteries one second. Right, you see? Test, test. Okay, still working. Okay. So, yeah, Patroni is a high-availability solution for Postgres, and I tend to think that high availability is complicated. So first off: if you are able to just do regular Postgres replication, and you have somebody on call, and you don't need five nines or whatever of uptime, maybe it's better not to use high availability with automatic failover. Sometimes there's some jitter in the network and the timeouts are not configured right, and then your high-availability solution will initiate a failover, and that means all the connections and transactions are rolled back and there is a downtime of at least a few seconds. It's really difficult to have a failover with almost zero downtime. So if you can avoid it, maybe that's the better solution, I have to say up front. But a lot of organizations and companies do not have 24/7 staff, or they don't have staff who are able to wake up and perform a manual failover in five minutes, because that's also pretty difficult: you have to diagnose a bit what's going on and hit the right thing. So automatic failover solutions certainly have a place. Now, Postgres itself has replication, but it does not provide high availability. There are some primitives, like streaming replication; you can have asynchronous or synchronous streaming replication. Standby cloning is pretty easy with pg_basebackup these days: it's basically one command to clone a standby and a second command to start it. You have hot standby, which means you can run read queries on the standbys for read scaling, and you can promote a standby. That's basically it. So the important part here, even if you do a switchover, is that you need to promote: first you have to demote the old primary, and there is actually no demote command so far in Postgres, so you have to shut it down and start it up again as a standby. Then you have to promote the other one, and that completes the switchover. So all of this orchestration for failovers and switchovers needs to be done by a high-availability solution, and then things like leader election, split-brain avoidance, quorum enforcement, and service failover are also an important part of it. So what is out there? These are the three high-availability solutions that exist besides Patroni and have been around for a while.
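Before looking at those, a quick illustration of the bare Postgres primitives just mentioned, as a small sketch using built-in functions and views (pg_promote requires Postgres 12 or later; older versions use pg_ctl promote instead):

-- Is this node a primary or a standby?
SELECT pg_is_in_recovery();

-- On the primary: who is streaming from me, and how far have they replayed?
SELECT application_name, state, sync_state, replay_lsn
FROM   pg_stat_replication;

-- Promote a standby to primary:
SELECT pg_promote();

Everything beyond these single steps, deciding when to promote, fencing the old primary, repointing clients, is exactly the orchestration a tool like Patroni exists to provide.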
So maybe the longest-standing is Pacemaker. That's a general high-availability solution initiated by Red Hat and SUSE back in the day; it's pretty old. There is a specific Postgres resource agent called PAF, PostgreSQL Automatic Failover. Pacemaker is pretty good in the sense that it can enforce quorum and do so-called STONITH, shoot the other node in the head: if one node realizes the other one is down, it forcibly turns it off via your iDRAC or whatever out-of-band management system there is, or the VMware tooling. One problem, or a couple of problems, with Pacemaker in the past, and we had several customers using it, is that switchovers are not very trivial. Basically, you tell the resource it has to be on the other node and then you have to clean things up afterwards; it's not very obvious, and it's also a bit difficult to tune the timeouts correctly. But one thing Pacemaker still has that the others lack is that it's possible, at least in principle, to do a storage switchover there. You have maybe a cold standby, a second machine where Postgres is not running, and you just switch the storage over if the first machine is down and start Postgres up again there. That is usually not ideal either, because it means you have to do crash recovery after the failover, which might take a while, but it might be a good solution in your environment. And then there's repmgr, which is, I guess, the oldest Postgres-specific replication solution, and as the name says, it was the replication manager. It wasn't designed as a high-availability solution; it was designed as a replication manager. Later on they tacked the repmgr daemon for automatic failover on top of it, and then they added a witness node, and it's a bit clunky. You basically still have to add your own shell scripts, I believe, in order to do fencing and such. So it's not the best; it looks a bit brittle. And also, it was developed by 2ndQuadrant, and after EnterpriseDB acquired 2ndQuadrant it's now a bit unclear what the upstream commitment really is. The person who was doing most of the repmgr commits was at the Patroni contributor meeting last year as well; they seem to be moving to Patroni too. So it's still working, and it's a good solution for manual replication management, but personally we've seen a lot of trouble with it, so I wouldn't recommend it. And then third, there's pg_auto_failover. That was initiated by, I think, Citus before it got acquired by Microsoft; I'm not exactly sure, maybe it was after they got acquired. It's a bit similar to repmgr in the sense that both of them only have things running on the Postgres nodes and nothing external. Pacemaker has Corosync, which is, you could say, a communication layer external to Postgres that handles quorum, but repmgr and pg_auto_failover just have extensions or databases or an agent on the Postgres nodes. However, pg_auto_failover was designed with a pretty good state machine: they know what happens if this node goes down, what happens with the other nodes. So it's actually, at least theoretically, well thought out. There's a monitor node for this kind of thing. But I also think its future is a bit unclear. I mean, it's being developed; I think it's now an independent project. Maybe it's just feature-complete, but it also hasn't really seen a lot of uptake, I think.
Personally, I don't know anybody who's running it. We also didn't have any new customers, for example, who were already running it. Is anybody running pg_auto_failover in here, maybe? Okay. Well, maybe I'll ask the other way around: who's running Patroni here? At least one person. Who's still running Pacemaker? No. Any other HA solution, repmgr? Okay. So maybe 80% of you don't run any high availability so far, which, again, could be okay. Right. So those are the three existing solutions other than Patroni. Any other solution that people might be running? CloudNativePG on Kubernetes, maybe; that's certainly a possibility nowadays. Okay. So let me give you an overview of Patroni. I'll give you an overview, I'll explain the architecture a bit and the operating concepts, then I'll give a deployment overview of how you could deploy it, and in the end I'll talk a bit about some of the issues you could have with it. So Patroni, you could say, is a cloud-native project in the sense that it was originally done for containers; it works in containers and that's what they designed it for, let's say. It's actually a fork of Compose's Governor, which came before it. There is one other project called Stolon, which also has a similar design to Patroni, but I haven't seen it being very actively developed either. I believe Compose is now owned by IBM, so ironically the grandfather of Patroni is now IBM, if you want to think of it that way. It's written in Python and it has a very Postgres-like license: you can do whatever you want. No copyright assignments, no open core, nothing. It's just there for you to use. The project was initiated at Zalando; they took Compose's Governor and made a fork out of it. I believe there are maybe one or two variables left that haven't been changed since then, but in general that's how it went. And they're still using it, I guess; they're a big Postgres user. If you don't know Zalando, it's basically a very small version of Amazon in Germany. They run their production databases on Postgres and they're using Patroni. There were two people initially doing this, Alexander Kukushkin and Alexey Klyukin, if I pronounce that correctly. Alexey has moved on; he's now at Timescale and he hasn't been involved with Patroni that much for the last couple of years. Alexander Kukushkin is still the main developer, I would say, but he's now at Microsoft. And there's one new person, Polina Bungina, who still works at Zalando, and she's one of the co-maintainers of Patroni; she got involved after Alexey left. So Polina and Alexander are the co-maintainers now. So, the major features. It's an agent; it runs on your instances, it configures replication, it will do switchovers, and it basically autopilots Postgres in this bot pattern that you hear about in these cloud-native circles. And then the important part is that it uses an external, or usually external, distributed configuration store, the so-called DCS, where the cluster state is written to. So it's not using Postgres as the database to keep the cluster state; there's an external key-value store, basically. And this is used both to implement leader election and split-brain avoidance, but also for cluster configuration, Postgres configuration if you want. And then on top of that, it offers a REST API that its own CLI uses, but that you can also use yourself for health checks, configuration changes, status, all kinds of things.
There's a Prometheus metrics endpoint you can use. No, I don't think there is a separate exporter for it; it's just a metrics endpoint that you can use directly. As I mentioned, Alexander Kukushkin is now at Microsoft, so one of the big recent additions was Citus support: Patroni now supports Citus, that is, scale-out sharded Postgres, and you can operate that with Patroni now. And then, optionally, you can use it with HAProxy. There are pretty robust HTTP checks on the REST API that you can use to set up load balancing, so the load balancer will always point to the primary, even after a switchover or failover. You can use HAProxy, but others work as well. And then there's a connected project called vip-manager that does virtual IP service endpoint management, but I'll get to that in a bit, I think. So what are the deployment options for Patroni? It's used in containers a lot. There's the original Spilo fat container image by Zalando, which bundles Patroni, all the major supported versions of Postgres, plus a lot of the extensions, plus, I believe, backup tooling, in one big Docker image, and then you can just deploy and run it in Docker or in Kubernetes. This is also maintained by Polina. I think there was a bit of a maintenance lag for a while, but she's on it again now; there are releases. And then there's the Crunchy Container Suite, which I don't think is a fork. There are several others; you could say every company with their own Kubernetes operator probably also has some kind of underlying container for it. The Cybertec PGO container is just used by their operator, and for Crunchy, I guess, it's similar. And then you have a couple of Kubernetes operators. I think every major Postgres Kubernetes operator except CloudNativePG is based on Patroni. There was a Timescale State of Postgres survey where they asked people lots of questions, and it seemed like CloudNativePG is actually the most popular Postgres operator, but they are not using Patroni; they're using Kubernetes primitives. It's a Cloud Native Computing Foundation project now, maybe that's why it's so popular, but it was founded by EnterpriseDB. But everything else: there's the Zalando operator, that was the original one, and then Cybertec forked it, so they have their own, maybe for branding reasons. Crunchy wrote their own Postgres operator based on Patroni, and then Percona forked the Crunchy operator to implement some new features, I guess. And there's also the OnGres StackGres operator, which is its own development. And of course you can use Patroni on bare metal, just pip install; it's pretty easy, it installs all the dependencies. And there are Linux distribution packages, which, well, I maintain for Debian, and there are also RPM packages from Devrim. That's one of the blog posts I wrote a while ago, how we integrated it. As it was written for containers, it took a bit of massaging, or talking to the maintainers, to actually get it working on Linux distributions, but it's working pretty well these days. Right. By the way, if there are any questions, you can ask them anytime. So, the architecture. It looks a bit like this. You have the distributed configuration store, down there in green, which is kind of abstract at this point; it's just a key-value store.
You have the Patroni service, which maintains and operates Postgres, and then you have some routing on top, where you route client connection attempts: via a middleware, or a virtual IP, or client-based failover. Now, the distributed configuration store. There are several implementations. I think if you run it yourself, etcd, with either the version 2 or the version 3 API, is the most popular these days. But, for example, my company has a managed Postgres platform, and they had Kafka there first, and Kafka needed ZooKeeper, so they run Patroni with ZooKeeper, which is one of the options as well, because they already had it running as a service. Then Consul is another one, and I'm not sure about the state of Consul, because it's a HashiCorp project and I haven't heard so much about it lately, so I'm not sure how well that works. Of course, you can run it against the Kubernetes API as well. And then there's a Python Raft implementation called PySyncObj, which basically means you can run the DCS as part of the Patroni process itself, but it's deprecated, unfortunately, because the Patroni maintainers say it's very difficult to debug if there is a problem and there's not a lot of insight into it, so they deprecated it. I believe it works in general. It's already not so great that if etcd breaks, you have to be an etcd admin as well; maybe the one big downside of Patroni is that you're not only a Postgres DBA, you may also have to care about etcd if you run it yourself, and with this PySyncObj thing you'd have to debug that too. Generally, the DCS uses the Raft consensus algorithm, which means you have at least three nodes and a quorum between the nodes, and every change to the cluster state, every key-value change, goes through the Raft consensus algorithm, which makes sure there is only one source of truth and everybody is on the same page. The key changes are done via atomic compare-and-swap operations, and then you have automatically expiring keys with a time-to-live, and watches, to implement the whole thing. So the main point of all this is split-brain avoidance. You have quorum via the DCS, and the primary, the leader of the cluster, periodically updates a so-called leader key in the DCS with a time-to-live. Usually that's 30 seconds, and it updates it every 10 seconds. And if the leader key then expires because the primary is no longer there, there's a new leader election. Fencing might be a bit problematic, but in general it works like this: if the primary detects that it cannot update the DCS, and it's not in the so-called failsafe mode, which I'll get to in a minute, it will demote itself automatically in order to make sure there is no split-brain scenario. In failsafe mode, it will instead check whether all the other nodes are still reachable, and if it's just the DCS that is down, it carries on. And the other way to have fencing is a local watchdog device, if you want, on a physical server I guess, and the watchdog can then shut down the node if it's unresponsive. I haven't seen that used much in production, but I think it's a useful concept. So there are three loops and timeouts in Patroni that are important. First, the TTL; I just mentioned the default, I think it's 30 seconds. That's how long the leader key is valid, or active, in seconds.
And then there's the so-called loop_wait, which means that every 10 seconds Patroni wakes up, checks whether it's still following somebody or whether it can update the leader key, does its main loop, so to say, and then goes back to sleep. And then there's the retry_timeout, 10 seconds, which is how long it waits trying to reach the DCS if it's not reachable. So how does the leader race work? Let's assume Patroni 1 is the leader at this point; that's written here as the leader key, with a time-to-live of 30 seconds, and Patroni 2 and 3 have a watch on this key in the DCS, which means that as soon as there's a change there, they get notified by the DCS. So Patroni 1 does an update with a new time-to-live and that works. So what happens next? Well, that's the steady state, let's say. But if Patroni 1 then goes down, the others still have the watch, and for now the leader key is still active, but as soon as the time-to-live expires they get a notification from the DCS that the leader key has expired. What they do next is check the other Patroni nodes: Patroni 3 makes a REST call against each Patroni endpoint and sees what happens. It will get a timeout from Patroni 1, and it will get a transaction log (WAL) position from each of the others, and they compare those. There's usually a setting that says you're not eligible for the leader race if your replication lag is more than X megabytes. But if both of them are within that limit, they will both run the leader race, trying to create the leader key anew with their own name, and, thanks to the Raft algorithm, only one of them will actually get that update through. In this case it will be Patroni 3, and Patroni 2 gets a failure back. So Patroni 3 is the new leader, Patroni 2 puts a watch on the leader key again, and the first node needs to be rebuilt: it has to be restarted, maybe it can just be rewound and reattached, maybe it has to be re-initialized, or in the case of Kubernetes the pod has to be recreated, I guess. Now, just a few operational concepts before I get to deployment patterns. The most important thing, let's say if you're not running it fully on autopilot in Kubernetes, is the CLI, which is called patronictl. It gives you this list here, where you can see which node is the leader, which are the replicas, and also whether the replicas are actually streaming, which means they're in the correct configuration, and which timeline they're on. If the timelines differ here, that means there might be some problem, and the replication lag is shown here as well. But it can also be used for other things, like show-config, or edit-config, which will drop you into your editor so you can make configuration changes, and afterwards they get applied automatically to all the nodes. You can also do a switchover here, or a planned restart, if for example you're installing a maintenance release.
Right. Another thing that's important is client failover. You have two, three, or more Patroni nodes, and obviously the client needs to connect to the current leader, including, of course, after a switchover or failover. As I mentioned, HAProxy can do that: it can monitor the REST API endpoint and check which node is the leader. We have some information here. By the way, I tried to upload the slides to the conference system but they haven't shown up yet; I'll certainly try again, and otherwise you can also just approach me after the talk and I can send them to you. The other option is vip-manager. It's a project that runs as a service on the local node, and it's configured to check the DCS; it works with etcd or Consul, and it checks the local node's state. If the local node is the leader, it configures the virtual IP, and if the local node is no longer the leader, it de-configures the VIP. Now, why is that a separate agent or service? Because in order to configure or de-configure an IP address you need administrative access, and Patroni usually runs as the postgres user, who is not allowed to configure an IP address, whereas vip-manager generally runs as root. The main problem I see nowadays with vip-manager is that it's very aggressive about de-configuring the IP address if it cannot reach the DCS. If there is some trouble with your DCS, like etcd went down for some reason or something else, then vip-manager will very quickly de-configure the service IP, which basically means the service is down. And I mentioned earlier that there is now this failsafe mode in Patroni, where Patroni itself can keep running even if the DCS is not available; if you run it together with vip-manager, that doesn't help much, because vip-manager has already brought down the VIP. So maybe the better option, I think, is to use client-based failover if that's possible. For a few years now you can specify several hosts in a regular Postgres connection string, so all the libpq-based drivers can do that, and you can say target_session_attrs=primary, and then it will automatically connect to whichever of those hosts is the primary right now. For JDBC that has been possible even longer; it's just called something a bit different, targetServerType. Only very, very old versions of pgJDBC can't do it. So it depends a bit on whether your application allows that, whether it just takes a connection string you can edit. With PHP it might be a bit problematic, because it might only want one host, or it doesn't allow you to add these options. But in general, if you just have one connection string, maybe that's the easiest way to do it. Right, and then there are configuration tags. I'll just skim over this because it's getting a bit long, to tell the truth. Basically, you can have local tags in the configuration for each node if you want it to have a special meaning: for example, it shouldn't be failed over to, or it shouldn't be part of read load balancing for some reason. So if you have a topology where one of the nodes is different from the others, you might want to do it this way. Okay, then just a few words on replication modes. In general, or by default, Patroni uses normal asynchronous physical replication. It can do synchronous replication as well if you want, and then it has two different modes: synchronous mode and synchronous strict mode. So, the question with synchronous replication is always: what do you do if the standby is down?
Okay, then just a few words on replication modes. By default Patroni uses normal asynchronous physical replication. It can do synchronous replication as well if you want, and then it has two different modes: a synchronous mode and a synchronous strict mode. The question with synchronous replication is always: what do you do if the standby is down? There is this big problem in general with Postgres synchronous replication that the primary will then wait forever for the standby to actually commit a transaction and will never continue, so effectively the primary is down as well. What Patroni does in this case, if no other standby is available to become the synchronous standby, is fall back to asynchronous replication, so the whole thing can continue to work, but then you don't have synchronous replication anymore. The other option is strict synchronous replication, where it will really stop accepting writes rather than falling back to asynchronous, which might be the right thing to do for a bank, for example, that never wants to lose any transactions, or for somebody else for whom that would be very problematic. So you can configure both of them; falling back automatically to asynchronous replication is, for example, also what Pacemaker is able to do. Then cascading replication is also possible, where a replica replicates from another replica rather than from the leader, but that's not a major feature, I would say. The other thing is that it helps a bit with logical replication; there is some discussion now in the community about whether that's actually failsafe or bulletproof. The main problem with logical replication is that if you do a failover or switchover of the physical primary, then on the new primary the logical replication slot will in general not be positioned correctly, and logical replication will break after a switchover or failover. What Patroni does is let you configure a permanent replication slot, and then it will advance the slot on the replicas as it is being advanced on the primary, so in theory, after a switchover, logical replication should continue working. There is now actually a pg_failover_slots extension, and the authors say that at least one of the problems pg_failover_slots solves is not solved by Patroni, so there needs to be some integration here, which is pending, and hopefully it gets better. Probably Postgres 17 will also have much better support for logical replication failover, so maybe that's just going to be integrated into Postgres and then this problem goes away. And then finally there are some replica creation options. Basically, if there is no data at all in the beginning, Patroni will just run an initdb - it bootstraps the leader, whoever won the leader race - and then the others will just run a pg_basebackup; that's easy. But if you already have several terabytes of data - and by the way, you can also take over an existing cluster, Patroni is pretty gentle about it and you don't even need to restart Postgres, I believe, and that just works - the problem is when you need to re-initialize a standby because there are some problems with it, or you want to add another standby, and there are already terabytes of data; then pg_basebackup might not be the best tool for the job. So what you can do is configure a specific replica creation method, in this case pgBackRest, but WAL-G from an S3 bucket or something would also be an option, to restore a backup instead. That means there is no additional strain on the primary, because with pg_basebackup you have to ship all the data from the primary to the new standby, and this way you can go via the backup, which might be easier. Also, in this case it does a delta restore, so if there was already some data there and you just want to rebuild it, pgBackRest for example might be much faster, because it only overwrites the changed data and doesn't wipe the whole data directory and start from scratch like pg_basebackup would.
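A minimal sketch of what such a replica creation method might look like in the node configuration; the pgBackRest stanza name and the exact restore command are placeholders, not something from the talk.

```yaml
# Sketch: per-node Patroni configuration with an alternative replica creation method.
postgresql:
  create_replica_methods:
    - pgbackrest          # try the backup-based restore first...
    - basebackup          # ...and fall back to pg_basebackup if it fails
  pgbackrest:
    command: /usr/bin/pgbackrest --stanza=demo --delta restore   # stanza name is hypothetical
    keep_data: true       # keep the existing data directory so the delta restore can reuse it
    no_params: true       # Patroni should not append its own connection parameters
```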
Okay, and then let's finally discuss some of the deployment topologies. In general you can run Patroni in a single-node setup, which might not make much sense on its own, but it can be useful if you have several clusters with multiple nodes plus a couple of other single-node instances and you want to administer them all the same way, so that you have one dedicated, clear way of operating and not two, where you configure one node like this and the Patroni nodes like that. Otherwise you can have up to n Patroni nodes, it's not really limited. Two nodes would of course be the minimum for high availability, and three nodes are pretty typical, especially, as we'll see, if the DCS runs locally: you need more than two DCS nodes for anything other than a POC or a minimum viable setup, because the DCS should not be a single point of failure, and there should be an odd number - three, five, seven and so on; usually you would have three. Then there's the big question of whether the DCS should run locally on the Postgres/Patroni nodes or in its own cluster. My advice usually is: if there's already an etcd cluster or something - maybe not one that's part of Kubernetes, because that might not be the best idea if you're not running Patroni in Kubernetes - then it's probably best to just use that Consul, ZooKeeper or etcd cluster, because the overhead is pretty small. Also, if you decide to have more than a couple of Patroni clusters, then it probably makes sense to have a dedicated DCS cluster. Otherwise, for simplicity, it might be useful to run the DCS locally, but then you should always consider that the DCS needs enough resources to do its job as well; if the Postgres instance itself is overloaded, that might take down the DCS, and that might take down the cluster. And then again, you could do it with the local Raft implementation, but that's deprecated. So that's how it would look for a one-node setup with a dedicated DCS: you have Patroni and Postgres as the leader and Patroni talks to the DCS; with a second node you just have the same, and you can add more nodes as you want. Now, if you choose to run Patroni with a local DCS - I don't think it makes a lot of sense to run a single-node Patroni with a DCS cluster - what you can do is run a two-node Postgres/Patroni cluster, where the third node is just a quorum node that only runs the DCS, so it could be a much smaller VM, for example, if you have those. But I think the most frequent setup would just be the three-node local DCS cluster, because then everything is the same, all of the nodes are equal. I mean, if you have a lot of storage and you don't want to replicate all the storage to a third node and you don't have other needs like synchronous replication, then of course maybe just use a two-node cluster, but otherwise a three-node cluster is just as easy, even for the really small ones. And then finally you can have an internal DCS, that would be the Raft case, where you just have three nodes. And then the other big question we always get asked is: can we run Patroni in multiple data centers? That's possible, and there are basically two ways to do it. The main problem is with two data centers: usually the customer says, we have two data centers, we want to run it there.
But the biggest problem with that is that in order to have the quorum work right, you really need three data centers, or three availability zones or whatever, and that is often a problem: people just cannot supply that. But you can basically run multi-region, multi-data-center, multi-availability-zone, whatever you want, and there are two ways. You can either run a single Patroni cluster stretched over multiple data centers, and then you have automatic failover and you can have synchronous replication if you want, or you can run separate Patroni clusters in separate data centers. The latter you can also do with two data centers, while the first should be three data centers if possible; but in the separate-clusters case there will only be a manual failover, basically, and you cannot have synchronous replication. Both are also explained in the Patroni documentation. Maybe just to expand on that: the Postgres documentation is legendarily good, I would say - it could of course always be improved, but it's generally considered really good for an open source project. The Patroni documentation isn't bad, but it's also not super great, I would say; it's unfortunate, but they keep improving it as well. This part is actually pretty new, I think, so just check it out, it's pretty well explained there. So the single Patroni cluster over multiple data centers would be a stretch cluster, and you should always keep in mind that latency might be a problem if they are too far apart; if it's in different parts of the country, east coast and west coast or whatever, that's not great, so there should probably be less than a couple of dozen milliseconds of latency between the data centers for it to work well. And you need three data centers if you want automatic failover, that's what I mentioned before, and you need at least one DCS node per data center, but you don't necessarily need a Patroni node in the third data center. So it would look a bit like this: you have three nodes in this case - well, this could be decoupled of course, there could be a separate node here if you want, they don't need to be combined on one node, but let's keep it simple - and these are the three data centers, with some synchronous replication going on in this case, and the DCS here. So either this would be three nodes for the DCS and three nodes for Patroni, or just three nodes in total, depending again on how you want to set it up. And the other option would be the so-called standby cluster functionality. That's multiple Patroni clusters, two distinct clusters, each with its own DCS cluster, and the standby cluster's leader replicates from the primary cluster. Basically you have to configure this in the standby cluster, the second cluster: you configure the host it should replicate from, so either you put in all three hosts, or however many nodes there are, or just the client-facing VIP if that's possible, and then you need to configure the replication slots correctly. I would recommend just having the slots set up on both sides, and then the primary_slot_name would be the one from the other data center - I mean, this would be the slot name that gets configured in the DCS for the DC1 nodes. Failover then works by running edit-config on the standby cluster and removing the standby_cluster section - just remove those three lines, save the config - and then automatically Patroni will say: oh, I'm no longer a standby cluster, I'm a real cluster, it promotes the leader, and then you have switched over.
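A minimal sketch of that standby-cluster section as it might appear in the dynamic configuration of the second (standby) cluster; the host and slot names are made up for illustration.

```yaml
# Dynamic configuration of the standby cluster (edit with `patronictl edit-config`) - sketch only.
standby_cluster:
  host: dc1-vip.example.com        # hypothetical VIP, or one of the primary cluster's hosts
  port: 5432
  primary_slot_name: dc2_standby   # slot created on the primary cluster for this standby leader
# Removing this whole section and saving the config promotes the standby cluster.
```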
If you want a real switchover, you should first add the standby-cluster configuration to the other, formerly primary cluster, otherwise you have a split-brain scenario. There is no orchestration for this at the moment, unfortunately; there's no way around it, because these are two different Patroni clusters and there's no super-patronictl or something that could coordinate them. So either you have to switch the first one off, or you have to make sure it's a standby cluster first. It's a bit heavy, but in general it works pretty okay. And it would look like this: you have the first cluster, then asynchronous replication, and the main point is that you only have one replication stream across the data centers; the other standbys do cascading replication from the standby leader. Yes - is there a solution where you can have multiple writers, multiple leaders? No, not with Patroni in that sense, because that's generally a hard problem in the Postgres space - I don't think there is a good presentation here about this - but Postgres is getting better at it; that's the multi-master thing, let's say, and Postgres 16 already allows active-active setups, but there are a lot of caveats, you have to be very careful, and you probably need application support, because that only works with logical replication and not with the physical replication that Patroni uses. Patroni uses physical replication, which means it's basically a byte-by-byte copy from the leader to the standbys, the followers, and the followers can only run read queries - they are hot standbys. If you do logical replication, which Patroni kind of supports in the sense that it will help you with failovers but will not manage logical replication for you, then you can in principle write on both sides, but you get to keep both pieces if they break - and it's still pretty easy to break, unfortunately. There is no DDL replication, sequences are not replicated, so it really needs either a good DBA or good application support; it's kind of an orthogonal problem to Patroni, I would say. And also - this is the other part - there is no (well, maybe one or two, but no really good) logical replication management system for Postgres, the way Patroni now is for physical replication. I think that's maybe a gap; somebody should write a project that handles logical replication better. Does that answer your question?
Right, and now finally some words on DCS caveats and considerations. The biggest one is probably this: right now Patroni is at 3.2.1, I think, something like that, and while it was still on version 2 it was really reliant on the DCS being available all the time. Every 10 seconds it would try to write to the DCS, and if the DCS was not available for even a couple of seconds, it would demote the leader, and the cluster was then no longer writable for the application - downtime. That was considered a big problem by a lot of people. It does exactly this: the retry timeout expires, it cannot communicate with the DCS, it demotes itself because the DCS is not accessible and it was the leader, and then you have three standbys and no leader, and the application can no longer work. In my experience this is especially a big problem with etcd: if etcd cannot write its state for a second - for example because you run it on bad storage with VMware, you take a VM snapshot and the snapshot takes all the resources - then the cluster would just fall over. We would frequently have these problems with customers: Patroni is falling over every day, the database is not available, it's a big problem. What version 3 brought was mostly Citus support and the so-called DCS failsafe mode. This is an additional key called failsafe_mode, and it's pretty easy: the only thing you need to do is set failsafe_mode to true, that's the whole configuration, you don't have to configure anything on the nodes, just add this to the cluster configuration and that's it. What it does is: if the leader cannot access the DCS, it queries all the other nodes and asks them, hey, are you still there and what's your state, and if they all answer back and say, yeah, I'm fine, I'm still following you, everything is okay, then the leader will continue working, and it keeps asking them, and it continues working even without the DCS. So during these outages of the DCS there will be no failover or demotion; it's kind of like a maintenance mode. In the example here, the leader gets an error communicating with the DCS, then it gets a response from pg2 and pg3, and it continues to run as leader because failsafe mode is enabled and all members are accessible. There's also some documentation on this; it's a huge improvement, I think, in terms of usability of Patroni.
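As a sketch, enabling it really is just one key in the dynamic, cluster-wide configuration:

```yaml
# patronictl edit-config - dynamic configuration, sketch only.
failsafe_mode: true   # leader keeps running during a DCS outage as long as it can
                      # reach all cluster members over the Patroni REST API
```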
And then the last point I want to make is some etcd considerations. I already mentioned that you really need to give it resources if you run etcd locally. If it's just for a small database and you don't expect any kind of overload situation, then it's okay not to do too much, but if you really have a high-performance database, where it might be that there are so many connections that the box is CPU-starved or IO-starved, then you really need to give etcd its resources, maybe also via cgroups: a dedicated core is certainly possible, a dedicated network interface, and a dedicated storage device if possible. Also, one thing we noticed - I haven't really tracked it down too much, and everybody is running etcd version 3 now, but there are two versions of the API, v2 and v3, and they're pretty different, though that's a different story - is that on the v3 API, maybe because of how Patroni uses it, the etcd database can fill up; it has a quota of around 2 GB, and if it fills up it stops working. So you have to switch on auto-compaction-retention with some value, either via the environment variable or the command-line option, in order not to run into this problem - just so you know. And that's it; are there any other questions? Yes - well, you should not limit etcd, you should give it enough resources first of all, but what happens if etcd shuts down? Yes, I mean, if you don't have failsafe mode activated, then usually that means there is no longer an etcd quorum, there is no leader, Patroni will not get its reads out of or writes into etcd, and if there's no failsafe mode then Patroni will demote - that's my understanding of the situation. Yes, I think - etcd doesn't need a lot of resources, so you can go with three small VMs, and I think that makes sense, especially if you plan to have more than one or two Patroni clusters; then it's just three VMs, and whether you have nine or six in total, the additional cost is negligible and it just makes things easier, I would say. Any other questions? Yes, the question is about security: repmgr needs SSH, so what about Patroni - security in terms of data security or compromise? You don't need SSH. The Patroni instances communicate with each other over the REST API, the REST API can be secured with TLS and client certificates if you want, you can also secure the communication to the DCS with TLS and additionally with passwords, and you can have a dedicated user for the DCS, depending on which one it is. So that's the main thing, it doesn't use SSH. It might use the Postgres protocol to talk to the other Postgres instances, not just the local one, but in general it doesn't need - and we make sure it doesn't need - a superuser connection to the other hosts; it just needs a replication connection, or a pg_rewind user. You can configure a pg_rewind user that can rewind an old primary without being a superuser - you just have to grant it the right functions, and Patroni can do that itself on bootstrap. Does that answer your question? We haven't seen any incidents - we can maybe discuss it later, I'm not quite sure what your problem with repmgr was - in the sense that I haven't really seen issues; certainly it's a bit more involved if you really want to secure it, you should have TLS everywhere and that makes configuration more difficult, let's say, but otherwise it's okay. And there are also a couple of read-only REST calls, and you can make it so that the ones like failover or configuration changes actually require a username and password on the REST API call, so that it's not so easy to abuse even if you require TLS. Any other final questions? I think we're running out of time.
Yeah, the data - yes, anything below the instance level you just do in Postgres directly. The only thing is that Patroni starts and stops Postgres, and it doesn't like it if you stop it yourself, but that's kind of a DBA thing. You connect to Postgres, you create the database, that's automatically replicated, you do all the changes, schemas, users. You can configure Patroni to create databases, I think, and maybe users, but it's not necessary; the only thing it really wants to do initially is possibly create a rewind user and a replication user, and everything else is up to you. Yes, any final question? Yeah, like this: okay, well, usually what the client would do is, the next time it tries to run an SQL command over the connection, or start a new connection, it will get an error. If it starts a new connection then it's not a problem, because it would automatically see this is no longer the primary and would try the next host. And if it's still connected, I think the next time it tries to run a command it would get an error back that the connection is closed or whatever, and it would fail over. I think it's in general quite quick, but I'm not 100% sure what happens if there is a timeout - how long it takes if the leader is still there but not really responding, in the sense that the port is still open; I'm not 100% sure how that would work. Right, okay, sure, yes - there is this time to live, and for the failover I think it's 30 seconds, and you can configure it yourself; if you want to tighten up the timeouts you can certainly do that, but once the time to live is up, it depends a bit on when the last loop was, and then there will be a new leader election. So you can configure it this way; if you need it in 10 seconds, you can do it, but again, if it's very tight it could be that there is a false positive and it does a failover even though the other node is just a bit behind. Right, okay, so thanks everybody, we're out of time, thank you; if you have any further questions I'm still here, tomorrow should be better. Yeah, okay, then let's start talking about pg_hint_plan, an extension. I'm Franck Pachot, I'm a developer advocate at Yugabyte. YugabyteDB is distributed Postgres: we use Postgres for the SQL layer, but on distributed storage, a kind of Spanner-like architecture, for high availability and resilience, and because we use Postgres for the query layer we can use Postgres extensions that work at the SQL layer, and we install pg_hint_plan by default. I really think everyone should install it. It's easy to install; it may not be easy to understand exactly how it works, because some things are not really well documented, and the goal of this session is to go through all the little things you should know to use it correctly. Of course, interrupt me whenever you want if you have any question. The goal is really to explain how to use it correctly, not to convince you to use it. I have only one slide to explain why we need hints: most databases offer hints by default; in Postgres it's not there by default, there is an extension. This is the way it was when I arrived at the airport: I want to get somewhere, so I ask Google, blindly trust Google, and take the best route. But Google Maps also lets you choose some options - maybe you prefer the train, maybe you prefer to walk - and even within that you have options, like fewer transfers between trains, choosing which kind of transport you prefer; and you can also change the route if you have more knowledge about the traffic than Google Maps - you have that possibility.
So you have something that provides you with the fastest path to go somewhere, but you may at least want to change it, to see: what if I take the train, what if I go through this place? That's Google Maps. The query planner in Postgres does exactly the same: it finds the optimal path to your data. It would make no sense to provide something like Google Maps without those little options you may want to choose, and in my opinion it's the same for a query planner, for an optimizer. Even if most of the time it comes up with the best result, at some point you at least want to understand the plan chosen by the query planner and do a what-if: what if I use this index? Because you have created an index, it doesn't pick your index, and you want to know why - is this index usable at all? So what if I force it, to see whether it is used, or whether my index is just not as good as I expected, or maybe the planner doesn't have enough statistics. So we need more control over the query planner - maybe not in production, except for some workarounds, but at least to understand an execution plan - and there is a nice extension for that. First, why hints in SQL? SQL is a declarative language; the SQL syntax doesn't tell the database how to get the data, you just declare the result you want from tables. Even in the SQL language there is nothing about indexes - that's an implementation detail the optimizer can choose - so the SQL code will never tell the database: start with this table, use this index. But you may want to understand those choices and have more control over them. There are also workarounds for the execution plan in production: maybe the cause is a bug in Postgres, maybe the cause is bad statistics, but you need to fix it now, and maybe you can fix the plan immediately and think about the root cause later. And the query planner can estimate the cardinalities and the cost only from the input it has - basically the indexes and the statistics - while maybe you have more knowledge about your data: you know you are combining a lot of predicates in the WHERE clause and maybe there is some correlation the optimizer doesn't know about. That can also be a case for using hints. So how do you get that in Postgres, where you don't have hints by default? You have an extension - this idea has never been accepted into the core of Postgres - but it's quite easy to install; the installation documentation is really easy, there are RPMs, there are many ways. I'm just showing how I installed it in a Docker image, using the RPM - look at the documentation to install it. On YugabyteDB we install it by default, so it's always there. What is less documented is how it works, and that is what I will focus on. So first, SQL is a declarative language, so anything that tells the database how to do it cannot be part of the SQL language, and this is why the hints are put in comments - in comments you can put whatever is not part of the SQL language. They are special comments: they start with a plus sign, and the extension looks at the SQL text, finds those special comments and finds the hints inside. We will see examples. In which kinds of statements? SELECT, INSERT and so on, and also PREPARE and EXPLAIN - on the PREPARE, because that's when the execution plan is set, not on EXECUTE. It's quite easy to understand how it works at that level, but the most difficult thing to understand is that, if you want to really control an execution plan, one single hint may not be sufficient, and we will see some examples of that.
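As a quick sketch of the basics - the table, alias and index names here are made up for illustration, and depending on how it was installed you may also need to load the library (for example via shared_preload_libraries) in addition to creating the extension:

```sql
-- Install the extension (RPM/DEB packages exist for most platforms).
CREATE EXTENSION pg_hint_plan;

-- A hint is a special comment starting with "+", placed before the statement.
-- Here we ask for an index scan on the alias "a" (hypothetical table and index names).
/*+ IndexScan(a my_index) */
EXPLAIN SELECT * FROM my_table a WHERE a.id = 42;

-- For prepared statements, the hint goes on the PREPARE, because that is
-- when the plan is built, not on EXECUTE.
/*+ SeqScan(a) */
PREPARE q(int) AS SELECT * FROM my_table a WHERE a.id = $1;
```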
Okay, first: this special comment - where do you put it? Usually I put it in front of the query. You can put it after some keywords, but only after those; that's what I've seen in the code. It can be helpful to put it in front of the SELECT, so you can add an EXPLAIN to see the plan, because EXPLAIN and the options of EXPLAIN then go before the hint. But do not put it at the end; it's always better before, because that is how the extension looks for it, and if there is any error in it, it will stop parsing. That's also the problem: it is in a comment, so you don't really have syntax checking for it - be careful. We will see how to check whether your hint is used or not, but if there is a syntax error in it, of course the hint with the syntax error has no effect, and neither do those that follow, because the parsing stops. The comment must be the slash-star kind; it doesn't work with the double-dash comment. And, most important, in a prepared statement it goes on the PREPARE, not the EXECUTE: when you execute a prepared statement the plan may already be there, and the hints are there to change the plan, so the hint belongs on the PREPARE statement. And if you have multiple statements in one command - you know that in Postgres you can run multiple SQL statements in one command - the hint is common to the whole command. That can be misleading, because you put one hint that is seen by multiple SQL statements. Those multi-statement commands are separated by semicolons, and if you are in psql you have to backslash the semicolon. So those are the little tricks; if you don't take care of them, in special cases you will see that your hint is not used, and this can help you understand why. Most of the hints you write will reference tables or indexes. Actually they don't really reference the table name but the alias, because, for example, you may query the same table multiple times in one SQL statement with different aliases, and you may want to hint them differently, so you put the alias in the hint. And this is case sensitive, even without quotes, which is a bit different from the SQL syntax, where an unquoted identifier is always lowercased - good to know if you use case-sensitive aliases. Something very important: some hints that we will see take a list of aliases, and when you have such a list there is no order - saying HashJoin(a b c) is exactly the same as (c a b). I will show that in more detail in the demo. And there is one hint, the Leading hint, that defines the order of the joins. It can be used with an unordered list, or it can be used to specify the order exactly, and then you need a lot of nested parentheses; we will see that. It's the most difficult part, hinting the join order, but it's also the most important: in a SQL query with a bad execution plan, the problem is usually the join order. Something important: for tables you reference the alias; for indexes you reference the alias and the name of the index. If you have a typo in the index name, so it's not a real index name, or simply because your DBA has since renamed the index, then the hint is not used. For example, here you see that I force an index and I still see this very high cost - that is how you can tell the hint was not taken, I will explain it later - but basically I'm forcing an index and it does a sequential scan, and that's just because I didn't use the right name for the index. If I put a hint just saying I want an index scan on that table, without telling which index, it takes an index; but if I put the wrong index name, it will not take any index at all.
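To illustrate that pitfall, here is a sketch with hypothetical names, where the second hint silently disables index access because the index name is wrong:

```sql
-- The hint references the alias, not the table name, and index names must be exact.
EXPLAIN /*+ IndexScan(o) */                     -- any index on alias "o" may be used
SELECT * FROM orders o WHERE o.customer_id = 1;

EXPLAIN /*+ IndexScan(o orders_wrong_idx) */    -- typo in the index name:
SELECT * FROM orders o WHERE o.customer_id = 1; -- no index qualifies, so you get a
                                                -- (very expensive) sequential scan
```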
That can be dangerous, so be careful: you can do worse with a hint, because here I tried to use an index, and not only am I not using my index, I'm not using any index at all. Basically, when you use hints you will probably run an EXPLAIN to be sure the plan is what you wanted. Of course, there is a way to get troubleshooting information telling you that you have a syntax error in a hint: by default only syntax errors are displayed, as INFO messages, but you can ask for more verbose output, and it can go to the log file or be sent to the client. That's quite useful if you are not sure whether your hint is used: by setting it to verbose you get more information, like: the hint is okay but the table doesn't exist, or the index doesn't exist. The most important thing to understand is that a hint doesn't force anything. People think that with a hint they will force the usage of an index - a hint doesn't force anything, it just sets a very high cost for what is not hinted. For example, if your hint says "I want a sequential scan", it sets a very high cost on index access. And you can see that in the code - actually in the code of Postgres, because it uses the same mechanism as when you set enable_indexscan to false, for example: that same very high cost is applied. Which means you may hint something that is not possible; a non-hinted path that is the only possible access path will still be used, just with a very high cost, but if there is anything cheaper it will not be used. Any questions so far? Anyone already using pg_hint_plan, or having tried it? Any problems - do you find it easy, or did you have to learn a lot? Basically, all these slides are things I had to learn while using it. At some point it must be deterministic: if you are using hints and you have the impression that it works sometimes and not other times, then you have to look into it. Yeah, a very good question - I said it's not for production. The problem is that in production you probably run thousands of SQL queries, and you may have to revisit every hint if you change the indexing, if you upgrade, whatever. So it's more that it's not easy to maintain in production. You can, for example, have one report and put hints on it, but then you know it, it's documented. This is why I said not to use it in production as a general practice, but I do use it in production in my own session: if, in my session, I want to check a statement, I can use it in production, of course. If you have a pre-production environment that is the same as production, do it there, but it's safe to use; it's more that it's difficult to maintain - the goal of a SQL database is that you don't have to define the access path for all queries. Exactly - I use it a lot for query optimization. If I see a plan I think is not optimal, first I want to be sure that my idea that there is a better plan is right, because the query planner may be wrong, but I may be wrong too. So first I force my plan to see how it behaves, and maybe I realize: okay, my plan is better, then why didn't Postgres pick my plan; or, finally, my plan is not better, and then at least I know. Okay, then I have some demos to show, mostly about what can be difficult with the join order, the join direction, and a few other things. This QR code goes to where the demo is, but I will do it on the terminal.
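For reference, the troubleshooting settings mentioned a moment ago look roughly like this; treat the exact parameter names as something to double-check against the extension's documentation rather than as gospel:

```sql
-- Sketch: make pg_hint_plan tell you what it did with your hints.
SET pg_hint_plan.debug_print = 'verbose';   -- off / on / detailed / verbose
SET pg_hint_plan.parse_messages = 'notice'; -- level used to report hint syntax errors
SET client_min_messages = 'log';            -- so the debug output reaches your client
```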
Here I'm connected to Yugabyte, just because it's installed there by default, and it works exactly the same on Postgres. I will create three tables, because when you want to play with join orders - okay, they already exist, sorry - so, table A, table B and table C; to play with join orders it really helps to be able to show all the combinations. So let me show you - you see my query below, which I just copy-pasted - I wanted to show the query without any hint first. The screen is a bit small, but that's okay: basically I join A, B and C without mentioning anything in a hint, so the query planner can pick any join method, starting with any of those tables, and here it has decided to start with C. I've put fruits as aliases, just because - you don't have to do that, but I find it easy, at least for demos; I don't do that in production, but why not. So it starts with cherry, then joins to table B, and then to A. Okay, why not. Now let's say I want to test another join order. We have the Leading hint, which lists the tables the query planner should consider in that order, so this one says: start with apple, then banana, then cherry - I want A, B, C. I've put that just in front of EXPLAIN SELECT, and what did my hint do? It changed something - it's not the same execution plan - and it actually did it: I was asking to join apple and banana, and it is joining apple and banana, and then the result is joined to cherry. The thing is that with the hint written like that, I specified the order but not the direction, so the query planner still has the choice to join apple to banana, or banana to apple. That's the first thing that's important to understand. If you want to define exactly the join order with the direction, you need those nested parentheses, which say: first join apple to banana, and then the result of that is joined to cherry. And with that, you could also put cherry the other way around, which would change the join direction for the last join. If I run this, I am sure it will join exactly in the order I want: joining apple to banana, and then the result of it to cherry. So that's the way to force the join order and direction, with full hinting using those nested parentheses. Now let's say I want to change the join method. Here it has decided on a hash join for apple to banana and then a nested loop to cherry, and let's say I want all hash joins. This is what I will use - it's a different hint. Once you have the join order and direction, you may want to decide which method is used: HashJoin, and there you have a list, and this list is not ordered. This one just says: when the join goes between apple and banana, use a hash join. And this one says: when you are joining those three tables - and because of the Leading hint I know that's apple to banana and then cherry - use a hash join. So it's just a way to identify which join you mean, by mentioning the tables involved. Here I expect an execution plan that does the same as we have seen, but with hash joins for all of them: controlling the join order, direction and methods. Now the query planner will join in this order, but may still decide to do a sequential scan or an index scan, and with hints I can force, for each table, the access path to it. Let's say I want an index-only scan when going to banana. Did it take my hint? I have the join order, it goes to cherry, but I asked for an index-only scan and it chose an index scan - and that's simply because I don't have a covering index for that, so it cannot do an index-only scan. Of course, if you ask for something that is not possible, it will not be used. If I create an index that includes all the columns and run the same query, I get an index-only scan.
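To recap the hints used in this part of the demo, here is a sketch with the fruit aliases; the join columns are made up, only the hint syntax is the point:

```sql
-- Order only: the planner may still swap which side of each join is outer/inner.
/*+ Leading(apple banana cherry) */
EXPLAIN SELECT * FROM a apple JOIN b banana ON banana.a_id = apple.id
                              JOIN c cherry ON cherry.b_id = banana.id;

-- Order and direction: join apple to banana first, then that result to cherry.
/*+ Leading(((apple banana) cherry)) */
EXPLAIN SELECT * FROM a apple JOIN b banana ON banana.a_id = apple.id
                              JOIN c cherry ON cherry.b_id = banana.id;

-- Add the join methods: the alias lists inside HashJoin() are unordered.
/*+ Leading(((apple banana) cherry))
    HashJoin(apple banana)
    HashJoin(apple banana cherry) */
EXPLAIN SELECT * FROM a apple JOIN b banana ON banana.a_id = apple.id
                              JOIN c cherry ON cherry.b_id = banana.id;
```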
So check the execution plan to be sure that what you expect is there; if it's not there, maybe the hint has a problem, or maybe you are just hinting an access path that is not possible. IndexScan and IndexOnlyScan are two different hints: it didn't fall back from index-only scan to index scan, it simply wasn't able to do an index-only scan and then chose whatever was best instead. Now, if I want to fully hint all the joins but also all the access methods, I can add that I want a sequential scan for the last table, and here - let me zoom so that we see the whole thing - with this many hints I am sure that this plan will always be this plan, because the query planner has no other choice on the join order, direction, method, or access path; as long as the plan is possible, it will always use this plan. If I put fewer hints, maybe I will see the right execution plan now, but I'm still leaving the query planner the possibility to choose another one: when statistics change, it can choose another one. So if you really want to force a plan, you want a lot of hints like that - but don't worry, I will explain how you can be sure you have enough hints. This is what I call full hinting. If you have n tables - n was three in my case, but you can have more - you need n-1 nested pairs within Leading: first join, second join, and so on. Then for each pair you define a join method - there are three join methods in Postgres: nested loop, hash join, merge join - and those hints will have from 2 to n aliases: the first one mentions two tables, the second one three tables, and so on. And then for each table, so n times, you have to define whether you do a sequential scan, an index scan or an index-only scan. With an index scan you can mention a name, the index name, and you can also use a regexp - for example if you have specific naming conventions for your indexes. If you want full hinting, to force an execution plan, you need all of that. If you look at all of it, it means that if you count the parentheses you should have six times n minus two of them in total - half closing, half opening, since they match - so with this you know whether you have probably fully hinted your query: if you don't have that number, your formula is wrong or you are missing some hints. This is for the case where, for example, you need to work around something in production: you have a very complex report joining many tables, with correlations between tables, the query planner has no clue after a few joins how to estimate the statistics correctly and cannot come up with the right plan, so you force the execution plan that you know is good. It's one report, you have documented it, but you want to be sure this plan will always be used, and not that with new data and new statistics another plan comes up - in that case you probably want to fully hint it and really force a plan. This is the most important thing, the join order. The estimation of one index access versus another - the statistics are probably sufficient for that. But the joins: the more you join, the more error the planner can make in the cost estimation, and at some point the estimation is bad and the join order comes out differently. And just to add, the join method is also very important: if the query planner thinks you are joining 3 rows to 5 rows and picks a nested loop, but you are actually joining 1 million rows to 2 million rows, then the consequence is not just that it will be 2 times slower - it can be 10 times, 100 times slower.
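A full-hinting sketch for the three demo tables (n = 3), again with made-up join columns and a hypothetical index name; counting parentheses in the hint gives 6·3 − 2 = 16 (8 opening, 8 closing), which matches the rule of thumb above:

```sql
/*+ Leading(((apple banana) cherry))      -- n-1 = 2 nested pairs: order AND direction
    HashJoin(apple banana)                -- method for the first join
    HashJoin(apple banana cherry)         -- method for the join producing all three tables
    SeqScan(apple)                        -- one scan hint per table
    IndexScan(banana banana_a_id_idx)     -- hypothetical index name (a regexp also works)
    SeqScan(cherry) */
EXPLAIN SELECT *
FROM a apple
JOIN b banana ON banana.a_id = apple.id
JOIN c cherry ON cherry.b_id = banana.id;
```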
Any question on that before going further? Which slide - the hint between two hash joins? I don't remember the slide, or the demo - yeah, so here I'm just saying that the syntax is the same without the Leading hint: it's not because you say a, b, c in the HashJoin hint that it will take them in that order, only the Leading hint defines the order. So in the demo, for example, this one - let's try. Here I do not define the first one, but the second one will be used; it is sufficient, except that the query planner can then choose something else for the part I didn't specify. This doesn't say "hash join between the three of them"; this says: when you have joined those three, the last join is a hash join, because the order is not in there. First you need to look at the Leading: I join apple to banana, and if you want to define the method for that join, you say HashJoin(apple banana) - or banana apple, same thing. Then the Leading says the result of that joins to cherry, and to define the method of that join you just mention the three aliases involved. It's just a way to say "the third join": when you are at the point where three tables are joined, that last join is a hash join; the previous one was defined by the other hint. It's a bit tricky; the most important thing is to understand that you have the Leading hint, and then for each pair a method hint that defines it. With all of them it can be very long, but the more tables you have, the higher the probability that you need hints, because the more tables you have, the worse the estimations get. Sorry - the first slide? Yeah, not this one - the extension, pg_hint_plan: if you Google for pg_hint_plan you will find it; the extension is developed by the open source team at NTT, and they use it in production. Okay, let me look at the time; I have many other things I can show you, but this was the most important part. Okay, I will show that at the end - I will go back to the demo and show the verbose thing, I think we will have time. There are a few other things that can be interesting; interrupt me if you have any questions. So with hints you can force - or try to force - the join order, the methods, whatever, but you can do things that are more clever. The number one reason for a bad execution plan is bad estimation, because the planner uses a model with statistics, it doesn't look at all the data to estimate the selectivity of the predicates, and the more complex the predicates, the more difficult it is to estimate. If you combine a lot of OR and AND, like you can find in a data warehouse for example, then the estimation is probably far from reality. But rather than forcing a specific access path, you may force a specific cardinality, and there is a Rows hint that says: when you join A to B - and this follows the same syntax, so if you have four tables the last join will be (a b c d) - just set the estimated cardinality to whatever number, and then you will see it in the execution plan. And you can do better, because if you force a hard-coded value and then your data evolves, that cannot stay accurate - but you can put a factor on it instead. So here I'm telling the query planner: okay, find the best plan, but be careful, when you join a, b, c, d you usually overestimate, so whatever estimate you come up with from the statistics, divide it by 3. I can show you a quick example - the Set hint... no, that was not this one, sorry. So here I display the cost - I had costs turned off before because the line was too wide, but now we want to look at the cost - so, sorry, where is it again, I didn't paste the right one apparently. Do I have Rows hints here? Here I'm forcing the number of rows: I'm saying when you join cherry to banana, multiply the estimate by 999, and the same for the other pair, because I do not force the join order, so I need to put something for all the possible joins. And then when you join all three, whatever the estimate, you just force 666 as the cardinality, and you can see the number of rows there: the result of all of them, 666; you really force that.
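A sketch of the Rows hint syntax with the demo aliases - the `#` form sets an absolute estimate and the `*` form applies a multiplication factor (there are also `+` and `-` forms); the numbers are just the ones from the demo and the join columns are made up:

```sql
/*+ Rows(banana cherry *999)          -- multiply the planner's estimate for this join
    Rows(apple banana *999)           -- one per possible pair, since the order isn't forced
    Rows(apple banana cherry #666) */ -- absolute row count for the three-table join
EXPLAIN SELECT *
FROM a apple
JOIN b banana ON banana.a_id = apple.id
JOIN c cherry ON cherry.b_id = banana.id;
```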
And this is more clever than forcing a plan, because the query planner will still decide on the plan, but it decides based on adjusted statistics - that can be really useful. Something else you can do with a hint is set a parameter. The query planner parameters you can set in a session, you can set in a hint, and this has the advantage of reducing the scope to one statement. Here is an example: by default Postgres doesn't do a partition-wise join, which is very nice when you join two partitioned tables partitioned on the join key - it's probably faster to join partition by partition, you use smaller temporary tables or hash tables, for example - but it's not enabled by default. If you want to set it for a specific query, you can do it with a hint, without the risk of forgetting to set it back, and without the risk that on a connection pool you've set it on a session and it sticks around. Usually, if you have a problem with only one query, you want the solution scoped to that query - you don't want to change a parameter instance-wide just because you had to fix one query, because the side effects may be worse. So this is quite useful, just setting a parameter. And for example, if I want to disable all nested loops, rather than defining hints for all the joins, I can set enable_nestloop to false in the query with a hint - that's a possibility. But what we did there was add a comment to the SQL statement, and with some applications you cannot, because the SQL is generated, or it's in an application you cannot change. For that, the pg_hint_plan extension gives you the possibility to put the hints elsewhere: not in the statement itself, but in a table that is created when you create the extension. In this table you put the SQL text, normalized - because you don't want to store it with all the parameter values - and the hints. This says: any time I run a SELECT * FROM this table WHERE a equals some value and b is a bind variable, a parameter, then add this hint to it. You can also define the application: you can set the application name for which you want the hint to apply - for reporting, say - because you want to be sure the OLTP application doesn't use it. And if you set pg_hint_plan.enable_hint_table to on, then you pay the overhead that, for every statement that is parsed, it looks at the table to see whether there are hints to add for it. You see two styles of parameter there: if you use prepared statements you put dollar placeholders, while the question mark is what pg_hint_plan substitutes for literal constants, because you can use this with prepared statements or with statements using literals. So be careful with that: it's really a dumb comparison on the text, so if the real query has an extra space, or simply a semicolon at the end, it will not match; also, if you put an EXPLAIN in front of it, it will not match, because the stored query doesn't have the EXPLAIN. So this has many limitations, but it can be a way to work around something in production.
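As a sketch of these last two features - the Set hint scoped to one statement, and the hint table; the table and query text below are hypothetical, and the hint-table schema and column names are worth checking against the documentation:

```sql
-- Set a planner parameter for this statement only.
/*+ Set(enable_partitionwise_join on) */
EXPLAIN SELECT * FROM part_a a JOIN part_b b ON b.id = a.id;

-- Or attach hints to a normalized query text instead of editing the SQL.
SET pg_hint_plan.enable_hint_table = on;
INSERT INTO hint_plan.hints (norm_query_string, application_name, hints)
VALUES ('SELECT * FROM t WHERE a = ? AND b = $1', '', 'SeqScan(t)');
-- '?' stands for literal constants, '$1' for a prepared-statement parameter;
-- the text must match the incoming query exactly after normalization.
```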
I also had the question of how to hint something that is in a view. You query a view - maybe it's a view on views on views - and the aliases are in the view. You cannot put a hint in a view, but you can create a function to replace the view. So: you cannot put the hint in the view itself; you can put a hint on the query, and it references the aliases in the view when it is rewritten, but that's not very easy. If you really want to do something like hinting inside a view, you can use a SQL function. Yeah - but you don't need to know the definition of my view: my view is probably a SELECT, a join between demo one and demo two, and that works; basically it's a layer on top. The problem is I cannot hint within the view, but the view then has all the aliases - yeah, that works too. I also got the question about partitions, because when you partition a table, each partition has its own index with a different name, so what do you reference in your hint? What I've seen - I tested it in a case where we needed it - is that you can hint with the global index name, or you can hint with the per-partition name, but that is difficult, because you have many partitions and the partition names can change. And this is where you can use the regexp: if you have a good naming convention for the indexes on your partitions, then you can use that. It is tricky, so if you use that it's probably a workaround; you probably don't want to build your whole application on those things. Okay, then we will see what I mentioned about being more verbose. Let me just check - I must show that. So what was the last query we ran? I will take another one, maybe this one. Here is what I've set in my session - when the screen is too small I cannot go to the right place to copy-paste it - so debug is off, client messages at notice, and now debug_print to verbose, I think, and log - if I'm not sure I will look at the slides, which is also why I put it in the slides. And here, if I run a query, I should get more information - of course when it's verbose it doesn't fit on the screen. What is interesting there - I've also set a lot of things in my session, I don't need all of that - is that you can see how the hints were used, which of my hints were used, what has been taken. Let's say I make a typo in one, for example here - let's invent a "HelloJoin" - so, is there an error somewhere? Hint syntax error there, and then you see that the used hints are just Leading, and that's all, because - looking back at this - I had a syntax error there, everything before it was parsed, but what is behind the syntax error is not parsed. So you can troubleshoot things there. There is also a way - do we see the query here, or do we need more verbose output? Here I have the query, and then, if I want to know what I need to insert into the pg_hint_plan hint table, I want to see the normalized query. So let me see - maybe I don't have that - okay, I expected to see something normalized here, but no. Let me see, table A, what do I have there: SELECT * FROM A WHERE N = 3 - what I want to see is the normalized one, and I don't see the normalized one, I don't know why. But basically you can also see whether it has parsed your query, so you have some information there - probably not enough. What I can say about all that is that if this extension were accepted into core Postgres, it would probably be better documented and would evolve more; for the moment it's an extension used by those who maintain it, and probably other companies, but there is not enough usage to make everything work smoothly and be user-friendly, and that's why there are some tricky things when you investigate an execution plan. It's still very useful. Basically, what I think is that one day you will need a hint - I'm saying that just because I've been doing a lot of consulting - and at some point you will need to understand an execution plan; maybe not hint the application queries, but at least connect and look at an execution plan.
If you have a problem in production, that is not the right time to compile and install an extension; if it is already installed, you will be able to use it. There is nearly no overhead - even though it has to load a library, it is very minimal, because if it's disabled it's not used. This is to say: if you install it, there are no bad consequences, and when you need it, it will be there; if you do not install it, it will not be there. Also, I don't know if I mentioned it: better to install it beforehand, but also better to look at it beforehand, because it's not when you have a problem in production that you want to figure out how it works and find my slides again to see the different syntax tricks. So it's good to use it so that you know how to use it, and it can be a great tool to experiment with and understand the query planner. Say you have read that, between two versions of Postgres, something has been improved in the selectivity estimation, for example: just check it with small tables and compare the two plans. If it doesn't take the plan you expect, force the two plans and compare the costs: if Postgres has chosen one, it is because the cost was estimated lower, but once you have the plan with its cost, you can see exactly at which point the cost was perhaps not well estimated. So it's a great tool to investigate all of that. Do you have any questions? Are you motivated to install it, at least, and test it - or maybe use it to answer some question, to verify something? It's quite easy to investigate a plan between two tables without it, because if you have two tables you have only one join, so with enable_hashjoin and enable_nestloop you can decide; but if you have many tables that you join, you don't want a parameter that disables all nested loops or enables all of them. Yeah, question? Sorry - for CTEs, does it work? Yeah, it works the same. Basically - let me come back to my demo screen, where do I have an interesting plan - what you need is to know the alias. If you have a lot of subqueries or CTEs, maybe the alias - for example if you query two tables with the same name or whatever, maybe the default alias will be different - but if you EXPLAIN, you will always have the alias in the plan. So here the EXPLAIN mentions the alias, and that is the one that will be used. So if you have a doubt about the name of the alias - if I can quickly take an example, let's say that I join, yeah, I don't know, I have to think about it - but basically there are some cases where you have a subquery and the alias name is generated by Postgres, you cannot guess it, but you can see it in the execution plan and then you can use it. With CTEs, of course, within the limits of what is possible: if you use CTEs that are materialized, that also limits the possibilities for the joins. Yeah - the question is about usage in functions. I think you can use it in SQL functions - I mean, of course you can use it in PL/pgSQL too, for the query itself; the idea of the function was really to do something similar to a view. But of course you can use it anywhere you have a SQL statement, except if the language interpreter or driver strips out the comments. I don't know of any, but it can happen that someone writes one thinking, okay, the comment is only for the coder and we don't send it along, and in that case the hint will not be used - so if there are some layers that do that, check for it. Okay, any questions? Okay - a hint that is not used because it was not possible: for example, if I hint with an index name that doesn't exist. Let me just run a SELECT * FROM A where I hint an index-only scan on A.
Hmm, there is no hint comment there - I don't know where the index went - let me just get to the example. I had a CREATE INDEX, it's there, so I will drop this index; the idea was to run the query where I hinted an index-only scan, which was not possible, and see what clue we get here. So I see "not used"... a duplicate, no... do we have any information about this one? Maybe not - we know that it was not in the used hints. Yeah, I just see that it is not listed in the used hints - oh, it says it was used, okay. Yeah, that's another limitation, because here it says it was used when maybe it should not be in the used ones. And if I create the index and run the same index-only scan, it's also listed as used, so I will check further, but I don't see a difference there. So yeah, it is an extension that works on top of the query planner, without trying to change the whole query planner; it hooks in at the points where it can to change the costs. At least I find the verbose mode really good - I've been working a lot with Oracle, which for a long time gave no information at all about a hint that was not used - so that's already very good compared to some other databases; it helps to debug, but it doesn't give you the whole truth. Okay, we're just on time. If you have any questions, I'm around anyway, and if you also have questions about distributed Postgres like YugabyteDB, I will be really happy to talk about it. Thank you very much. We have the annual PostgreSQL at SCaLE ask-me-anything session coming up in about 15 minutes in this room; we're going to have Mark Wong and Joe Conway and Jonathan Katz and Devrim Gündüz, and I might be missing someone, but there will be folks who have used Postgres a long time and can answer any questions you might have, I hope. Test, test, test. Alright, it's 5 o'clock somewhere - it's 5 o'clock here - so let's get our panel started; esteemed panelists, please take your seats. Alright, so first, welcome to the traditional ask-me-anything AMA with the Postgres community. We're fortunate to be joined by four wonderful panelists this year. First, briefly, we're going to do some introductions. I'm Jonathan Katz, I'm a member of the core team - I'm the least technically qualified person on the core team, but that's why they have me host the AMA panel - so I'm going to let these panelists introduce themselves; they'll tell a little bit about who they are, what they do, and maybe a fun fact about themselves. Hi, my name is Mark Wong. Let's see, what do I do: I help out on the United States PostgreSQL Association, sometimes I speak, sometimes I organize things, sometimes I help exhibit; once upon a time I helped do database benchmarking and system characterization stuff, and I'm from Portland, Oregon. I think you missed the fact that you're the one coordinating the Postgres presence right here at SCaLE, right? So that's also him. Okay, no one wants to claim credit for that one; we'll see what it takes. So, hi, my name is Magnus Hagander. Along with Jonathan I'm on the Postgres core team; I'm also one of the committers on the codebase, even though I don't do as much development on the code itself as I used to. I lead the Postgres consultancy efforts for a company in Scandinavia called Redpill Linpro, so I'm out of Stockholm - not Oregon, but Sweden instead - and I'm also currently the president of the board of PostgreSQL Europe, which is sort of what Mark does, except over in Europe. So, my name is not Tom Lane, I'm Devrim. I'm one of the major contributors to the project; I build RPMs for many platforms - I mean the upstream RPMs, the postgresql.org RPMs. The reason I started using Postgres was that in 1998 the MySQL RPM didn't work on Red Hat 6.0,
We have the annual PostgreSQL at SCaLE ask-me-anything session coming up in about 15 minutes in this room. We're going to have Mark Wong and Joe Conway and Jonathan Katz and Devrim Gündüz, and I might be missing someone, but there will be folks who have used Postgres a long time and can answer any questions you might have, I hope. Test, test, test. All right, it's 5 o'clock somewhere, it's 5 o'clock here, so let's get our panel started. Esteemed panelists, please take your seats.

All right, so first, welcome to the traditional ask-me-anything, the AMA, with the Postgres community. We're fortunate to be joined by four wonderful panelists this year. First, briefly, we're going to do some introductions. I'm Jonathan Katz, I'm a member of the core team. I'm the least technically qualified person on the core team, but that's why they have me host the AMA panel. So I'm going to let these panelists introduce themselves; they'll tell a little bit about who they are, what they do, and maybe a fun fact about themselves.

Hi, my name is Mark Wong. Let's see, what do I do? I help out on the United States PostgreSQL Association. Sometimes I speak, sometimes I organize things, sometimes I help exhibit. Once upon a time I helped do database benchmarking and system characterization stuff, and I'm from Portland, Oregon. I think you missed the fact that you're the one coordinating the Postgres presence right here at SCaLE, right? So that's also him. Okay, no one wants to claim credit for that one; we'll see what it takes.

So hi, my name is Magnus Hagander. Along with Jonathan I'm on the Postgres core team; I'm also one of the committers on the codebase, even though I don't do as much development on the code itself as I used to. I work leading the Postgres consultancy efforts for a company in Scandinavia called Redpill Linpro, so I'm out of Stockholm, not Oregon but Sweden instead. And I'm also currently the president of the board for PostgreSQL Europe, which is sort of what Mark does, except over in Europe.

So my name is Tom Lane... no, I'm Devrim. I'm one of the major contributors to the project; I build RPMs for many platforms, I mean the upstream RPMs, the postgresql.org RPMs. The reason I started using Postgres was that in 1998 the MySQL RPM didn't work on Red Hat 6.0, so at that time I switched to another database called Postgres, and now I'm making my living by building the RPMs.

I'm Joe Conway. I've been using Postgres since 1998, also one of the committers, and I also run the team of Postgres contributors at Amazon. I'll give you two fun facts. One: not the most complicated feature I wrote, but the one that is probably the most used, is generate_series. So if you use that, and I assume just about everyone has, that was something I wrote because someone posted to the list years ago and said it would be really cool if we could do this, and I was like, oh yeah, that's easy. So it was a small thing that turns out to be really useful. The other fun fact is that I started my career on a nuclear submarine. Oh, I have a fun fact for you: tomorrow I have a talk where I use generate_series in it.

So this is the AMA, ask me anything. We like to ask questions about topics around Postgres, open source Postgres, and it's supposed to be a light, fun session, one of the highlights of the event, because you can just ask these folks anything. Like all the LLMs out there, they know the answer to everything, and if not, they will try to answer to the best of their abilities. So in that spirit of open source, I have an open source question for you: how did you get started contributing to Postgres? Start with you, Joe.

Another fun story. My brother-in-law was asked by a friend of his to write basically a database application to track projects, this is again back in 1998, and I said oh yeah, sure, we can do that, we'll use SQL Server and Active Server Pages and we'll create this app. And the guy was like, no, we're not using any of that Microsoft stuff; we're going to use something called Red Hat Linux and a database called Postgres. And I was like, okay, never heard of either of those things, but I'm game. So that was kind of how I got started using Postgres, and I immediately started figuring out there were things it couldn't do that I wanted it to do, and I just kind of started hacking and sending stuff to the list. And I guess another fun fact: the first time I posted to the list, Jan Wieck told me that's a stupid thing to want to do. So don't feel bad if you go to the Postgres list and someone tells you that's a stupid thing.

I think it was 2000 when we had techdocs.postgresql.org, where lots of outdated documents were kept, and I used it a lot, and I contacted Justin Clift saying I want to mirror this thing and update this thing, and he introduced me to the community, like Dave and Magnus and so on.
I think in 2005 I started building my own RPMs and sharing them on my website, and suddenly people started using them. Then I realized that it would be good to contribute to the community, and that is how I actually started contributing, on both the techdocs website and the RPMs.

I think my first, actually my second, but my first larger contribution to Postgres was in the world of porting: I ported the client-side library to run on Windows, because back in those days anything you did at work had to work on Windows. I could sneak in the Postgres server running on things that weren't Windows, but people would notice if I replaced all the desktop machines for all the users with something other than Windows, so I had to port libpq, the C driver, to build and run on Windows. Technically, my very first patch in Postgres, which I was not actually credited for but you know I still claim, was to make Postgres build on Linux, because all the other developers were using BSD. So I kind of did a lot of porting stuff there, but those are the two first ones for me. Mark?

For me, once upon a time I worked at this little startup called Sequent that was, I think I can say, one of the pioneers in parallel computing, and I was doing TPC benchmarks, if folks are familiar with those database benchmarks. Then along came a company called IBM that bought up Sequent, and at the same time a little non-profit called the Open Source Development Labs, which I think everyone now knows as the Linux Foundation, worked out a deal with IBM and said, hey, we want to know if open source can really compete in this TPC arena. So I was one of the folks that went over there and, well, not implemented the stack, but took a TPC benchmark we knew and said we're going to run it up against this open source stack, and Postgres was one of the databases we decided to try out. So I dumped a whole bunch of data on one of the mailing lists and said, hey, do you guys care about how this looks when I run TPC-C and stuff on this, and a few people responded and were happy to look at the things I was sending out.

Awesome, so now we've heard your origin stories, and before I flip it over to the audience, let's fast forward to today. There's obviously been a lot of development in Postgres since the late 90s and early 2000s, and a lot of new features in there, so what's one recent feature or recent piece of work in Postgres that you really like?

I need a moment to think. Well, I've got a whole talk about new features tomorrow, that's the plug I get to do there, right, so I'll save a lot of those for the talk. The reason I picked this one is that I was going through them all the past couple of weeks, and I think it shows an interesting move: there are a bunch of optimizations that have been put in where, if you actually look at the submissions when they come in, at first it's like, why would you want that, you can just fix your query, it's easy to fix these queries to not do stupid things. But then we realized that today, loads of the queries that we run on our databases, we didn't write them ourselves, did we?
They were written by an ORM or some other query generator, and we got some interesting things there. There are two things I'd highlight in particular. There's the self-join removal patch: that's basically when you accidentally join the table to itself and you didn't really have to do that, the system will detect that and just not do the join and just select it all once. And the other one is an optimization around NOT NULL: where your columns are defined as NOT NULL and you query the table with a WHERE clause that either checks that the column is null or that it is not null, we don't actually have to evaluate that WHERE clause, and as of the upcoming Postgres version 17 we actually will not do that anymore. So I think just the general move in that direction, of trying to fix badly written queries instead of punting it to the user and telling them to fix it, is a good move in a good direction that we're seeing Postgres going.

Yes, I think we'll have researched that by my talk tomorrow; I don't know the answer right now, sorry. Thank you.

So, I'm not a hacker, I never contributed to core Postgres, but one of the patches that I liked was the removal of the money type, and it didn't go through, so that's what I can say.

So I think, top of mind for me going into 17, I just committed a built-in collation provider, and my talk tomorrow is on why that's a really good thing. Actually, I won't talk about his patch tomorrow, but my talk tomorrow will convince you that the fact that he committed this is a really good thing. You should talk about his patch; there's still time to write the talk.

So just one more thing before opening it up: Magnus, I have written queries explicitly, as a human, that have something IS NOT NULL in them, to be clear. Probably because I was an app developer.
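(For reference, a minimal sketch of the two planner improvements described above, using a hypothetical table, and assuming the behavior ships in Postgres 17 as described here.)

    -- Hypothetical table with a NOT NULL column.
    CREATE TABLE orders (
        id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        customer_id bigint NOT NULL
    );

    -- ORM-style accidental self-join: the table joined to itself on its
    -- primary key. With self-join removal, the planner can detect that the
    -- second copy adds nothing and plan a single scan instead of a join.
    EXPLAIN SELECT o1.id
    FROM orders o1
    JOIN orders o2 ON o1.id = o2.id;

    -- Redundant NULL tests on a NOT NULL column: the planner can drop the
    -- IS NOT NULL qual entirely, and prove the IS NULL variant returns no
    -- rows, instead of checking the condition for every row.
    EXPLAIN SELECT id FROM orders WHERE customer_id IS NOT NULL;
    EXPLAIN SELECT id FROM orders WHERE customer_id IS NULL;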
Mark, do you have a feature?

I do always like these new stats tables that show up, that give us some insight into more of what the database actually thinks it's doing, like the most recent one in the latest release. Just awesome.

All right, let's hear your questions. Please raise your hand and I will run over to you with the mic so you can ask your question, and there's already one in the back, making me exercise.

So, I'm Tim Stewart, by the way, and I have two questions. The first one, so let me back up a little bit: the majority of my experience is Oracle, I spent like 25 years in the Oracle space and then the last seven in the Postgres space. So the first question is on features: how exactly do you decide which features will go...

Yes, all right, so the question is on features, right: how do you decide which features are going to go into a particular release? Because when I look at it from two perspectives, there are features that maybe come from the business perspective and then from a user perspective, right, so how do you make the decision on what new features you're going to add? Thank you, Tim, for making me exercise, because I don't do enough of it. Right, who wants it?

So it's interesting, as you put it. For the Postgres side, first of all, there is not really a business perspective, right, there is no business in that regard, so we have the user perspective; it's really the only one we have. But I guess one way to look at the business perspective would be who actually builds the features, right. And in particular, as you'll notice if you go to my talk tomorrow, we don't even know yet what will be in the next version, even though we are three weeks or so from the feature freeze. We don't know. What we do is basically: whatever people contribute and manage to get enough reviewers to believe is going to be stable in time, that will get added. So you won't find a roadmap for Postgres, even within one release. When we released Postgres 16, the Postgres project had no idea what was going to go into 17. A bunch of Postgres developers knew what they were going to work on, a bunch of the companies employing these developers knew what they were going to assign their people to work on, and things like that, but as a whole project it really does go by... we have these things we call CommitFests, which is just the iterative development style we use: you get your patches in, you get them reviewed, and if they've reached enough quality they get included. There are cases where the senior developer community will basically say, we don't actually want this, like people do build things that we don't want, but that is very uncommon. So in general it really goes by: if you get it done, and you manage to get people interested enough to get reviews, then it gets in. We would never do things like, we'll postpone that to the next version because we have so many other good features in this version that it would ruin our PR. We'll put everything in, and we'll make it harder for people like Jonathan, who has to write our press releases, to decide what he's actually going to write. As long as it's ready, and as long as we believe the feature is going to be stable enough. Anyone want to add something?

For us it's generally better to have more features, because it helps the press release; I am not worried about too many features.

So you talked a little bit about the process. Let's say you do have an idea for a feature: what's a way for me to be able to take that idea and make it a reality and get it committed to Postgres?
So, this is what I see on the mailing list. One of the mistakes that some people make, as far as I can see, is that the first email contains a patch. What should be done first is: write your ideas, write the outline of what to do, I mean what problem the feature is going to solve or not, and then ask for support or ask opinions from other hackers or other people, and then start developing the patch. That would probably be much better than sending the code in the first place, unless you are a great hacker and you know what you are doing, and so on.

Yeah, I'll just add that we do have the problem sometimes that when people send in just the patch first, they are going to get the reply saying, hey, you should have discussed the architecture first, and then someone sends in the discussion with the architecture first and they get the reply, hey, but where is your proof-of-concept patch? So it does go both ways, but yes, in general, discussing the general idea first is the right approach.

I'll just add, in terms of getting something across the finish line, you not only need to get people interested enough to review it, you generally need to get one of the committers interested enough to actually carry the ball across the line. We've got a limited number of committers, around 30 committers right now, and in order for something to actually get into the codebase, one of those committers has to be willing to literally carry that ball across the finish line. And that's kind of a bar, because as a committer, if you've committed something, the rest of the committers will expect that you'll go back and fix any problems related to that thing, so you're kind of giving yourself a future task if you carry something across the line. So that's one of the bars that you still have to meet.

Now I understand why my patches have not been committed, for good reason.

Do you think, as contributors to the project, that having an issue tracker or a bug tracker would make your lives easier in any way?

The hard-hitting questions. Who wants it?

Because you asked as a contributor to the project: no. As a user of the project: yes, I do think so. As a contributor, no, it really wouldn't help me. Once I've gotten to the point that I want to contribute something, the problem isn't that I can't figure out if someone else has contributed it already, because a Google search will find that for me, right? As a user of the project it would be more useful, to go see when things are actually fixed, which version has what, all of that stuff. It would bring much more value to me in that role than it would to me as a contributor.

That reminds me of something, actually: there's an issue tracker which is called Magnus, because every time there is a bug he sends me a message on Telegram, gives me the number, and says go and fix it. Thank you.

But I've also noticed, I think there are some issue trackers in the community, I know the RPM project has one, but also extensions, because you see all of the extensions to Postgres, they're not in the core code, they might have their own trackers. Which leads me to a question, since we're on this topic: do you think that the Postgres project can better support extension developers in writing their extensions?

Well, I think certainly part of that would be things like what is planned for pgconf.dev, where we have an actual workshop scheduled for hackers, to help people who are trying to learn how to get into Postgres development, as well as a whole day on extension development. So those are two very valuable things.
It's a tricky landscape, because the amount of complexity when you start looking at the total number of extensions is pretty high. One of the things I've pointed to many times over the years, and you've probably heard me say this, is that there are other projects, like the R project, where they have a bit more automated infrastructure. As an extension developer for R, you upload something and it automatically gets smoke tested, and basically it doesn't get exposed on their CRAN, which is how they distribute the extensions, it's not visible to the world, unless it passes those basic tests: does it have a regression test, does it pass the regression test, does it have documentation, and so on. So I could see where something like that would be really valuable for the Postgres community and the extension community around Postgres. I do not want to personally volunteer to go write that.

I was actually going to get into that earlier. That's not really for creating and developing the extensions, but for distributing them: there's currently no central Postgres registry for extensions. There have been attempts throughout the years, but in my view none of them have solved the actual problems, which are, for example, integration with the RPM packaging world, with the Debian packaging world, with the Windows packaging world; you want these things to integrate with those. And combining that with an interface like the one Joe talks about, where you can only get through if you have all the tests, getting all that automated into some sort of registry would help a lot, to me. But, Joe, can you do that?

I think this will be addressed in the next few months, well, maybe years, because a few people actually started working on a new extension ecosystem, and we may hear some updates after pgconf.dev, so we'll see.

So there's a question: how many extensions are there? There's a website that was called something like a thousand Postgres extensions, or a page which shows over a thousand. I've seen estimates as high as 20,000. I don't know how true those estimates are, but it's certainly a lot, because there are the open source extensions, and then there are commercial extensions, and then there are probably people's personal extensions that we never see. I think that's one of the interesting design points about Postgres: it's designed to be extensible, you can just add functionality without having to fork the code or get these folks to commit it. Yep, yep.

With the self-join removal in mind, what interest is there in sort of creating more guardrails for PostgreSQL, and how do you go about deciding, okay, this is too controlling? One of the things I've encountered a lot of is developers accidentally creating Cartesian products with ORMs, so I'm thinking along those lines.

I have definitely created Cartesian products on my own. I know of at least one extension that will detect when you do that and throw an error, so these things can also be done in extensions. Just like the other classic: you know, you do a DELETE and you forget the WHERE clause. Who's ever not done that? And the question is more, who did that on an important production database? There's probably enough of us who've done that too, and really blessed the fact that we had rollback, but it's still pretty bad for performance. So I don't know; the one that's most often proposed is, you know, disallow DELETE without a WHERE clause, and the general project approach has been that that is not our job, especially since it can be done in an extension, like you can have an extension that basically looks over the query tree and says, this node is not allowed here.
For the simple things, you can disallow a Cartesian product; more complicated things, obviously you can do it, because the extension will have access to all the information, but it can become very complicated. But I think the general view from the project is basically that no, we leave that to extensions.

I think one of the things that's been discussed recently in the community, and I don't think it's going to get into 17, but there's been more and more discussion about things around resource management in Postgres. So when it comes to guardrails, you know, we may not disallow specific queries, but if there are guardrails around using too much memory and not having your database, you know, OOM-killed, that will help a lot, and there are other ways you can do that. I actually have an extension that can do some resource management, if you're willing to use cgroups version two, and if you want to hear about that I could probably bend your ear for hours. But I think there are a lot of possibilities for how we can do those things, and a lot of times the answer isn't to disallow Cartesian joins; I've actually written Cartesian joins on purpose, for legitimate reasons.

So, okay, going back to extensions. I think the great thing about the CommitFest stuff is that there's a lot of code review going on. But okay, maybe there are some extensions, like the Cartesian product thing, that Postgres will not accept in core, but then there are certainly thousands that in theory Postgres would accept in core. Is there a way that code review and code quality can still be maintained? Do you have some ideas how that would work? Is it all up to the extension people, or should there be some more general review from the Postgres community? Should they all be in contrib? Any thoughts on that?

They should absolutely not be in contrib, and contrib should be burned and fired and destroyed. But I think also, no, code review should not be done by the general developer community, it shouldn't be required to be done by the general community, because one of the great things about extensions is that it doesn't have to be, right, because that is a very limited resource from a contributor or reviewer perspective. If we required, somehow, however that would work, all extensions to be reviewed by, you know, a Postgres committer or whatever, then we would slow down extension development drastically, and I don't know how to do it in a reasonable way. Like, can we require, if you have one of these upcoming new registries for extensions, can you require code review? I don't even know how that would work. I think in the end that's something that comes down to each individual extension's developers, right. Look at massive extensions like PostGIS, which has many developers with lots of great code review, and then we have lots of extensions that just have one developer who spent three hours throwing together a block of Cartesian joins and, like, once it's there, it's there, right. So yeah, it would be interesting to find a way to do some sort of quality stamp on extensions that have reached a certain level, but I don't know how.

I would just add, I think some of that is a bit self-regulating. You know, something that's poorly written probably doesn't get popular in the first place; things that are poorly written become evident pretty quickly, I think. Of course, if you're not all that familiar with the Postgres community and you kind of go google and find some extension, you may not realize it.
So I get that it would be valuable if there was a way to have a vetting of these things. But I mean, in some way GitHub sort of does that, with stars and people following it, so if you can see something's popular, at least that's one indicator that maybe a lot of people are using it and therefore it's a little bit more stable. But in general it really is kind of the usage: if a lot of people are using something, it tends to be more stable, because the bugs get found and fixed; conversely, if the bugs are not getting fixed, you won't find that a lot of people are using it. So I mean, that's not a perfect answer, but as Magnus points out, we can't really funnel everything through the small number of people that are involved in the core engine development either. I don't think that's a good solution either; then they're not really extensions.

And I think, if I remember correctly, only one extension made it into core in the last 15 years, tsearch2. Do we have any other extensions that actually went into core? Yeah, maybe just two or three.

I think there are other examples, but tsearch was an extension, but it was an extension in contrib, which is this weird halfway, somewhere-in-between kind of thing. I don't know about extensions; there are other features that have existed as completely external extensions, but they have really all the time been targeted at being included, so very few people really used them as an extension. I think there were extension things around the JSON stuff before it got in, but they were more used as a way to build proof of concepts, which is another good use case for extensions, right: if you can build it as an extension, people can test it out, and then you can merge it in. So yeah, it's tricky, because the fact that extensions aren't controlled is also one of the things that lets them develop into things that nobody had thought of before.

I heard someone say it out there: doesn't the logical replication stuff count as something like a proof of concept? Well, okay, to put it that way, it was a proof of concept as an extension that was then re-implemented.

I think I need music when I'm running.

Hi. So, I know how to report bugs if I find some, I know that. However, I have lots of asks, as everybody knows, right, I have a long list of asks of what I would like to have in Postgres. So each time I talk about these asks, you know, someone like Robert Haas, who does not hate me at all, he's saying, Hettie, I have another 120 features, why should I care about what you want, and I only have six years left until retirement. Okay, I know what you're thinking, I'm not... So how can I make my voice heard? I know my voice is loud, but how can I get some of my asks more, you know, addressed?

Bribe a developer with cookies, or in this case maybe Robert. But I mean, fundamentally, it's like, yeah, fine, if you're in a position to write them yourself, write them yourself. If you're not in a position to write them yourself, then, more generically, you have to find a way to convince somebody to work on them, right. And one way is to bribe them with cookies; another one is to employ them and tell them what to do, there are those kinds of things. I think in general, I mean, that's one of the differences, right: if you're in a company, you can find a manager and tell them to work on this project, and explain to your company why your thing is more important. In the open source world you just have to convince the developers of that without having a manager to convince along the way, which may be harder or easier depending on what it is.
But in the end, the only way to get things higher up on a developer's to-do list is to either make them want it for their own sake, or to find a way to directly or indirectly fund their work. I think those are the only ways.

So I think Magnus is getting cookies. I think the other part of the answer to your question, again maybe not the perfect answer, but it's basically the Postgres community answer, is you need to show up at the major Postgres conferences, find one of the developers, and, well, get them to listen to you long enough about your idea and convince them that it's an interesting thing they should care about. I mean, seriously, I've talked to people over the years about, like, PGCon, and the value to me of going to PGCon over the years was not attending the talks, it was attending the after-talk event at the double winner, where you sit around the table with a dozen hackers and people just talk about, well, this interesting thing or that interesting thing. Over the years, so many of the features that are in Postgres today I first heard about at that table, you know, was it the Royal Oak or whatever it was called, with a bunch of hackers just sitting around for hours and hours after the conference talking about database nerd stuff. So that's the way Postgres gets done.

So you write the ask on an index card, you put a cookie on top of it, you hand it to Magnus, and there you go, that starts the process. I want to build on Joe's point, because I think this is an important part of the Postgres community, and it's not just going out to the different after-events, it's the hallway track, because a lot of discussions and ideas are talked about there. It's not just about making contributions, it's learning how Postgres works. This reminds me of a story, because I see Chris here and I'm going to put Chris on the spot. We were at PGCon one time, it was before Postgres 12 came out, and as mentioned, one of my jobs in the community is to try to take what's in the release notes and turn it into a release announcement, and there's something in there about how, oh, you know, your client can specify a certificate to do a full username verification. I'm reading this and I thought nothing of it, and, I forget how it came up, Chris and I were chatting in the hallway, and Chris was like, this is actually the first piece of two-factor authentication Postgres has, because what you can do is you can do password auth and then force the client to present a valid certificate with a username on it, and therefore there are two factors of authentication that go through. And that was the light bulb moment, because between a talk, you know, talking about upcoming features, and it's like, oh wow, this is a great security feature, and I think that made its way into the general release announcement. So that's a lot of the value of just, you know, going around talking to folks in the community, because we learn, we understand, and we can help either build new features or articulate them.
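(For reference, a minimal sketch of the combination being described, as a hypothetical pg_hba.conf entry: password authentication plus a required client certificate whose common name must match the database user name, which is roughly what the clientcert=verify-full option added around Postgres 12 provides. The host range and method shown are assumptions, not a recommendation.)

    # Hypothetical pg_hba.conf line: both factors are required. The client must
    # pass SCRAM password authentication AND present a certificate signed by a
    # trusted CA whose CN matches the user name it is connecting as.
    hostssl  all  all  0.0.0.0/0  scram-sha-256  clientcert=verify-full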
Any more questions? That was news to me today. Questions? So we heard your favorite new feature; what's your least favorite old feature?

So the question is: what is your least favorite old feature? Old feature, how old? Let's clarify: like more than 10 years old, 20 years old? More than 10.

I'd say my least favorite old feature would be the fact that there's an 8K page limitation for indexes.

Okay, I'll take the money data type then, because that is bullshit. If you're using it, stop using it, it's not good for anything. The only thing it actually does that is not solved by other things, it does badly, like destructively. Don't ever use the money data type.

So I know there's going to be a follow-up question to that, but I'm not going to be the one to ask it. I know there's a patch proposed to remove it for 17, but...

So actually I have a question. As I probably made clear today, I started as an app developer; I came to the community as an app developer who just loved working with the database. So, you know, I think a common question amongst app developers, well, maybe it should be a common question amongst app developers, is: okay, I built this app, it's using Postgres, I'm ready to go to production. What would be something you tell that app developer about running Postgres in production, like what things do they need to do?

You may think I'm joking, but my first thing would be: get a DBA for your database. I'll double down on that. What does a DBA do to my database? Well, the main thing you need to do is make sure your database is actually properly designed in the first place. Most of the performance problems I've run into over the years doing consulting, helping people with support, most of it is caused by poor database design: wrong data types, mismatched data types, poorly laid out tables, not normalized, an index on every column. Oh yeah, I've seen databases where, because of the way they designed things, they had an index on column A, an index on columns A and B, A, B and C, A, B, C and D, out to like 16 columns, right. And then you go, let's look at how many of these have actually been used, and this is like 15 years ago, but they had a 500 gig database, and I showed them that if they just dropped the indexes that had never been touched, their database would have been 100 gigs.

I'll add one more: if you insist on going to production without talking to a DBA first, at least get real backups, not just pg_dump dumps, though even if it's pg_dump, that's better than nothing at all. I don't like trying to recover databases from corrupt file systems when there are no backups at all, so please don't do that. Or do that, so that we can make more money. No, I like to make money in different ways.

Magnus brings up a good point about backups. I think a lot of people focus on availability, which of course is important, to make sure your data is always available, but you really want to have those backups for a bad day, in case of the dreaded "my user table got dropped, I need to restore it." But also, in general, it's good for other purposes too, like restoring test environments and doing work around that. And don't make them Schrödinger backups: if you haven't restored it, you don't know whether it's any good.

So we're starting to get short on time. Magnus, we have time for one more question, maybe two.

Can you expand on why you hate the 8K limitation on indexes, and can that be turned into a feature? Can you explain why you don't like the 8K page limit on indexes?

Jonathan actually recently discovered this, because when you're trying to index something that is inherently large, once that thing is larger than 8K you can no longer build an index on it, well, at least if it's a B-tree, unless you have built specific support for it. The specific example is the vector data type stuff, which is very popular these days, and they want to make the vectors bigger and bigger and bigger, so without some additional hacking, that's not possible to support right now.
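(A minimal, hypothetical illustration of that ceiling using pgvector: an index tuple has to fit in an 8 KB page, which with 4-byte floats works out to the roughly 2,000-dimension limit mentioned next. The table name and the exact error text are assumptions and depend on the pgvector version.)

    -- Hypothetical example; requires the pgvector extension.
    CREATE EXTENSION IF NOT EXISTS vector;

    -- A 3,000-dimension column is allowed as plain table storage...
    CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3000));

    -- ...but building an hnsw (or ivfflat) index on it is expected to fail,
    -- because the indexed vector must fit within the 8 KB page limit
    -- (about 2,000 dimensions at 4 bytes per float).
    CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);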
Right now pgvector uses 4-byte floats, so it can support something, I think, just shy of 2,000 dimensions. Yeah, but people are starting to push those things where they want much larger dimensional vectors, and it's not currently possible to index that without shrinking it through quantization or something.

I think we might have time for one more.

So, a Postgres user asked me the other day, and I said I'll ask the developers next week: when we have a multiple-column index and the leading column is of low cardinality but it's not included in the where clause, why don't we use the index?

Okay, so there's a clarification, I think: an index on two columns, and we only specify the where clause for the wrong column, so the index doesn't get used. If the cardinality is low enough that we could afford to use the index to fulfill the query, why don't we?

I think that's basically the feature often referred to as index skip scans, right, and I think the reason why we're not doing it is that nobody has completed the patch yet. I think I've actually seen, some years ago, a starting attempt at a patch, so I don't think there's a fundamental reason we're not doing it; I just think it hasn't been done yet.

So there's a feature going into Postgres 17 that's going to have the underpinnings to be able to do this; it's basically the prelude to skip scan. Coming soon. Maybe we'll see tomorrow at Magnus's talk. What time is your talk? Okay, very good.

All right, let's see, we could probably do one more question before I close out, if you have a burning question that you've always wanted answered by a Postgres expert, or one pretending to be one on this panel. I see Tim is going to close this out by making me run all the way... no, it's okay, I can come to you, we're full service here.

So, where do you see Postgres in five years? That's going to be my question. Very good question, thank you. All right, I've got the music going.

It'll be that much closer to dominating the database market. Almost approaching feature freeze for Postgres 22. In the next five years it will be 15 years left until what Simon said in the keynote, that we will dominate the world; so five years later, there are 15 years left for the domination.

I think as long as we continue the way we've been doing things recently, Postgres will continue to thrive and grow and be used more and more, for different use cases than it is today.

I might give a less visionary answer. I think there's been a lot of work going on in Postgres lately that's going to help with vertical scale, being able to continue to use more and more of the hardware resources available, and with horizontal and distributed scale. For vertical scale I think about the direct I/O work, being able to work directly with things at the storage level, and then I think about the logical replication work going on in terms of the horizontal level, and there are other projects as well that I have at least a glancing familiarity with. But I think that's what we're going to see: as these workloads continue to grow, Postgres will be able to rise and take on more and more of them.

And with that, we are at time. I would like to thank our esteemed group of panelists, please give them a round of applause, and I'd like to thank all of you for your wonderful questions and, of course, for making the AMA such a fun time at every single SCaLE. So I guess we'll do this again same time next year. You can visit us in the exhibit hall at the Postgres booth, and you can bring Magnus cookies. Thank you.