a single database for the entire application, plus the necessary high-availability side services. The company went ahead and paid for that, the developers built on it, and we got away with it while we were small. Because we aren't public they made me take the numbers off this slide, but I assure you they are very, very impressive: more companies, more users per company, and more usage per user, year over year over year, to the point where I have seen us process over 40,000 transactions a second on our four single Postgres nodes. This is what we were running: these big hyper-threaded boxes. We had this working, and then we decided to make the move into the cloud; part of this was a lot of things, but the important part is we decided to move. By the way, the options. The first, of course, is the always reliable pg_dump, and you are correct, you absolutely cannot do that at our size. We tested it, we threw every trick in the book at it, and it's about eight days; we cannot go down for eight days. Next, streaming replication. This is the out-of-the-box Postgres replication option. It's binary replication, so it's actually taking the bytes off of the data storage layer, and the trick is it has to be on the same operating system, otherwise the bytes don't make it, and Aurora is a completely different animal. This pushes us into logical replication. That works off WAL logs, but you don't have the ability to port WAL logs over to Aurora, so this is off the table as well. I have a co-worker named Dan Miller; I couldn't figure out what he had to do with this, or how they had met him, but once we got that cleared up we went ahead and tested it. The issue was that even with a consultant we couldn't get it running at our scale. I don't know whose fault that is, but we decided not to use it. That leaves Bucardo, which is great, but there's a catch: there is software we have to install. On the other hand, it doesn't matter where you put it, because it runs completely independently of both ends, so this is our best option. There's a make step; I did it about a hundred times over the course of testing and running this, and I think it passed without errors basically once. A sync is basically a
state of databases that need to be on the same page, made up of the tables and sequences that you have defined as all working together. Here's how it actually works: you have a table inside of your database, and immediately, once you say "this is a thing, go," Bucardo is going to create a delta table inside of your own database, and then it's going to install triggers. The first is the delta trigger: this is going to take any changes and record them. Then a notify trigger fires, which wakes the Bucardo Perl daemon, so it says "oh, time to run this thing," goes to the delta table, and grabs the primary keys; in case you haven't figured it out, yes, everything that you're going to replicate needs a primary key. It reads the changed rows from the table, pours them into a giant COPY statement, which it then sends over as a logical statement to your replica and applies there. The notify queue is happy, and after it gets confirmation that the data is in the replica, it comes back and clears those rows out of the delta table. Now, what we are going to do: we can't attempt to replicate yet, because otherwise all of our sequences are going to be a nightmare. We have to let the dump and restore finish while just tracking the changes; then we can replicate all the changes over the wire without taking the site down, which is still running on the old hardware. Bucardo, like most logical replication systems, is really cumbersome when you mess with the DDL, and this is the first problem: we don't actually have explicit control over what the back-end devs ship, and we have our major annual user conference right around the corner, which means a schema freeze, so everyone wants all their nice stuff shipped first. Everyone is going to try to be understanding, but there's a decent amount of pressure to get this done. And then we're done, right? No, because even if we get the data over, nothing runs like production, so we've got to get there, and then we have to have the option to fail back to the old hardware in case Aurora falls on its face, so we've got to flip the replication around too. Oh, and we have a service contract that requires us to be off this hardware by November 3rd.
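Bucardo's real implementation is Perl daemons plus SQL triggers, but the bookkeeping just described can be modeled in a few lines. This is a toy sketch, with invented row data, of why the delta table stays light: the trigger records only primary keys, repeated writes to one row deduplicate to a single entry, and the sync run reads each changed row once before shipping it in one COPY-style batch.

```python
from collections import OrderedDict

# Toy model of the trigger/delta mechanism described above.
delta_table = []  # each entry: (primary_key, timestamp), appended by the "trigger"

def on_write(pk, ts):
    """Stand-in for the delta trigger: record which row changed."""
    delta_table.append((pk, ts))

def build_copy_batch(source_rows):
    """Stand-in for a sync run: dedupe PKs, then read each current row once."""
    seen = OrderedDict()
    for pk, ts in delta_table:
        seen[pk] = ts                      # later writes win; one entry per row
    return [source_rows[pk] for pk in seen]  # rows shipped in one batch

source = {1: "row-1-v3", 2: "row-2-v1"}
on_write(1, 100); on_write(2, 101); on_write(1, 102)  # row 1 changed twice
batch = build_copy_batch(source)
print(len(delta_table), "tracked changes,", len(batch), "rows shipped")
```

Three writes collapse into two shipped rows; this is the deduplication the speaker credits later for keeping the delta tables small.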
This is all decided officially. We had done some testing with Bucardo, so it wasn't a completely new concept, but we've got to move, and we have to make sure that none of our customers, none of our support staff, none of our sales engineers notice. There's more: one of the devs on my team comes up with a brilliant idea. His team has been converting ID fields from integer to bigint in several tables that are closing in on the magic 2.14 billion limit, and since we're going to be taking downtime anyway, it's time to do the bigint migrations at the same time. We have eight business days until the first attempt. So this is what we decided to do, and how we were going to change the engines on a jumbo jet mid-flight. First, Bucardo does have an option called one-time copy, where it will actually ship everything in your database over, but instead we are going to use pg_dump for the initial blob, because it allows us to use the directory format, which is just parallel dump and restore. This is going to get us into AWS, and specifically onto a massive EC2 instance. From there, we are going to cut the application over after we switch and have the whole app talk to Aurora; since during that downtime we will have a perfect copy, we don't have to start over. A couple of notes about this. The directory format is what makes the bigint part workable: in order to get our bigint migration done, you restore the schema first with the new types, then you do a data-only pass to put the data in, and then post-data rebuilds the indexes and constraints with rechecking. We were able to test this theory; we actually did it with an integration microservice that we have in production, moved it into AWS, and it works. The biggest catch is the permissions: you just edit a line in the Bucardo.pm file, and then it knows that it doesn't have permission to set replica mode on the target. One note within this itself: we are taking a bigint and putting it into an integer on the way back, and if you have a value that's too big you're going to error, but we did know that we had probably three to six months before any of our tables would generate one, and we will do our own checking, so you just ignore it and do it.
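A hedged sketch of the dump-and-restore shape being described; the host names, database name, table, and job counts here are placeholders for illustration, not the actual commands from this migration:

```shell
# Parallel dump in directory format from the old server (placeholder names)
pg_dump -h old-db -d appdb -Fd -j 8 -f /dumps/appdb

# 1. Restore schema only (pre-data: table definitions) on the new server
pg_restore -h new-db -d appdb --section=pre-data /dumps/appdb

# 2. Apply integer -> bigint conversions before any data arrives
psql -h new-db -d appdb -c 'ALTER TABLE parent ALTER COLUMN id TYPE bigint'

# 3. Load data in parallel, then build indexes and constraints last
pg_restore -h new-db -d appdb --section=data -j 8 /dumps/appdb
pg_restore -h new-db -d appdb --section=post-data -j 8 /dumps/appdb
```

The point of the three `--section` passes is exactly the trick in the talk: there is a window between pre-data and data where you can alter column types, so the bigint migration rides along with the copy for free.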
All of that, mind you, is from a different time. We're wracking our brains on how to test this much, how to simulate the amount of load, and the answer is: we're just going to schedule a downtime and see what happens, and if it breaks, we'll fix it. So it's Friday night, and we're doing what everyone wants to do on a Friday night: sitting in a room having a downtime migration. We put the maintenance page up on all sides of the app, we kick off the dump, we install the triggers, we get all of Bucardo set up, and everything's ready to go, except one piece fails under every combination of options we try. But we make a decision to go with it anyway: the triggers are on the tables, they are successfully tracking our diff, and we aren't going to kick off the sync for another week, so that's the only part we have to fix. We take down the maintenance page and work through the rest of the evening till about midnight. As soon as we wake up we're working on it, all Saturday, until we decide to actually go through the log visually, as you should, and there it is: an array of integers. At some point in our distant past, someone thought the best way to deal with a parent record was an array of its children's primary keys. Side note: do not ever do this, it is a terrible idea, and I can tell you there is absolutely no way this would get shipped today, but we had it buried way back. Switching the order of operations is what we had to do instead, so that Bucardo goes to the delta table and grabs the primary keys first. Then it is Monday morning and we get more bad news: the attempted initial copy keeps dying because it is running such a long time. We kick it back off and start digging into all the different options. We start playing with the batch sizes that Bucardo is actually sending, and we're looking at different ways of shrinking the diff; realistically, though, we can't come up with anything. Bucardo doesn't delete the diff until it has successfully posted to the replica, so we aren't completely sunk; we haven't lost track of the changes, we just can't ship them. So Tuesday night, four days before go time, we get on a phone call and come up with a plan: we're going to redistribute the 700-odd tables that are currently in one sync into seven different syncs.
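One way to think about that redistribution, sketched here with made-up table names and row counts rather than the real 700-table list, is a greedy split: place the biggest pending diffs first, each into whichever sync currently carries the least backlog, so no single sync has to ship most of the changes.

```python
# Hypothetical sketch of splitting tables across N syncs by pending diff size.
def split_into_syncs(tables, n_syncs):
    """tables: list of (name, pending_rows); returns n_syncs buckets."""
    syncs = [{"tables": [], "rows": 0} for _ in range(n_syncs)]
    for name, rows in sorted(tables, key=lambda t: -t[1]):   # biggest first
        target = min(syncs, key=lambda s: s["rows"])         # least-loaded sync
        target["tables"].append(name)
        target["rows"] += rows
    return syncs

tables = [("orders", 1_800_000), ("users", 40_000), ("events", 300_000),
          ("audit", 250_000), ("tags", 5_000)]
for i, s in enumerate(split_into_syncs(tables, 3)):
    print("sync", i, s["rows"], s["tables"])
```

With these invented numbers, the 1.8-million-row table ends up alone in its own sync, which mirrors what happened in the talk: the split helps every sync except the one carrying the monster diff.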
The hope is that the smaller per-sync diffs will make it across the wire, and luckily you can redistribute tables this way and they'll still use the same delta tables, so we deploy this in like an hour. It was a good idea, but the diff for that one table was 1.8 million inserts. I will tell you, the split did work: three of our syncs have actually finished and the others are on their way, so we're very excited. Except now that notify fires every single time you write, and during our high-traffic times that is a very large number of writes. I didn't know this, but within Postgres the notify queue is a fixed size; we actually maxed it out, and because every notify has to go through that queue, we actually took ourselves down. We did figure this one out, and it's relatively easy: you drop those notify triggers, and then you have to figure out some way to kick the syncs yourself; we tried a number of different cron jobs. The other thing we did for the later syncs is separate out the tables that don't get written very often from the active tables that regularly needed kicking. So we got the syncs taken care of, and we're able to schedule the switchover for Sunday. This is us all set for the big switchover Saturday night. Basically, we're going to take the site down and just debug until you can make it to the website. There are other people doing other things; this is basically most or all of the domain experts in the company helping us, but I'm only here to talk about what I care about: replication. We tear down the existing syncs once the replicas have the most recent information, and then we go ahead and set up the reverse replication. Everything's happy, we're golden: guys, go to work. I'm basically just watching all their test data, because we need to make sure it's tracking, and it is, it's working, we've debugged it, it's going great. Then a sync dies. I check the log to see what happened, and while I sit there looking at it, the cron re-kick of the sync actually picks it up and it works. So I'm watching, and a different one dies, guys, and this one the re-kick doesn't pick up. So I go ahead, I basically re-validate the validation, and pull the logs up. Oh, one thing really
quick, because it's a gotcha with Bucardo: when you run "bucardo validate" on a sync, it will say "Validation done!" with an exclamation point. That doesn't mean it succeeded; that means the process ended. You have to go to the logs to find out if your validation actually succeeded. That was a real pain in the butt at first, but the logs, they were good. So, what is the default reset setting on your VPN? I spent four days basically watching all of these syncs, 18 hours a day, restarting the ones that wouldn't keep going. We did eventually get this figured out, and it was the reason that our big sync wasn't succeeding; it was well over 500 gigs by itself, and the VPN would reset the connection out from under it. Also, I had a friend helping me debug this, and he was syncing an Elasticsearch cluster over the same link simultaneously, so this definitely was not going to work. We eventually set that to "hey, why don't you just never reset," or we'll do it once a week, and got through without irreparable damage. Basically, we were able to get over to Aurora. Aurora didn't fall on its face immediately, we were able to take down the replication, and we released the schema freeze; they had one week to get this stuff in before we locked it again for the conference. Amazingly, we had only the one downtime. So of course the first question is: how do we like our new home? Luckily this talk is not titled "There and Back Again," so it hasn't been the worst. The extra connections that we were able to get out of Aurora, as opposed to on hardware, did actually cause a slight drop in per-operation database performance, but overall we are glad we made the move; we are there. Really quick, the shameless plug: this legitimately is an amazing place to work, and we are hiring. If you liked or hated this talk, this is how to let me know; I'm not enough on social media to give you those links, but I would love to answer questions, or, if you never want to hear me speak again, the cross-examination begins now. Question: two questions related to the timestamp table. You said that every table must have a timestamp table? Answer: yes, that's a
requirement, and they must have a primary key also. Yeah, so the timestamp is in the Bucardo delta table, which Bucardo creates for you, so you don't have to deal with that; but every table does need a primary key, because that's what Bucardo actually works off of. That delta table does not have any information for tables without a primary key, so the replication will not be correct; you have to put some kind of surrogate key on there. We had 12 tables for which we had to come up with some kind of primary key. It's very normal: sometimes you have a lookup table for something like this, and it does not have a primary key, and that's what all but one of ours were, lookup tables, so we just had to declare that the combination of columns is unique, and that it is a primary key. It does allow for composite primary keys, so you can definitely just do that, but yeah, somehow you're going to need one. Question: okay, so since you are creating another table for the timestamps, how much does your space usage increase because of those tables, since every table must have the equivalent? Answer: absolutely, so it is a bump in your usage, but because it doesn't actually track the data, just your primary key and timestamp, they're pretty lightweight. If you're changing lots of different rows, they'll take up a decent amount of space; if you're changing the same row over and over and over again, the deduplication will keep them really small. And I will tell you this: when Bucardo is running normally, those delta tables don't get enormous, because it's constantly siphoning rows out of there; it's only when things back up, like during our bug hunt, that they grow. Question: my question is, in retrospect, if you had more time, what other alternatives would you consider? You said you ran this off of a standby, so do you set up Bucardo on the primary, with all the tables and stuff replicated over? Answer: so, Bucardo has to read from somewhere, and we wanted to throw a bunch of extra cores at it, and we knew that the dump was going to take about three days, so while the replica was out of rotation we dumped from it directly, which is very tricky as far as
getting the right blocks on the right tables at the right time, but we did it by taking one of our replicas. Question: and you had to know kind of what LSN that was at? Answer: that's exactly right. Question: okay, thanks. So did you have some sort of rollback plan if things had gone bad? Because as far as I know, you can't use Bucardo to replicate out of Aurora. Answer: you can; that's what we were doing. When we tore the sync down, we then had Bucardo go and install those triggers on the Aurora side, and then it was tracking the diff and replicating back into the old servers. That's the amazing thing about Bucardo, and why we ended up going with it: we ran it on that one Bucardo server, but anything it can connect to via psql, it can operate with. All we had to do was cut off the replication in one direction and set it up in the other. Question: cool, thanks. I wanted to move off of RDS because of the bugs, which had been quite fun, and because we were paying more for RDS than the person paying the bill really wanted to, and we were also not able to install one extension that they wanted; I forget which one it was. We experimented with the Crunchy operator first, but it's really more complicated than we needed for a database of that size, and so we went ahead and deployed on Patroni, which is where it's running right now. What we were looking for, and what Patroni is, is what we call an 80% solution, as in something that covers 80% of the instances out there: something that works for small databases, that gives you HA out of the box, and that is by default availability-first. That's an important distinction, and it could be the deciding one: if you're running Patroni with the default config, the logic that it goes through will sacrifice consistency, in the form of throwing away transactions on a failed master, rather than having the system be down; it preserves availability over consistency. Now, there are ways you can tweak that in the configuration, and I won't really get into the depths, but know that the default
configuration is availability-first. It integrates well with Kubernetes-based load balancing and, for that matter, is AWS-friendly; Zalando started on AWS, and a lot of people still run in an AWS environment even if they're running Kubernetes, and therefore we want to be able to plug into that if that's what people are using. So, I said I would define some terminology here. Patroni is a high-availability daemon written in Python, and Spilo is something out of the Zalando team: a container image containing Patroni, Postgres, and an assortment of other utilities that they use in their production environment. You probably won't use Spilo out of the box; what you do instead is adapt Spilo to your environment, but they really have done a good job of integrating the other tools, so you may use it as an example even if you don't actually use their particular stack. What I mean by a daemon is that we have Patroni as this process that's running, and it controls Postgres: starting, stopping, and replication. Within a containerized environment like Kubernetes, it is what we call process one in the container; the process that is called when you run the container is Patroni, not Postgres. Patroni starts up Postgres as a subprocess of itself, rather than Postgres starting up on its own, and that is important for automating failover, particularly because Patroni has to be able to shut down an individual Postgres instance if it has ceased to be part of the replication system. It also supplies a management API, because again, we're going all DevOps here, configuration over manual management, and therefore we must have an API to do things on individual Postgres instances that does not require us to exec into the containers, because that's not even possible in a lot of environments. The basic idea is that we've got this Patroni daemon as our driver on each individual Postgres instance.
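The "process one owns Postgres as a subprocess" idea can be illustrated with a tiny supervisor sketch. This is not Patroni's actual code; it uses a placeholder child command instead of Postgres, and the class and method names are invented. The point is only that because the supervisor spawned the child, it can also stop it on demand, which is what an HA daemon must do to a deposed primary.

```python
import subprocess
import sys

# Toy sketch of the "daemon as process one" idea described above.
class Supervisor:
    def __init__(self, cmd):
        # The supervisor starts the database as ITS child process.
        self.proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)

    def demote(self):
        """Stop the child, as an HA daemon would stop a deposed primary."""
        self.proc.terminate()
        self.proc.wait()
        return self.proc.returncode

# Placeholder child: a Python one-liner standing in for a long-running postgres
sup = Supervisor([sys.executable, "-c", "import time; time.sleep(60)"])
print("child pid:", sup.proc.pid)
rc = sup.demote()
print("child stopped, return code:", rc)
```

If Postgres were instead started by init or by its own entrypoint, the HA logic would have no reliable handle on it; owning the subprocess is what makes forced shutdown, and therefore safe failover, possible.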
Overall, this is all part of the goal that we have of simplicity, usability, and availability as our three primary design goals. Part of this, and this is where it gets different from solutions like the Crunchy operator, like Stolon, and a couple of other things that I've seen, is we want a system where you only need one container image: that one container image contains everything you need for a Postgres node, and then you deploy however many of those you need to scale. There are reasons to have more complicated setups, but there are a lot of times when simplicity is its own value. So we've got the one container image; we want to rely as much as possible on the infrastructure that Kubernetes supplies so that we don't have to re-implement it; and we want a set of defaults that just work for the majority of cases, where people are dealing with simple situations: I have a single Rails app, it needs a Postgres database behind it, the database is a gigabyte in size, I want to deploy it on my OpenShift cloud or whatever, and then I want to not worry about it. That will work for that person, and people who have more complicated desires can then tweak the behavior. For usability, we have the management API, which I'll show you; you can configure stuff; and Patroni itself has a plugin infrastructure, so you can write your own plugins for things like continuous backup and put them in there, plus several already exist within the Patroni repo. And importantly, compared to hosted solutions like RDS, there aren't any limitations on what you can run in Postgres: you want to run an alpha version, you want to install a bunch of bizarre extensions, you want to actually run TimescaleDB instead of plain Postgres, you want to do a whole bunch of other things? Knock yourself out. It is just a matter of what's built into the image, and as long as it understands the Postgres signals and uses the same replication protocol, we're fine.
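As a hedged illustration of that management API: Patroni exposes a small REST interface (by default on port 8008) with health-style endpoints that answer by HTTP status code. The host names below are placeholders for your pod addresses, and exact endpoint availability varies by Patroni version, so treat this as a sketch rather than the talk's exact commands:

```shell
# Returns 200 only if this node is the current leader
# (useful for wiring up load balancers without exec'ing into the container)
curl -s -o /dev/null -w '%{http_code}\n' http://patroni-demo-0:8008/master

# Returns 200 only if this node is a healthy replica
curl -s -o /dev/null -w '%{http_code}\n' http://patroni-demo-0:8008/replica

# Cluster overview from the bundled CLI
patronictl list
```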
And of course, for availability, there is fully automated replication and failover. So, I did say there would be a little bit of talking and then a bunch of demos, so now we're at the demo portion. What I'm going to demo first is a manual deployment using a Kubernetes deployment file. This is what you would want to do if you only have one or two databases, because it's honestly just easier in terms of setup. Then I'm going to show you the thing that you would want to do if you, like Zalando, have 200 databases, and therefore need things to be automated from a deployment standpoint, not just from an availability standpoint. So, the way that this deployment works, and here's a bunch of Kubernetes keywords... actually, before I get to that: how many people have actually used Kubernetes for anything before? Okay, that's a whole bunch of people, great, so most people are familiar with some of these keywords. What's going to happen in the manual deployment is it's going to create what's called a StatefulSet, which is what you put things like databases on in Kubernetes, with multiple nodes; it'll manually create endpoints; it'll create a service so that you can actually connect to Postgres; it'll create a secret so that you have the Postgres passwords; and it'll create a service account so that you can access Postgres with security defined for it. So, let me show you. Kubernetes is going to deploy a bunch of those, and then what's going to happen is all of those nodes are going to communicate back to the Kubernetes API and hold a leader election, because the first principle is that you don't want to have to indicate which one is your initial primary; they are going to hold a leader election and decide which one is going to be the primary, and one of them is going to win
that leader election; that one is going to become the primary, and the other ones will become replicas and start replication from that primary. So let's actually show you some of that... that's the wrong way around; I'm going to show you the files first and then we're going to run it, because the running part is very quick. If you look in the tutorial you'll see the same file that I'm showing you here; it's a little bit of an expansion on the example supplied in the Patroni repo itself. Basically, here are the things we're setting up. We've got a service here that allows us to connect to the database. Now, one of the things that we do, for those of you familiar with Kubernetes: you're probably not familiar with manually setting up endpoints, because normally endpoints get automatically created when you create a service. However, in the case of Patroni we actually manipulate the endpoint during failover, because we need to make sure that it isn't possible for other services to connect to the old master, and so we manually create an endpoint, which is not something you'll see in a lot of Kubernetes examples. Then we create another service, and this service is for connecting to the current master only, because when you are connecting applications to a primary-replica cluster, the things that need to write have to be able to connect to whoever the current master is, while the things that are read-only can load-balance among all of the nodes, and we want to take advantage of that. I mean, my first deployment of Patroni was actually on a system where they had 33 read replicas, because they were basically using Postgres as their caching layer, so read load-balancing was the vast majority of their workload, and we wanted this to work for that, and it does. So we have a service for just the master endpoint, and then here's the StatefulSet, with all the usual things.
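A trimmed sketch of the two-Services shape just described, patterned on the Patroni Kubernetes examples; the names, labels, and ports here are placeholders, not the file from the talk:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: patroni-demo            # read service: load-balances across all pods
  labels: {application: patroni, cluster-name: patroni-demo}
spec:
  ports: [{port: 5432, targetPort: 5432}]
  selector: {application: patroni, cluster-name: patroni-demo}
---
apiVersion: v1
kind: Service
metadata:
  name: patroni-demo-master     # write service: points at the leader only
  labels: {application: patroni, cluster-name: patroni-demo}
spec:
  ports: [{port: 5432, targetPort: 5432}]
  # note: no selector here; the matching endpoint is created and repointed
  # by Patroni itself during failover, which is why it is set up manually
```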
And of course here we're pulling in all of the secrets, as well as setting some labels. One of the important labels that we're setting here is the cluster name, and it's important to set the cluster-name label because the way this works is you create some number of nodes, and at that point the individual Patroni daemons are autonomous, except that they have a cluster name. By having that cluster name they then connect to the Kubernetes API, or, if you're running outside of Kubernetes, whatever other discovery service you have, and they say: hi, I'm a member of cluster patroni-demo; is there a master I can connect to? If there's not a master to connect to, can I take part in a leader election? So that cluster name determines which cluster you're part of, which is important if you're running multiple clusters on your cloud, plus we reuse it a bunch elsewhere. You define the ports that you're connecting to and everything else, and then of course here we have the secrets and the passwords and a whole bunch of RBAC stuff, which I'm not going to get into because this is not a tutorial on RBAC. So let's go ahead and do that; I'm doing this locally on my computer in Minikube. Okay, and that creates all of the things; let me shrink that a little bit. What you're seeing here is that we're starting all of these up... and I'm going to exit the demo shell because it's clearly having problems, so let me... okay, I did something wrong in my teardown running through this, so let's diagnose that and you'll get a little bit extra; sorry about that, I forgot that I lowercased it. So you can see we started up a cluster automatically, and you get a master and replicas. Now, using Kubernetes with a StatefulSet, the initial master is always going to be patroni-demo-0, because StatefulSets always start up their nodes in order, so the only way that wouldn't be the initial master is if there was some
sort of halt in it starting up, and another node actually completed the startup sequence first; but usually this is what's going on. And you're going to see that we get log output from each of these. The log output that we get from the current master says: hey, the leader lock owner is patroni-demo-0; I'm patroni-demo-0; I'm not going to take any action because I'm the current leader. The output that we get from any of the replicas is going to be: hey, patroni-demo-0 is the current master; I'm patroni-demo-1; I don't have a lock; I'm a secondary and I'm following the leader. Now, if there were some kind of problem with the cluster you'd get a bunch more output about what it was trying to do, so any time you bring this up and you don't immediately see a live cluster with a primary and everybody else as replicas, the thing to do is look at these logs first, because we supply a fair amount of output about what the failure could be. And if you look at what we've got here for the pod, you can, among other things, see all of the configuration that we've set up for it as part of the deployment. Now, some of the important metadata here goes in what Kubernetes calls a config map, and there's one config map that holds any non-default configuration. The only non-default configuration I'm currently using is that I have pg_rewind turned on. pg_rewind, for those who aren't familiar with it, is a tool that allows you to take a former master that might have been slightly ahead of the current timeline and roll back its transactions so that it can rejoin the replication cluster, which you might do after a situation where the network went down and things came back up unevenly, so you didn't necessarily get the furthest-ahead replica as the new primary. Whether or not you turn that on is one of those decisions of consistency versus availability that you have to make all the time when you're setting up this kind of system.
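For reference, that one non-default setting lives under Patroni's bootstrap configuration; this fragment shows the key as documented by Patroni, though the surrounding structure here is a minimal sketch rather than the speaker's full config:

```yaml
bootstrap:
  dcs:
    postgresql:
      use_pg_rewind: true   # rewind a deposed master so it can rejoin the cluster
```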
If you set a whole bunch of non-default configuration options, you'll see all of them in there. Then we have a simple config map that just contains, informationally, who the current leader is, that the leader is patroni-demo-1 and that sort of thing, for cases where you're troubleshooting. Oh, and also a timestamp, because effectively we have a TTL here: if the leader hasn't updated its timestamp in a certain amount of time, then it is considered AWOL and we start a failover procedure, because if it hasn't updated in a certain amount of time, that means it can't reach the Kubernetes API, and if it can't reach the Kubernetes API, then it's likely nobody can reach it. So, speaking of failover procedures, let's talk about how that works, and then let's do it. Failover time. Obviously, part of the whole reason you're doing this is for automated HA, right? So what happens: you lose the node that is the current primary; that's gone. After a certain amount of time, the Patroni daemons on the other nodes are going to notice that the master has not updated its timestamp and is therefore not communicating. They will communicate with the Kubernetes API, and we will have another leader election. This leader election happens in two stages. First, one of them grabs the leader lock. The one that grabs the leader lock then checks to make sure that it is the furthest-ahead replica (by furthest ahead I mean it has replayed the most transactions) or that it's tied for first, and if so it will become the leader. If it is not the furthest-ahead replica, it will release the leader lock so that another replica can grab it. But assuming it wins, we have a new primary that gets promoted, and the existing replicas re-master so that they can replicate from the new primary.
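The two-stage election just described can be sketched as a toy model. This is not Patroni's implementation; node names, LSN numbers, and the dict-based "lock" are invented to show only the control flow: grab the lock, keep it only if no other replica is further ahead, otherwise release it so a better candidate can win.

```python
# Toy model of the two-stage leader election described above.
def try_acquire(lock, node):
    """Stage 1: atomically grab the leader lock if it is free."""
    if lock["holder"] is None:
        lock["holder"] = node
        return True
    return False

def confirm_or_release(lock, node, lsns):
    """Stage 2: keep the lock only if tied for the furthest-ahead position."""
    if lsns[node] >= max(lsns.values()):
        return True            # this node becomes the new primary
    lock["holder"] = None      # release so a further-ahead replica can win
    return False

lsns = {"patroni-demo-1": 1000, "patroni-demo-2": 1042}  # invented positions
lock = {"holder": None}

try_acquire(lock, "patroni-demo-1")
print(confirm_or_release(lock, "patroni-demo-1", lsns))  # behind: releases the lock
try_acquire(lock, "patroni-demo-2")
print(confirm_or_release(lock, "patroni-demo-2", lsns))  # furthest ahead: keeps it
```

The release step in stage 2 is the interesting design choice: it trades a slightly longer election for a guarantee that the winner loses as few transactions as possible.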
In the meantime, Kubernetes will notice that, hey, we're supposed to have three nodes in the StatefulSet and we only have two, so it will start up a new pod in that StatefulSet to replace the failed one, and that new pod, because it's still a member of the same cluster, will notice that a master already exists in the cluster, and it will begin replication from that master. One thing to note here, by the way: the lock that we're using is not some separate lock object; it is the actual endpoint for the master connection. The reason we use this as the lock is twofold. Number one, Kubernetes allows us to do relatively atomic operations on that endpoint definition, as opposed to something like a config map, where it tries to merge changes, which would actually not be a good thing in the middle of a leader election. The second reason is that this endpoint also determines how traffic gets routed to the master, so it means we can change who the master is and how traffic gets routed as a single action, which is important for preventing race conditions. So let's go ahead and trigger a failover. Right now I'm running this in Minikube on my laptop, so I can't actually kill off a host machine, because it's all running on my laptop; instead, what we're going to do is just kill a pod. That was kind of hard to catch, but what you can see here is that we're getting a mixture of Postgres's own output about the failure of replication, because the master went away, so replication is not working anymore, and then it immediately goes into action: first it can't connect to the master anymore, then we see the leader election, it promotes itself, it notices that we're missing a replication slot, and that slot gets created (there's the created message), and now patroni-demo-1 is our new primary. Now, obviously
in a real network, where these different things might be on machines a couple of racks apart, that can take longer than it just did on my laptop, but in the production cases where this was deployed, we were able to set an SLA of 10 seconds for the failover and adhere to it, by tweaking the settings correctly. There are a bunch of settings associated with the various timeouts that you have to set carefully, because if you set a timeout too low, you can get false failovers just because somebody is having network sluggishness. So that is our failover. Now let me talk about a different way to deploy all of this. Within the world of Kubernetes we have these things called operators, and this is the same sense in which we use the term "system operator" or "bus operator": someone who drives the thing, who controls the thing, only in this case it is a program doing it and not a person. The reason you would do this is because you actually are going to have, you know, five or fifteen or two hundred different little Postgres clusters running on your cloud infrastructure, and for that reason not just their availability but their deployment needs to be fully automated, fully automated in a way that is controllable by templates, through GitOps or through a configuration management system, instead of needing somebody to log in and send Kubernetes commands directly. The operator I'm going to be showing off is the one that Zalando uses for their production infrastructure. Like I said, you're probably not going to want to use it exactly out of the box the way it is, but you can go ahead and look at their operator; mainly the thing you'd be tweaking is what's in the Spilo container itself, rather than necessarily tweaking the operator code. The basic way it works is: we install the operator, which is something you would only do once per
Then you create a manifest for each Postgres cluster you want to deploy, and if you need to modify the Postgres cluster, you do it by modifying the manifest. So when we're looking at the code here: we're going to actually create the operator, which is done with a bunch of deployment steps in Kubernetes just like anything else — creating the operator, creating the ConfigMap — there's a bunch of different steps for actually getting the operator onto the system. Let's go ahead and do that.

Okay, so we create the ConfigMap, and then, like everything else in our newly secured world, we have to create the RBAC service account. Again, role-based access control is mandatory for Kubernetes, because we don't want people running insecure clusters out there, but it does mean you actually have to create permissions to say, hey, this operator is allowed to create clusters, StatefulSets, etc. Then let's create the operator itself, and the CRD. Let's see — what does CRD stand for? It's a Kubernetes extension — a custom resource — that's it, a custom resource definition. It's such a goofy term that I can never remember it, but yes, you're right, it's a Kubernetes extension. If you actually do this on your own machine, this step will take a while, because there are several containers that need to be downloaded; in my case I had already cached the containers, which is why it was quick. So we've got the CRD. I ran through this before we gave the presentation, and those pods are done but not actually cleaned up, so we're going to go ahead and delete them.

In the meantime, let's actually talk about that manifest. When you go to deploy a cluster, what you're doing is: you have this manifest, and you give the cluster a name — same thing as with Patroni, the Postgres cluster name determines our coherent replication cluster. Because we're talking about a large infrastructure here, we also have namespacing, so there's actually a team ID which forms the first level of the namespace for all of the clusters. Where you would use this is if you have different development teams, each of which works on a different set of applications; each team has its own namespace, and that way they don't have to worry about name collisions between Postgres clusters. We need to create a persistent volume, because of course Postgres needs somewhere to store its data, so you create that persistent volume and tell it how big it's going to be — here we're assuming it's a very small database, so we'll need persistent volumes that are one gigabyte in size. Then the number of instances, which is the number of Postgres replicas we're going to have; then we create some database users, with definitions right here — so we have the zalando user, who is a superuser with CREATEDB powers, and you can create other users, in this case an application user with no special permissions. Then we go ahead and create some databases — the database that we are creating and its owner, so the application database is called foo with that owner — and then what Postgres version we're using, which in this case is version 10. So that's the sort of manifest needed to deploy a new Postgres cluster.

Let's go ahead and do that. Oh, there we go — it just took a second for the leader election there. So we're going through the leader election again, we've got a master, and pretty soon these two will report back that they are replicas. So: same thing, different deployment mechanism. Like I said, this is for the case where you actually need to do a bunch of them, and therefore need to be able to handle having a whole bunch of clusters, sort of rubber-stamping them out with a template.
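To make that walkthrough concrete, a minimal manifest along the lines just described might look like this. This is a sketch patterned on the Zalando operator's example manifest; the names and sizes here are illustrative, not taken from the demo:

```yaml
# hypothetical minimal cluster manifest for the Zalando postgres-operator
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: acid-minimal-cluster   # cluster name; the team id forms the prefix
spec:
  teamId: "acid"               # first level of the namespace
  volume:
    size: 1Gi                  # persistent volume per instance
  numberOfInstances: 2         # number of postgres pods (master + replicas)
  users:
    zalando:                   # superuser with createdb powers
      - superuser
      - createdb
    app_user: []               # application user, no special permissions
  databases:
    foo: zalando               # database name: owner
  postgresql:
    version: "10"
```

Modifying the cluster later means editing this manifest and re-applying it, rather than sending Kubernetes commands by hand.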
Now, that's fine for deployment, but obviously our applications need to be able to connect to Postgres in order to actually do anything. The way I set this up in the demo is the way I generally end up using it, which is to create two different services in Kubernetes: one is the service that you have to create anyway to deploy this, which only points to the current master; and the second is a read-only service that round-robin load-balances between all nodes. The reason to do that, again, is that you often have read-only traffic and you want the replicas to shoulder some of that load. If you don't really have that much load, period, then you don't actually need the read-only service — it's just there for efficiency, not a requirement.

As for connecting to Postgres: your application is going to pull the passwords from the Kubernetes secrets. When we deployed the cluster, it created some passwords that are now stored in Kubernetes secrets; the application doesn't get those passwords handed to it directly — instead, within the individual running container, it pulls them out of the secrets. So let's go ahead and try that; this is going to be our last demo here.

Okay, so I'm going to create a container that contains the Postgres client application, and I'm going to put it in the same namespace as the cluster that we created, so that it can actually share some of its information. Okay, we are now shelled into that container, and from within the container, which is now in the same environment as the Postgres cluster, I can just address it by name. That's very important, because you don't want to be monkeying around with Kubernetes virtual IP addresses in your application code — I can tell you, you will not get that right. Instead, we just use the internal discovery services to address it by name. I already gave this cluster the name patroni-demo, and that, importantly, is the service name from Kubernetes' perspective, so I can connect directly using the service name. I've already set up the pgshell container to automatically use the credentials provided by the secrets, so I don't really have to do anything else. And from the perspective of building an application, it's actually easier for your application developers not to have to think about any of this, right — they just connect as if to a local Postgres, just like they do when testing on their laptops.

So here you can see we're connected to the master — for people who know Postgres replication, this is asking whether we're in recovery mode, which would only be true on a replica, and here we're on the master. Now, for the load balancing side, let me connect specifically to patroni-demo-2, because I know that's a replica — and you can see here that it is a replica. One of the things that StatefulSets give you is that you can address specific nodes by appending the node name to the name of the service, and that's useful when you need to connect to a specific node for something — like, sometimes you set up a replica as a special reporting replica and don't load-balance traffic to it, so you might want to do something more sophisticated than just load-balancing to everybody.

At this point I don't have a lot of time left, so that is all I'm going to do for demos here. There are more demos you can try yourself in the GitHub repository, covering the rest of the stuff such as the management API for Patroni, which we're not going to go through here. A few other words for when you're actually testing this out: you do have to clean everything up. Those ConfigMaps I showed you will persist after you've deleted the Postgres cluster, and that's on purpose, because we don't want to erase the data of a cluster that is merely down, rather than deleted — the ConfigMaps are the placeholder for that. But it does mean that if you actually wanted to erase the cluster, you have to remember to delete that ConfigMap as well.
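For reference, the master-versus-replica check used in the demo above is just one built-in function; the service and pod names are the ones from the demo:

```sql
-- run against the main service name (patroni-demo), which always
-- points at the current master:
SELECT pg_is_in_recovery();   -- false on the primary

-- run against a specific pod addressed through the StatefulSet
-- (patroni-demo-2 in the demo):
SELECT pg_is_in_recovery();   -- true on a replica
```

The same query, two different answers, is how you confirm which role a given endpoint is serving.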
Of course, I work for Red Hat, and we work on Kubernetes because we have our own commercial distribution of Kubernetes called OpenShift, which is, you know, a slightly-behind version of Kubernetes with lots of user-friendly stuff on top. You do have to make one change in the config for Patroni to have it run on OpenShift — a small config option called use_endpoints. The one other thing I actually want to say is that this is an open source project written in Python; we have a bunch of contributors from a bunch of different companies working on it, and your contributions are welcome — whether it's fixing bugs, adding documentation, or adding plugins for specific environments that we don't currently cover, which is most of them. And there's a list of resources for you; if we have these slides in the Postgres wiki, I'll go ahead and attach them so you can actually get that list. All of that said, we have about five minutes for questions. Questions?

[Question about backups.] Okay, so the way you actually handle the backup portion of things is that you use one of the plugins. I know Crunchy contributed a plugin for pgBackRest, so you would enable that plugin and give it its configuration so it knows where your storage is located. There's a plugin for WAL-E for people who use S3 as their storage. You need something that will run as a program, sort of autonomously, in there; you could probably come up with a simple shell plugin that backs up to a persistent volume attached to the container, if you have some other storage configuration. Then for restore, you use a second plugin — an action that executes when you bring up a node — and it consults the long-term storage first, before trying to restore from an active master, which is important in case there isn't an active master. Oh really, somebody else wrote it? They're saying somebody else actually wrote the pgBackRest one — I didn't know that. Okay, cool.

[Question about how to handle partitioning.] Are you talking about partitioning or sharding? Okay. I've played around a lot with sharding on Kubernetes, because supporting that was part of my goal; I'll admit I got sidetracked, because this is not actually my day job. It gets more complicated because the shards are not identical, if you follow me — each shard takes some portion of the data — and StatefulSets don't really give you support for that. So I started down the road of writing a custom controller that would actually pay attention to that; I've just never completed it. So yeah: implementation required.

[Question about pod security.] I guess I'm not understanding what you mean. Okay, so, pod security — what's the problem there? Well, obviously you can inadvertently prevent anybody from accessing the database; that's a very easy thing to do. Pod security policies are also a beneficial way you can handle this, but the thing is, the pod security policy would need to be defined in the StatefulSet to begin with for it to even exist, so I don't see why it would be a problem other than through misconfiguration. Yeah — I guess I'm not really understanding why this is a problem, so let's talk about it out in the hallway while the next speaker is setting up; any further questions will have to be out in the hall.

[Question about attaching a replica outside the cluster.] No, not through this. But there was actual talk about that, because what you would want is to not have that node be managed as part of the individual Patroni cluster, if you follow me.
Instead, you would create another node and attach it to the same master — sort of hand-create it. There was a PR specifically about this, because one of the problems is replication slots: there was a PR to manage replication slots so that such a node wouldn't end up getting clobbered. I don't know the status of that PR; I think it went in. The issue is that otherwise, when we redefine the replication slots, that node would lose its connection. But there's no way to manage that node through Patroni itself, because Patroni, again, wants all of the nodes to be identical. There are various ways you could hack it, though, because what you really want is just for it to pick up a different Postgres config when it starts up — and, for instance, you can prevent connections to it. In particular, one of the things we're looking at is how Postgres 12 is going to affect all of this, because the replication config is in a different place now; that has the advantage that you could actually use ALTER SYSTEM to modify the replication config. So again: possible, but you'd have to hand-roll it. Any further questions will have to go out in the central area. Okay, good to go.

Welcome back after lunch. Who's asleep yet? Well, I can't ask that — whose neighbor is asleep already? Okay. That's what I like to do when I get the slot after lunch: feel free to sleep if you want, but please don't snore; try to avoid that part. So, we're here to talk about Postgres. My name is Magnus Hagander. I work for a company called Redpill Linpro; we're an open source services business based in Scandinavia — I'm here from Stockholm in Sweden — and I lead our database efforts, which are, you know, unsurprisingly centered around Postgres. That is why I'm here. Within the Postgres community, I'm a member of the Postgres core team — the steering committee for the project — I'm one of the committers on the database, and I'm currently serving as the president of the board of PostgreSQL Europe, which is the non-profit that coordinates community activities throughout Europe.
But we're here today to talk about Postgres 12, and I do apologize for the crappy projector picture; we tried our best to fix it and failed. But you're all in technology one way or another, right, so you know: technology is hard. So, Postgres 12 — who's using it today, already, in production? Again, really just the two of us? Oh, come on. It's all good fun. Who's using it in production for something really mission critical? Maybe not, okay. So who's using Postgres 11? Okay, that's a couple more. 10? 9.6? 9.5? 9.4? 9.3? 9.2? 7.4? Okay, good, I got rid of Steven. I mean, don't discount it — I ran into a customer just late last year who was running 6.4 in production. That's impressive. It worked — well, they thought it worked; it kind of worked. You can't really upgrade something that old out onto modern machines.

Anyway, we're going to talk about Postgres 12, and Postgres 12 is not really done yet — even to the point that I put the wrong version number on my slide, because it's supposed to say that Postgres 12 is not done yet. Postgres 11 is definitely done; Postgres 12 is still under development, and it's scheduled to be released later this year. If we look at the actual schedule: in Postgres, we work with something we call commitfests, or CFs. That is really just, you know — all software development does some level of iterative development, and we all have to have our own names for it, so that's our name. The general idea is that we do one month's worth of development and building features, and then one month of reviewing and committing those features — that's the part we call the commitfest — and then we repeat. And then, for some reason, we only get through four or five of them in 12 months, because we target a release approximately every year; the reason is that the final one tends to take about two months, and then we take a break while we stabilize and sort of fix up the upcoming version.

So, what is Postgres 12 — and this is actually the correct version number here. In July last year, we branched off what became Postgres 11, which was eventually released in October of last year. The way we work is: we branch off into a branch that we call 11 stable, and we just declare it stable — and look at that, suddenly it's stable. Then we open up the master branch, the head branch of the repository, for the development of what's going to become Postgres 12. We ran one commitfest in July, one in September, one in November, one in January, and we are currently running one in March. Normally the one in March tends to be the one that gets the most actual features, whereas the earlier ones get a lot of the infrastructure that those features build on. But at this point, anything that is not either already committed or in this commitfest will not make it into Postgres 12. Now, the things that are already committed — which is all this talk is going to cover — may still go away, right? Just because you heard it here is not a guarantee that it will actually be there. It has happened. We'll primarily try to fix things if they're broken, but it's certainly possible that some things that have been committed will be reverted because they weren't quite ready. And given that we're only about a week into the March commitfest, there's quite a lot we just don't know yet — but we have high hopes. I've excluded some of the patches in the queue that I really don't think are going to happen, so there may even be some happy surprises afterwards.

Anyway, let's start, as I usually do, with the things that we did just to break your stuff — in particular, things that, if you're using them, just don't work anymore.
For example, if anyone is using the data types abstime, reltime, or tinterval: they're gone. You shouldn't have been using these; they've been deprecated for a long time. Note that it says tinterval there — there is also a data type called interval, and we're certainly keeping that one; it's really, really nice to have, and we're not removing it. But if you're on one of those old data types, they're gone. Another one like that: we no longer have support for non-strong random numbers — you can't compile Postgres on a platform where you can't possibly get strong random numbers. If you were building Postgres with --disable-strong-random: why were you doing that? And also, you can't do that anymore. There will be a couple more things, but those were the ones on my list of things that broke without giving you anything else in exchange. We have some other things that are just different, where you will have to update, but then you get something in return as payout for doing that update.

So let's talk about the actual new features. When I do these talks — some of you may have seen me do them before — I try to group things into four categories: DBA and admin, SQL and developer, backup and replication, and performance. It's kind of hard — what's the difference between DBA and developer in a modern organization? Hopefully, in your organization, the difference isn't that massive. I just categorize by SQL: the developer things are the things you do through SQL, and the DBA things are the things you do not through SQL. You have to split it up somewhere.

So, some of the administration side of things. If you are using SSL with your Postgres, we now have some more information for you in the pg_stat_ssl view; in particular, we can show you the serial number of client certificates, and the issuer DN of the client certificates, if you're using certificate authentication. The other thing to notice is that we are now masking data for unprivileged users in this view: if you're not a superuser or pg_monitor when you're reading this view, you will only be able to see information about yourself or roles you're a member of; you have to be a superuser or have the monitoring role to see all the data about all the different connections. That's the same way pg_stat_activity already works, so there's nothing new there — well, it's new that it applies to this view. You now also have the ability to control the minimum and maximum SSL versions that the server will speak. The default is that the minimum version is set to TLSv1 and the maximum version is empty, which means whatever your OpenSSL supports. The supported versions are TLS 1, 1.1, 1.2, and 1.3; we don't actually support anything prior to TLS version 1 anymore. We still call all the settings and all the views "ssl" even though it's all TLS these days, but renaming everything would just be breakage for no reason. What you can't do anymore is configure Postgres to run with the old insecure versions of SSL.

Moving on — who loves vacuum? Yeah, I'm sure you do. We got a couple of new things around vacuum. We can now set an option on VACUUM called SKIP_LOCKED, where it will just skip vacuuming any relation that is not immediately lockable, so you don't end up with lock queues and things blocking behind each other when you run a manual vacuum across the system. You can also disable the page skipping — this is one of those wonderful double negations, right, disable the skipping. Normally, vacuum will look at the visibility map and skip vacuuming the parts of the table where it believes it doesn't have to do anything. If your visibility map is, for some reason, not showing the right data — or you suspect it might not be — you can pass DISABLE_PAGE_SKIPPING, at which point it will actually vacuum the whole table, even the parts it believes do not need to be vacuumed. Because we all know what happens if you don't vacuum things that actually needed to be vacuumed: nothing good comes from that.
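The two new options slot into the usual parenthesized VACUUM syntax; the table name here is made up for illustration:

```sql
-- Postgres 12: skip any relation that can't be locked immediately,
-- instead of queueing up behind other lock holders
VACUUM (SKIP_LOCKED) my_table;

-- Postgres 12: ignore the visibility map and scan every page,
-- even ones marked all-visible
VACUUM (DISABLE_PAGE_SKIPPING) my_table;
```

The options can also be combined with the existing ones, such as VERBOSE or ANALYZE, in the same parenthesized list.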
If you are using vacuumdb to run your vacuums, you now have two new parameters to be selective: --min-xid-age, and --min-mxid-age for multixact age, where you basically say: only vacuum tables whose transaction ID age is at least n — tables that haven't been vacuumed in a long while. That gives you, for your manual vacuums, some of the tunables that autovacuum already has in the background.

Moving on: the Postgres COPY command, which I'm sure you all know, now has the ability to take a WHERE clause when you are loading data. When exporting data from Postgres, you've always been able to export either a table or the result of a query, and that query can have a WHERE clause or a join or whatever you want; as of Postgres 12, you can also apply a WHERE clause when loading data into Postgres. So in the simple case, where I'm copying my table from a CSV file and I only want the rows where a is greater than five — a obviously refers to the first column in this table — you can filter as you load. Now, it's a CSV file, so we still have to read the whole file, row by row; we can't skip around within the file. But we no longer have to do it in two steps: previously, if you wanted this, you had to load into a temporary table and then do an INSERT with a filter, whereas now you can do it in a single step. Unlike exporting, where you can use a subquery with joins or whatever you want, this can only be a simple WHERE expression — you can't join against your CSV file and then import. If you want to do that, look at the file_fdw foreign data wrapper, which can create a virtual table that, instead of living inside Postgres, points at a CSV file; then you can run any sort of query against it. It may not be quite as efficient for pure data loading, but it gives you a lot more flexibility.

And speaking of COPY — speaking of CSV — psql now has the ability to directly output CSV-format data, in case you want that as part of a shell script. I'm not sure it's really useful in interactive mode, but some people might really like CSV and want their interactive psql to produce it. If you \pset your format to csv, any queries you run will now be output in CSV format, instead of the tabular, pretty-printed ASCII tables we do by default.

Who's using pg_stat_statements today? Okay — this is sad; most of you are not using pg_stat_statements. You should start: go home and look at it, it is really, really useful, and it'll tell you what's happening in your system. For those of you not using it: pg_stat_statements collects your most frequently run SQL statements and gives you statistics about them; it normalizes them, to sort of anonymize your parameters, and groups them. The new thing in this version is that you can now reset an individual query: pg_stat_statements_reset, which would previously reset everything, now has a form that resets just one query — you pass it the query ID, which is in the pg_stat_statements view, and it will reset just that one.

Okay, let's move on and look at a few of the things that will affect you on the SQL and developer side. Speaking of breaking things: who is creating tables with OIDs today? You're not doing that — and I believe you — okay, good. You shouldn't be doing this; it's been deprecated since 2005, and it is now actually removed, so you can't create tables WITH OIDS anymore.
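The load-with-filter case just described looks roughly like this; the file path, table, and column names are made up for the example:

```sql
-- Postgres 12: filter rows while loading; only rows where a > 5 land in the table
COPY my_table FROM '/tmp/data.csv' WITH (FORMAT csv) WHERE a > 5;

-- pre-12 equivalent: stage into a temporary table, then filter on insert
CREATE TEMP TABLE staging (LIKE my_table);
COPY staging FROM '/tmp/data.csv' WITH (FORMAT csv);
INSERT INTO my_table SELECT * FROM staging WHERE a > 5;
```

The whole file is still read either way; the win is skipping the staging step, not skipping I/O.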
Previously, there was a separate clause you put at the end of your CREATE TABLE that said WITH OIDS, where Postgres would add a system column called oid with a potentially unique number as the OID. It was unique until it wasn't, right — because it was a 32-bit number, and we all know 32 bits should be enough for everyone, except we also know it isn't. The other thing is, as you may know, all the Postgres system tables also have OIDs, and those used to be very special and magic in how they were handled internally in Postgres. They're not anymore — oid is now just another column. Your queries will not be affected by that; they will still just work, though extensions that relied on the magic behavior may need updating. In your WHERE clause the column is still named oid, so if you're joining two system tables on oid, it will still work. It will also now be included in your SELECT * — but nobody's running SELECT *, right? You all know you're not supposed to do that.

The other big change: for any of you using enums, you know this was one of those annoying things you couldn't do inside a transaction. Maybe you didn't actually need a transaction, but maybe you were using a tool that always put you in one — and you couldn't add values to your enums. You could use your enums, otherwise they would be useless, but you couldn't add new values to the set available to the enum. What you can do now is, inside a transaction, alter your enum and add another value, and since it's inside a transaction you can roll it back or commit it, whichever you want. In previous versions of Postgres, this would just give you an error saying, no, you have to run this outside of a transaction block. Of course, we now have a different error message for you: if you do ALTER TYPE ... ADD VALUE and then try to insert that value within the same transaction, that is what errors now — you have to actually commit that transaction, and then you can start using the value from any session. But it can be part of, say, your application migrations, which of course run in transactions, as they should; previously you couldn't add things to your enum there, and that part will work perfectly fine now — you will be able to add them, and in a migration you typically don't use the new value until the next migration, or the actual application, anyway.

We also have a parameter now to control the plan cache: plan_cache_mode. Normally, Postgres uses a mode called auto, and you couldn't change it before; you can now control, for your session, whether it will use a custom plan or a generic plan. The difference: with a custom plan, every time you run the query in question, it re-plans the query; a generic plan is when Postgres detects that it can make and reuse a query plan for a prepared statement, or for the queries inside a PL/pgSQL function, for example. You can now explicitly force either behavior, and I think the usual use case is to force custom plans in cases where you get a generic plan that works for most parameter values but is really bad for some of them.

On the optimization side: who knew — just as a quick survey — who knew that CTEs, you know, the WITH queries, are optimization barriers in Postgres, that Postgres can't optimize inside of a CTE? That is far too few of you; I'm sure the rest of you have queries that suffered from this, because you thought you could just use CTEs to make everything readable and very easy to write, and it would work well and run fast — except it didn't necessarily run fast. This used to be the case in Postgres, and it's been argued for quite a few releases by now whether that's a good thing or not; in Postgres 12, the people who said it's not a good thing finally won out. So let's look at a simple example.
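The enum change just described can be sketched like this; the type, value, and table names are hypothetical:

```sql
-- Postgres 12: ADD VALUE now works inside a transaction block
BEGIN;
ALTER TYPE mood ADD VALUE 'ecstatic';
-- note: trying to USE 'ecstatic' here, before commit, is what errors now
COMMIT;

-- after the commit, the new value is usable from any session
INSERT INTO diary (entry_mood) VALUES ('ecstatic');
```

This is exactly the migration-tool case: the migration that adds the value commits, and a later migration or the application starts using it.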
I create one CTE that does SELECT * FROM foo, another CTE, t2, that also does SELECT * FROM foo, and then I just SELECT from both with a UNION ALL — that will run both CTEs, right? If I run this on Postgres 11, or 10, or 9, the plan comes back saying: okay, I'm going to do a CTE scan of this one and a CTE scan of that one — it basically materializes them and then scans the two CTEs. That's the optimization fence in effect: if we were doing joins with these, it couldn't propagate filters into the CTEs. Of course, I was too lazy to bring an actual example with real-world data — this is a single table without a row — but you get the idea. Now, the difference we've added to the parser in version 12 is: other than just saying WITH t AS (SELECT * FROM foo), which is the same as before, you can also say AS MATERIALIZED (SELECT * FROM foo). If you add the MATERIALIZED keyword, then CTEs in Postgres 12 will behave like they did before. If you don't add MATERIALIZED, the default has changed: we will now inline the CTE and optimize through it. So if you were actually using CTEs as an optimization barrier, you need to put this MATERIALIZED keyword into your queries, or the optimization barrier goes away. What I'd say is: if you were using them intentionally as an optimization barrier, when you upgrade, still try running your queries without it — maybe you don't actually need the barrier anymore; it might have been necessary when you put it in five or six versions ago, and isn't now. If we look at the output of the same query on the same table, now running on Postgres 12: you notice we still have the CTE t2, and a CTE scan on t2, because it had the MATERIALIZED keyword, but the first one, WITH t AS SELECT * FROM foo, has actually just been inlined into a direct select from foo. Obviously, this gets more interesting if you have an actual query that's more than just SELECT * — you might be joining, you might have WHERE clauses on the outer part of the query.
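The before-and-after just described, as a runnable sketch against a hypothetical table foo:

```sql
-- Postgres 12: t is inlined (the new default);
-- t2 keeps the old optimization fence via MATERIALIZED
EXPLAIN
WITH t  AS (SELECT * FROM foo),
     t2 AS MATERIALIZED (SELECT * FROM foo)
SELECT * FROM t
UNION ALL
SELECT * FROM t2;
-- the plan shows a plain scan of foo for t, but a CTE Scan for t2
```

On Postgres 11 and earlier, both branches would show CTE scans, since every CTE was materialized unconditionally.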
A WHERE clause out there will now get pushed down into the CTE itself, and can therefore optimize it quite significantly.

Okay, let's look at backup and replication. I'm not even going to ask who has backups — I know you don't have backups. The parameter max_wal_senders controls the number of replication connections you can have into Postgres. It was previously a subset of max_connections, right: if you had max_connections at 100, which is the default, and max_wal_senders at 10, that could be, for example, 90 regular connections and 10 WAL senders — but if you had 95 regular connections, you could only have five replication connections, even though you'd configured it for 10. The difference now is that these have been separated: with max_connections at 100 and max_wal_senders at 10, you get 100 regular connections and 10 replication connections, coming out of different pools. That makes it a lot easier to configure, and to not run out — because running out of replication connections can be pretty bad: your replicas don't work anymore, your backups don't run anymore, things like that. So it's a good thing to be able to lock that one down.

Another big feature that has been talked about in Postgres for years and years and years is the recovery.conf file. recovery.conf doesn't really exist anymore in Postgres 12; it has been integrated into postgresql.conf. All the parameters you would previously set in recovery.conf — your primary_conninfo if you were doing replication, your restore_command, recovery targets, and all of those things — are now in the main postgresql.conf file. You can put them in the main config file, or in a different file if you want, using the include functionality of the config; and they can land in postgresql.auto.conf if you're using ALTER SYSTEM SET — so you can now use ALTER SYSTEM SET for recovery parameters.
And of course the other thing recovery.conf did was tell Postgres to do recovery, either from a backup or as a standby server. To handle that, there is now also a file called recovery.signal, which has no content, but whose existence will trigger Postgres to go into recovery mode and then read all these settings from the main configuration file. That means that if you're using any script of your own to handle restores, you have to update it to do this. If you're using a tool for your backups, like pgBackRest or Barman or whichever one, you need to upgrade it to a version that knows how to do this once you get to Postgres 12. So all your external tools dealing with the recovery side of backups need to be upgraded for Postgres 12 compatibility — the backup side obviously doesn't change; this is only the recovery side. And there are of course some reasons for it — I mean, we didn't make this change just so that you would have to upgrade your scripts. In fairness, it's a good thing to look over your recovery scripts every now and then, but the real reason is that we're now able to reconfigure a number of the things that were in the recovery.conf file while only reloading Postgres, not actually restarting everything. There are a couple in now, ones that aren't used all that much, but you can for example change recovery_min_apply_delay — if you have a time-delayed standby server, saying I want my standby to sit one hour behind the master, you can change that one-hour value without restarting. Some will say, well, I can just restart the standby, nobody cares — well, the people running reports on it, for example, will care. So that's a good use case for it. Likewise, if you're using archive_cleanup_command — if you're doing pure file-based WAL shipping and not actually using streaming replication — you can change that one, too.
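A rough sketch of what that looks like on Postgres 12 — the host, paths, and delay value here are placeholders, not anything from the talk:

```
# postgresql.conf (Postgres 12) -- these lines used to live in recovery.conf
primary_conninfo         = 'host=primary.example.com user=replicator'
restore_command          = 'cp /mnt/wal_archive/%f %p'
recovery_min_apply_delay = '1h'   # now changeable with a reload, no restart

# An empty signal file replaces recovery.conf as the trigger:
#   touch $PGDATA/recovery.signal
```

For completeness: Postgres 12 actually distinguishes recovery.signal (targeted recovery) from standby.signal (run as a standby); the talk only mentions the former.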
There are a couple of other things that aren't all that common, and more are to come — hopefully some will actually be committed for 12. These were probably the lowest-hanging fruit, where literally no code change was required, just a flag saying you can now change these without a restart. So that's always nice. By the way, who's using archive_cleanup_command? Wow, that's like two or three people, and all of you probably know what it does. Just so far this year I've run into two different clients who had archive_cleanup_command configured to point to their WAL directory inside of PGDATA. If you ever use archive_cleanup_command: don't do that. It deletes your transaction logs; you will not be happy. It works as long as your server never goes down — which is kind of the whole point of having it. So yeah, don't do that. Another classic misconfiguration that a lot of people don't realize: the parameter recovery_target_timeline now has a new default value. The new default is latest, which is probably what most of you thought it already was, because it's what you almost always want. You have to set recovery_target_timeline to latest if you have multiple standbys and you want them to follow along when you do a failover — you'd switch timelines, and then your other standbys would not be able to continue unless this is configured to latest. I'd say in sort of 98 to 99 percent of all cases you want it to be latest. Well, you don't have to go configure your system to do that anymore. If you're on a Postgres version prior to 12 — which I guess most of you are — and you're using replication, you might want to go double-check that you've actually set recovery_target_timeline to latest, so that you don't get any evil surprises. Department of small things around backups: there is now a function called pg_promote that you can run inside Postgres on your standby server that will promote it, whereas
previously you would do pg_ctl promote from the command line — well, you can now do that from inside Postgres. And just because I'm talking about backups: you know we have this old interface to do exclusive base backups, which used to be the way to do it in Postgres. It's been deprecated for four releases now, I think, and it's even more deprecated now. There were some big arguments on the list about actually removing it, but I think the consensus is still that we're going to deprecate it even more and then we're going to remove it. So if you have backup scripts that are using the old-style pg_start_backup and pg_stop_backup functions — the ones where you don't have a persistent connection to the database — you need to fix that. And you need to fix that independently of us deprecating it, because there are very good reasons for the deprecation: you can find yourself with a database that doesn't start, or data that's corrupt, or evil things like that, if you're not really, really careful. So: even more deprecated. Performance — everybody loves performance, right? Everybody wants their database to run faster in the new version, except the ones who are selling new hardware, or I guess now also the cloud providers, who don't really want that, because you'll buy bigger instances if it's slower. But we do try to make it faster in every version. Looking at a few things we've improved: SP-GiST, the space-partitioned GiST indexes, now supports KNN searches — k-nearest-neighbor — something we previously only had in plain GiST indexes. Basically, if you have a query that is an ORDER BY a distance with a LIMIT, it can use an index scan and not have to compute the distance for every row. There is actually a patch in the final commitfest queue, possibly, to support KNN searches over standard B-tree indexes. I don't think it's going to get into this version, but we'll get it eventually. Still, having it in space-partitioned GiST is a big improvement.
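For example — the points table and its location column here are invented for illustration:

```sql
-- Promote a standby from SQL, instead of shelling out to pg_ctl promote
SELECT pg_promote();

-- KNN with SP-GiST in Postgres 12: an ORDER BY <distance> ... LIMIT
-- can walk the index instead of computing the distance for every row
CREATE INDEX points_loc_idx ON points USING spgist (location);
SELECT *
FROM points
ORDER BY location <-> point '(0,0)'
LIMIT 10;
```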
The Postgres partitioning has been the headline of Postgres 11 — actually it was the headline of Postgres 10, but it became really useful in Postgres 11 — and there are improvements to it again, of course, in the new version. Postgres 12 will have the ability to set more flexible partition bounds: if you're doing range partitioning, you set the start and the end, and those had to be plain constants before. They can now be a generalized expression, so you can do things like: I want a partition from infinity in the past until today, and then another one from today into the future. Of course that's going to be materialized at the time when you create the partition — you don't get a floating partition based on something like today — but you can use generalized expressions, function calls, math, and so on when you define your partitions, which will make quite a few things simpler to handle. There are a number of changes throughout the locking of partitions. One really interesting one: the lock on an individual partition, when you read it, now only happens when the partition is actually scanned, whereas in Postgres 11 and earlier, when you started running the query, it would lock all the partitions — well, lock them in a share lock, so you would no longer be able to, say, drop partitions, which is what it was protecting against. The problem is that if you have a large number of partitions, just locking them takes a lot of time and wastes resources. So Postgres 12 will only lock a partition once it starts scanning it, which means it also only has to lock the partitions that are actually used, and not the ones that ended up not being used. You will also now be able to attach a partition to a table — a pre-existing table that you created on the side and maybe loaded with data with COPY or something like that — without taking an access exclusive lock across the whole table. So while concurrent activity is running in the other partitions, you will be able to add
a new one, or attach an existing one. It will take an access exclusive lock on that partition, but not on the other partitions, and in particular not on the parent table that owns everything. Multi-inserts: we will now use multi-inserts for COPY into partitioned tables. That's one of those things that actually surprised me when I saw it, and apparently it surprised the guy who wrote the patch as well. If you're copying data into a table in Postgres, there's an optimization called multi-inserts, where it pushes in multiple rows at the same time. If you copied data into a partitioned table — into the parent table — that was just turned off, whereas if you copied directly into a partition, it wasn't. In Postgres 12 it will use multi-inserts when you copy data into the parent table. So the simple version is: COPY into partitioned tables will be faster, and it works exactly the same way otherwise. Postgres 11 brought us JIT compilation, using LLVM to compile expressions and some other functionality in the system. It was turned off by default — you had to explicitly turn it on. Postgres 12 will turn it on by default, assuming, that is, that you have installed support for it: you still have to have the compile-time LLVM support, and the libraries have to be there, so it depends on which platform you're on. For example, I think on most of the Debian-based platforms and derivatives — Ubuntu, Mint, etc. — it will install the LLVM JIT support by default; it does so on Postgres 11 as well, it's just turned off there, so there you will now be running it. Whereas on a Red Hat-based platform it will not be installed by default, because there is a dependency on packages from EPEL to get the correct version of LLVM — it's still there, but it's a separate RPM that you have to install. There was actually a discussion just yesterday or today about how to deal with that, now that JIT is supposed to be on by default. So that's a walk-through of some of the things that are in there now.
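Sketching a couple of the partitioning changes just described — the events table is invented for illustration:

```sql
-- Postgres 12: partition bounds can be general expressions,
-- evaluated once at partition-creation time
CREATE TABLE events (ts timestamptz NOT NULL) PARTITION BY RANGE (ts);
CREATE TABLE events_past PARTITION OF events
    FOR VALUES FROM (MINVALUE) TO (now());

-- Attach a side-loaded table as a new partition without an
-- ACCESS EXCLUSIVE lock on the parent
CREATE TABLE events_future (ts timestamptz NOT NULL);
ALTER TABLE events ATTACH PARTITION events_future
    FOR VALUES FROM (now()) TO (MAXVALUE);
```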
But as I said, a lot of interesting things are still in the queue, because we're early in the last of our commitfests and we're far from done — there's a whole bunch of really interesting stuff, and hopefully some of it will make it in. Not all of the things that I mention here are going to make it in; hopefully all of them will at least make it into the next version, so you'll get something back out of it eventually. We may get what is known as pluggable access methods and a pluggable storage model for Postgres. There was actually a commit made just yesterday on this, and by following Twitter, a lot of people now think that Postgres has pluggable storage. It really doesn't: you can say CREATE ACCESS METHOD and DROP ACCESS METHOD, but it's not actually plugged into anything yet. Now it will be — what the target is looking like now is that Postgres 12 will have pluggable access methods, the full layer actually being used, with one implementation, and that implementation being exactly the storage engine that you've had in Postgres for many years. But it's building a foundation. There are also developments ongoing for some external storage methods — in particular there's an interesting one called zheap, which is undo-based, similar to how Oracle works, for example. But even if we get all of this pluggable access method work into version 12, that won't be enough for them to finish zheap running on this API; more is needed. So we'll have pluggable access methods, you won't really be able to use them yet, but it's important infrastructure work to bring this into maybe Postgres 13 or maybe Postgres 14, where you'll be able to pick different kinds of access methods at the table level, things like that. It will get there, but not yet. Who has ever read the Postgres documentation? Really? OK, that's a lot of people who have never read the Postgres documentation. I don't know if that's good
or bad. What you'll have noticed is that there are no figures in it. Well, we will probably have that — graphs and graphical displays of different kinds of things; I'm hoping to see that in the documentation of this version. A simple thing, but kind of important: we have a patch in the queue for being able to use serializable transactions — that's full-on SERIALIZABLE, using SSI, the serializable snapshot isolation that Postgres has — together with parallel query. It doesn't currently work with parallel query: if you set your transaction isolation level to serializable, parallel query gets turned off. Hopefully we can keep both of these running at the same time; if not, you'll have to run in REPEATABLE READ, which will hopefully be enough for most people. But that would be a good addition. There's a patch in the queue looking at generated columns. You can do something very similar to generated columns using views, or materialized views, depending on what you're doing, but generated columns are a standard feature from the SQL standard, and we don't have it. Basically — well, you can tell from the name, right? You create a column and say: whenever I insert a row into this table, the contents of this column should be generated based on these other columns. So it's like a materialized column instead of an entire materialized view. This next one I'm really hoping for — I'm not going to give it super high likelihood at this point, but it's not looking too bad — which is the ability to have a foreign key referencing a partitioned table. Today in Postgres, if you have a partitioned table, the partitioned table can reference other tables with a foreign key, but you can't have other tables with a foreign key that references into your big partitioned table. Hopefully we'll have that — we'll see exactly where it lands, but I think it has a decent chance of making it.
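The generated-columns patch mentioned above uses the SQL-standard syntax; a sketch, with an invented table (at the time of this talk it was still just a patch in the queue):

```sql
-- A stored generated column: computed from the other columns
-- on every INSERT or UPDATE
CREATE TABLE items (
    price_net   numeric NOT NULL,
    vat_rate    numeric NOT NULL,
    price_gross numeric GENERATED ALWAYS AS (price_net * (1 + vat_rate)) STORED
);
INSERT INTO items (price_net, vat_rate) VALUES (100, 0.19);
```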
There is work in progress on a shared-memory stats collector. The statistics collector in Postgres today — the one that counts how much your indexes are used and how much your tables are used and so on — is currently based on each Postgres backend sending data to the statistics collector over UDP; then, to read it back, it's spooled through temp files and read back on the other side. There's work in progress to move that into shared memory, to generally avoid the overhead. Now, the UDP overhead is not actually that bad, but one of the really expensive parts of the old system is rebuilding the hash tables on each side of the dump, because we can't dump a hash table as-is; if we can put this into shared memory, we should be able to avoid that. There's work in progress on progress monitoring for CREATE INDEX and for CLUSTER. You may have seen that today we have pg_stat_progress_vacuum, which will tell you, for all your processes running vacuum, how far along they are within vacuuming a table. Well, this will give you a similar thing for CREATE INDEX, so you'll be able to query and say: hey, this CREATE INDEX that's running — is it one percent done or 95 percent done? It won't be exact numbers, but it'll tell you one or 95, right? And the same thing if you're running a CLUSTER command. I think those have a pretty decent chance of getting in. There is yet another attempt to give us REINDEX CONCURRENTLY, which is one of those things that sounds like it shouldn't be that hard — and many people have gone down that route and tried to write a patch for it, and it turned out it was hard. But we're still thinking it's not entirely unlikely we'll have that one. There is work on a block-level parallel vacuum. Right now, autovacuum, for example, can parallelize across multiple processes, but on a single table it can only run on one CPU at a time — so if you have one massive table and everything else is small, it doesn't
really help you, unless you partition the table. So the idea here is to be able to parallelize the vacuum of an individual table across multiple CPUs. Hopefully we will have a tool to do offline enable and disable of page-level checksums. Who's running with page-level checksums in their database today? The rest of you: whenever you install a new Postgres, you should probably run with checksums enabled. The problem today is there is no way to turn them on or off afterwards — that's what we're hoping to solve now. Doing offline enabling of checksums is probably not going to be super attractive, because it requires rewriting the whole database while it's offline — probably not that good; the way to actually get checksums enabled today is to install a second database with checksums on and do logical replication between the two. Whereas disabling checksums is really fast, because we don't have to actually clear all the checksums — we just have to set the control-file flag that says we have no checksums anymore. So that will let you do it offline with less than a second's worth of downtime. We're also hoping to have online verification of checksums, so that you can just run a script against your live database saying: verify all my checksums. Right now, the way you do that today is really that if you back your cluster up, either using pg_basebackup or a tool like pgBackRest, it will validate your checksums for you online — so if you're already doing that, you don't really need this feature — but it allows you to run it as a separate process. There is a tool today that lets you do this offline, but again, it's not much fun to take your multi-terabyte database offline and run something that has to read the whole database; you can, however, use the offline tool against, say, a snapshot version of your database, something like that. Hopefully we will also be able to change primary_conninfo without a restart — as I mentioned, we can now change archive_cleanup_command and a few of those settings with just a reload.
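The offline checksum tool described above did end up shipping with Postgres 12, as pg_checksums — a sketch, with a placeholder data directory; the cluster has to be cleanly shut down first:

```shell
pg_checksums --check   -D /var/lib/postgresql/12/main   # verify checksums
pg_checksums --enable  -D /var/lib/postgresql/12/main   # rewrites every block
pg_checksums --disable -D /var/lib/postgresql/12/main   # only flips the flag
```

As the talk says, --disable is nearly instant because it only updates the control file, while --enable has to rewrite the whole cluster.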
Those are nice, but actually being able to change, for example, the IP address of your primary without having to restart your standby is quite useful, and hopefully we will have that already in Postgres 12. There is a patch in the queue for something called synchronous replay mode, for avoiding stale reads on hot standbys — everybody gets what that is, right? The basic idea behind it is that you should be able to get a consistent view across multiple standbys in synchronous replication mode, so that you can just load-balance your reads without having to think about it. Today you do have to think about it if you're doing interesting things with transactions; this should handle all of them. And the final one I picked to mention: there is a patch in the queue for partial de-TOASTing of compressed datums. Right now, if you have a large value stored, Postgres TOASTs it and compresses it, and then if it needs to read part of it, it decompresses the whole thing and then throws most of it away. This particularly hurts things like PostGIS, the geospatial data types: when they do searches, they'll often only need to look at the very beginning of a value to see if it's worth looking at the rest, and right now they have to decompress everything and throw it away. So this should be a nice performance improvement for that type of workload. So: that's a whole lot of features, and there is a lot more — this covered maybe 25 percent of the patches that are actually in the queue. There is a fair amount in the commitfest now that people have already given up on Postgres 12 for, that's already been flagged for Postgres 13, but there's a lot more still to come. And there's always going to be more stuff — lots and lots of smaller patches, many performance improvements. As they usually say: if there are any contributors in the room and I didn't mention your patch, I'm sorry, but there
are hundreds and hundreds of patches that have gone into the new version. So at this point we really do need your help to deal with it. If you are a C programmer, then by all means please go to commitfest.postgresql.org, find a patch, and review it. There are definitely a whole bunch of patches in there where you don't need to be a Postgres expert in order to review them — you'll probably have to know C programming, it'll be difficult to review them otherwise — but getting extra feedback in that way helps. And we also want feedback from all of the rest of you, who are just users of Postgres. You're running Postgres, right? Please download and test this stuff. You may not be quite as crazy as I am and run Postgres 12 in production — I wouldn't recommend that — but download it and test it. If you have tests for your application, please run them against Postgres 12. You can today get apt packages, and you can get RPMs through the yum repository, that will give you snapshot versions of Postgres 12. We'll eventually have a beta version out — usually that'll be around May — at which point it'll be easier to deal with, because if you run it now, it's not guaranteed that you can upgrade from one snapshot version to the next, so in theory you may end up having to dump and reload to get anywhere. But if you're running it for your tests, that doesn't really matter. And if you can tell us already now that something broke your tests, or that something made your tests run a lot slower than they did before, it's a lot easier to get that fixed now than once we release — or, you know, once you upgrade to this version a year from now and it's already been out for a while, when it's a lot harder to fix. The bigger surgery that's required can be done now; you won't be able to do it then. So again, there are packages available, and of course there's also the source code available if you want to build your own version of Postgres — just get the git repository, run configure, install it, and
see what happens. And of course we're also interested to hear positive feedback — not just that we broke everything, but also things like: hey, this workload actually runs a lot faster than before, or: this workload that sounded like it should run faster didn't. We want that sort of feedback too. That's all I had to cover this time. I think we have — yeah, we've got about five minutes if we have any questions. Yes — who wants to run around? Joe, you should get Stephen to run around, as I can see your knee is hurting. So: how do cloud providers provide feedback to us? Most of them don't, to be honest. There is a little bit of feedback — several of the big cloud providers, for the last version of Postgres at least, provided the beta version to their customers, but we didn't get a lot of feedback from the cloud providers themselves. We got some feedback from their customers, but not from the providers. Yes — OK, so on parallel processing: there is not a lot in Postgres 12, but there was a lot in Postgres 11 and in 10. Postgres 9.6 was the first one where we did parallel query, and that one basically only handled sequential scans and aggregates; 10 made it much more useful for general queries, and 11 added a whole bunch more things. I think the main thing we're hoping for here that could be really big for parallel query is getting serializable working on top of parallel query — that's the really big thing for 12. I don't think there is anything else in particular around that, but if you're not already on Postgres 11, you will see a lot of things there. And if we parallelize vacuum — yeah, also, 11 added parallel CREATE INDEX, which is very nice. Anyone else? OK, then thank you very much. Let's get going. Hi everybody, I'm Christophe Pettus — there's me — and we're here to talk about replication in Postgres. First of all, who's doing any kind of replication in Postgres right now? Any kind? OK, this side of the
room is doing replication — was that it? Did you seat yourselves that way? Anyway, we're going to do this in kind of a happy little breezy way: we're going to talk about each replication option, and we're going to do a trip through time, from the oldest to the newest kinds of Postgres replication. So we're going to start with WAL shipping, then streaming replication, trigger-based replication, logical-decoding-based replication — the new hotness — and we'll end up with a couple of exotic animals in the replication menagerie. And my voice is a bit worn out and causing feedback. Just a quick introduction to myself: I'm Christophe Pettus, I'm the CEO of PostgreSQL Experts. We're a boutique Postgres consultancy up in Alameda, California — actually in the little bit of Alameda that's on the mainland, not the island — and I've been doing Postgres since version 7.1. OK: everything about replication in Postgres is based on the write-ahead log, so we're first going to talk about the write-ahead log, because every single thing that Postgres does in replication is based on this. Most of you probably know what it is, but I'm sure some don't, so we're going to talk about it briefly. It's a continuous stream of database changes written to local storage; every single database in the world that does data integrity has something like a write-ahead log. In Postgres it's broken into a series of records, and each is a specific change to a specific underlying disk-level item, like a row. If you were to translate a record into English, it would say something like: OK, what I'm doing is updating field number three of ctid such-and-such of relation OID such-and-such to 'cat'. That's the kind of stuff that's written in the write-ahead log — it's very binary, very low-level, and very specific to a particular disk image. This is one of the reasons that, if you do a pg_dump, you can't
reapply the WAL to the database you restored from that pg_dump — everything's moved around. The ctid is actually a way of identifying a particular row, a tuple, in Postgres. If you've never done this, go into a table and say SELECT ctid FROM table, and you can see the ctid — it's a real, live column that is a page plus a tuple index within that page. You don't really need to know this, but the important point is that the WAL is this stream of binary changes. The original intention of it was crash recovery: if the system died, Postgres could replay these changes back onto the disk, to pick up changes that had not yet been written to disk — that were still sitting in cache. On disk, the WAL is written as a series of 16-megabyte segments, and they all have hexadecimal names like this. They used to be stored in a directory called pg_xlog; that was changed to pg_wal, because people were seeing this giant directory with 'log' in its name and going in and deleting the files — because after all, they're just log files, they must be optional. People were really doing this; it was a big enough problem that the community changed the name of the directory. So the original idea for the write-ahead log was crash recovery. On startup, Postgres goes to the WAL segment that contains the last checkpoint — the last time everything was flushed out to disk — and starts replaying all the activity in it to rebuild the changes. Once it reaches a consistent state — basically, once it's a clean disk image again — the database can start receiving connections. If Postgres was ever crashed and restarted and you look in the text logs — you know, the kind you can delete — you'll see these records saying: I'm replaying this, I'm replaying this, reached consistent state, OK for connections.
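The ctid experiment mentioned above is just this — any table name works:

```sql
-- ctid = (block/page number, tuple index within that page)
SELECT ctid, * FROM some_table LIMIT 5;
-- typical ctid values look like (0,1), (0,2), (1,7), ...
```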
But then, around version 8, there was this interesting thought: hmm, what if the system doing the recovery doesn't have to be the one that crashed? Well, we knew this was true already, because, for example, if a system was destroyed and you restored a point-in-time-recovery backup and replayed the logs there, the system would come back up just fine — so it doesn't have to be the literal physical system that crashed. OK, that's good. Well, what if the original system was still in operation — we're not talking about restoring a backup of a destroyed system; what if there were two systems, and we just replayed the WAL segments on the other system at the same time they were being generated on the first one? They would kind of keep up with each other. That's a really great idea — in fact, it was such a great idea that it became the first kind of binary replication in Postgres, and it came into Postgres in version 8. So, WAL shipping: how does this work? When the primary completes a WAL segment, it runs a command that's set in postgresql.conf called archive_command. archive_command can do whatever it wants — it could delete the segments, that would be crazy; it could be /bin/true, and you see that occasionally because people turned off archiving — but what it usually does is pick up the file and copy it to the secondary machine. Then the secondary runs restore_command — which at the moment lives in recovery.conf, and will be moved to the main Postgres configuration file in version 12 — to replay the WAL information. So it's this continuous process of archive_command waking up and saying, OK, I'm going to take this and move it over using some command — it can be scp, it can be rsync; you want it to be rsync, not scp — and restore_command running continuously: do you have a new WAL segment? Do you have a new WAL segment? Do you have a new — oh, you have a new WAL
segment! OK: it copies it down — literally copies it into the pg_xlog directory — then runs recovery on it, and it's done. So it's very simple, very easy to set up: you just need to be able to copy files around. There's nothing special here — you just need to be able to move a file from the primary to the secondary somehow. It's great on slower, unreliable networks like WANs, since you don't need a long-term persistent connection; you just need to be able to copy a file. That's pretty useful. And you can use it as a basis for point-in-time recovery, because if you keep the WAL information around and have a base backup, you can do point-in-time recovery with it. Looks like a question being born over there — why do the machines need the same layout? I'll talk about that more when I talk about setting it up, but it's because you literally copy the files from one machine to another, so you have exactly the same file layout on the two machines; it's specific to the data in the file. But it doesn't care how things are laid out on the spindle — you could be running ext4 on one and XFS on the other and that would work just fine. What about different binary architectures? If they're both amd64, that'll probably work OK; if one's amd64 and one's PowerPC, that will not work. If they have different endianness, forget it. The nice part is it also works on really old versions of Postgres, like 8.3, which you shouldn't be running anymore. Anyone running a version of Postgres before nine? Notice my hand is staying up — but I'm a consultant, I have an excuse; it's not always my choice. The oldest one of our clients is still running is 8.1, which is always exciting when I see their number come up on the caller ID. So, there's more good stuff about WAL shipping. One of the great things about Postgres — possibly one of the nicest features
about Postgres when you compare it to other database products — is that DDL changes are WAL-logged, and they're transactional. So, for example, you can do a BEGIN, do a bunch of CREATE TABLEs, do a bunch of DROP TABLEs, and a ROLLBACK, and it works. This is pretty much unique to Postgres; why people put up with this nonsense in any other database, I have no idea. DDL changes are pushed out automatically, too — you don't have to do any fancy migration scripts; you create a table on the primary, it gets created on the secondary, you're done. Allowing for replication lag, the secondary is a perfect mirror of the primary: every single change — minus some very small exotic animals, like unlogged tables, which we aren't going to talk about — is pushed down from the primary to the secondary. You don't have to do anything special, and you can't miss a change. And you can read from the secondary, if you set it up that way, so you can use it for load balancing and read-only queries — this works with plain WAL shipping; you don't need streaming replication for load balancing. And failover is really easy: you just promote the secondary and let it come back up, which usually takes under a minute, often much less. So it's all very simple. And of course, as you can probably tell from the title of the slide, there are downsides. The secondary is only up to date as of that last 16-megabyte WAL segment, so you can lose some changes: if you're at 16 megabytes minus three bytes and a meteorite comes down and destroys your primary, oh well, you just lost that data. You won't have corrupted the secondary, in the sense that it gets errors or becomes unreadable, but it'll be back in time compared to the primary. You also have to manage these WAL segments, because you can run yourself out of disk. For example, suppose you're doing the most basic thing imaginable, just rsyncing from the primary to the secondary, and the secondary just goes down — you know, the
meteorite bounces off the roof and hits the secondary instead. The archive_command will just keep running and running: "I need to copy this, I need to copy this." If the archive_command fails, Postgres will not throw away the WAL segment; it will keep trying until you run out of disk space, or notice the problem, or bring the secondary back up. Similarly, if the secondary goes down, it's no longer consuming these WAL files, so your archive_cleanup_command — which is a different command, inside recovery.conf, that disposes of segments the secondary doesn't need anymore — isn't running either. So the primary is dutifully copying segments over, the secondary runs out of disk space, archive_command starts failing, and segments start building up on the primary until it runs out of disk space too. This is really embarrassing — don't let it happen to you. And it's actually a little complicated to recover from, because you can't just delete WAL segments without interfering with crash recovery. So don't do that. Using the basic in-core tools — the things that come in the community edition — replicating to multiple secondaries requires some orchestration, because you get just one archive_command, and that archive_command has to push segments out to all the secondaries. So what do you do? You copy them to a centralized server which distributes them, and so on. Those approaches all work, but you do need to set them up.
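The cleanup side mentioned above has a dedicated hook in recovery.conf; a minimal sketch, using the pg_archivecleanup utility that ships with Postgres (the archive path is made up):

```
# recovery.conf on the secondary: after each restart point, delete
# archived segments older than the %r restart file, since this standby
# will never need them again to resume recovery.
archive_cleanup_command = 'pg_archivecleanup /mnt/archive %r'
```

With multiple secondaries sharing one archive this simple form is unsafe — the slowest standby still needs the old segments — which is exactly the orchestration problem described above.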
The secondary can't be written to at all, and Postgres has a very strict idea of what "written to" means: you can't create a temporary table, you can't refresh a materialized view — there are a lot of things you just can't do on a secondary. And there are some other things to be aware of. Since the WAL is a global resource across all the databases in the Postgres server, you can't pick and choose anything: you have to replicate all the fields, all the columns, all the tables, and all the databases. Every single thing gets replicated. You can't say "just this database" — you can't do anything like that with binary replication. And you can't consolidate multiple servers into one using WAL shipping. A very common thing people want to do is take a bunch of front-end servers and dump all the data down into a centralized data warehouse — not with this technology. And this is the number one question: "okay, here's our upgrade strategy — we'll set up a binary replica on version 10, it will receive changes from our 9.6 server, and then we'll cut over; it'll be great." Yeah — can't do it. You can't replicate between major versions of Postgres using binary replication. So 9.6.5 to 9.6.6 is fine, 10.1 to 10.3 is fine, 10 to 11 — no. Okay. For all of version eight, WAL shipping served us well, but it did have some limitations. And then we thought: what if we didn't ship the file? What if we just transmitted the WAL information over a network connection? We have a network — why not use it? The secondary could stay a lot closer to the primary, because we wouldn't have these 16-megabyte chunks flying over. And that's really all streaming replication is: pretty much the same WAL information, but transmitted to the secondary directly over a network connection rather than by swinging files around. It's really not that hard to understand. Yes — there are some
differences between the two, but they're really not that important — interesting from a theoretical point of view, but pragmatically they make little difference. The basics: in recovery.conf you set a line called primary_conninfo to point the secondary at the primary. The secondary initiates the connection to the primary, and having done so, the primary starts firing WAL information down to it. Otherwise it works much the same as WAL shipping. The good part: the secondary can stay extremely close to the primary in terms of transaction activity. With optional synchronous replication, the chance of a lost transaction — that is, one that's been committed on the primary but not on the secondary — is essentially zero. This is a really big performance hit, though, so just to be clear: don't turn it on just for amusement value. You can also turn it on on a per-session or per-transaction basis, so you can say "this next transaction is really important and needs to be synchronous." Recent versions of Postgres also have a lot of interesting ways of handling this, like quorum commit — you can say "come back when three of my five secondaries have committed," things like that. You can also cascade replication: you can stream from one secondary to another to another. The biggest chain I've heard of is 255 nodes; it takes a long time for the WAL information to get all the way down. Yes — with synchronous replication there's no latency SLA or anything like that. The way it actually works is: the primary sends over the change, waits for the secondary to commit it, the secondary sends notice back to the primary that the commit has happened, and then the primary returns from its own commit. How long that takes depends entirely on the environment — it's also somewhat variable, because it depends on how long it takes the secondary to actually do the flush and acknowledge the commit back.
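A hedged sketch of the streaming setup just described, in the version-9/10-era file layout — the hostname, user, and standby name here are all invented:

```
# recovery.conf on the secondary: stream WAL over a libpq connection
# instead of copying archived files around.
standby_mode = 'on'
primary_conninfo = 'host=db-primary.example.com port=5432 user=replicator application_name=standby1'

# postgresql.conf on the primary: optionally make that standby
# synchronous. This is the big performance hit discussed above --
# every commit waits for standby1 to confirm.
synchronous_standby_names = 'standby1'
```

The per-transaction control mentioned above is just `SET LOCAL synchronous_commit = off;` (or `on`) inside the transaction that should deviate from the global setting.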
So just turning synchronous replication on as a global setting is almost never the right thing to do. Now, there are some weirdnesses in WAL-based replication. The first is replication delay. This is different from replication lag, just to be clear. Replication lag is fairly well understood: it takes a certain amount of time for a change to get picked up, thrown across the network, committed on the secondary, and become visible to queries. Your application has to know that: if it does an immediate read after a write, it could end up with older data than it expected. Replication delay is something a little different. When a WAL change for some data arrives at the secondary, and the secondary is running a query that uses that relation, what should we do? For example, you're doing a bunch of inserts into a table on the primary, and you're running a COUNT(*) on the secondary, and a bunch of changes come in. Remember, these are page-level changes operating at the binary level; they're not transactional, so they aren't going to behave like a transaction would on the same machine. If we just went ahead and applied the changes, the query could return an incorrect value, because the data on disk changed out from under it in a way it didn't expect. So we have two options. We can delay applying the change until the query completes: we stop the replication stream, wait for the query to finish and return its value, then resume the stream, and everything's fine. Or we can cancel the query: "sorry, something happened, retry your query." The way Postgres handles this is that you get a choice: you can delay the WAL stream up to a timeout — and that timeout can be set to infinity, which means delay forever — and after that it starts canceling queries. These are set by two different parameters
inside postgresql.conf: max_standby_streaming_delay and max_standby_archive_delay. They differ by how the secondary is getting its data — over a streaming replication connection or by WAL shipping. The higher the setting, the more potential replication lag, because the secondary can stall the replication stream waiting for a query to finish. The timeout starts ticking the first time a conflict happens: as soon as a page change arrives that conflicts with a query, the clock starts, and when it expires the secondary will, if it needs to, cancel every query that's affected — so an individual query can actually be cancelled after a much shorter amount of time. The typical way of handling this is to dedicate one server just for failover, with these set to zero, and use other servers for read-only traffic with higher values. Giving your failover server a high value increases the data-loss potential quite a bit: if you set it to, say, three hours and you're running giant analytic queries, your failover machine could be up to three hours behind in transactional data, and you probably don't want that. There's another setting, off by default but very frequently turned on, called hot_standby_feedback. A very common situation: you're running along with your queries and suddenly you get a cancellation message, and you say, "what? I wasn't doing any queries on that table on the primary." And you probably weren't. What was happening is that the primary was vacuuming that table, and vacuuming generates page changes just like inserts and updates do, and those can cancel queries just like inserts and updates do. hot_standby_feedback, when on, sends feedback upstream telling the primary which tables are being queried on the secondary, effectively asking it to hold off a little on cleaning them up. That's very useful — but if you're pounding one table very heavily on the secondary, vacuum may never get to it, and that can cause significant table bloat.
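A sketch of how those parameters might look on a read-only reporting replica — the ten-minute figure is purely illustrative, not a recommendation:

```
# postgresql.conf on a reporting standby: let long queries pause WAL
# replay for up to 10 minutes before they start getting cancelled.
hot_standby = on
max_standby_streaming_delay = 10min
max_standby_archive_delay = 10min

# Ask the primary to defer vacuuming rows this standby still needs
# (the bloat trade-off discussed above).
hot_standby_feedback = on
```

On a standby dedicated to failover you would instead leave the two delay settings at 0, so replay is never held back and the failover candidate stays as current as possible.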
hot_standby_feedback also doesn't completely eliminate query cancellations, because you can still get them from a forced autovacuum for wraparound reasons, or from ordinary inserts and updates. The one-line summary: it's a good idea, but monitor the bloat, because if you start getting heavily bloated tables you may need a different strategy. There is also a parameter called vacuum_defer_cleanup_age — don't bother with that one. It's not deprecated as such, but you don't need it. Okay. Before we even had WAL-based replication, how did we do this? How did we do secondaries and failover and all this stuff? And what if we don't like all the limitations WAL-based replication has — the all-or-nothing thing, the can't-write-to-the-secondary thing, the no-crossing-major-versions thing, which is a big deal? Well, Postgres has a really elaborate trigger mechanism; you can do a lot with triggers. What if we attached a trigger to every table, caught the insert, update, and delete operations, and pushed them to the secondary that way? We have an insert trigger, the trigger fires and writes to a queue table, then a daemon somewhere picks that stuff up and replays it — you can fill in the rest. Well, yes, we could do that, and it actually predated WAL-based replication, in the form of Slony-I. This was the first way we did replication on Postgres. Now we have a lot of tools, but we're going to talk about the three most common ones: Slony, which is written in C; Londiste, which is written in Python; and Bucardo, which is written in Perl — plus some others. They all basically work the same way. The big selling point of these tools is much more flexibility.
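The trigger-to-queue-table pattern just described can be sketched in a few lines of SQL. This is a deliberately toy illustration — the table, trigger, and function names are all invented, and real packages like Slony and Bucardo are far more elaborate — but it shows the shape of the capture side:

```sql
-- Hypothetical queue table: each fired trigger appends one change record.
CREATE TABLE replication_queue (
    id        bigserial   PRIMARY KEY,
    tbl       text        NOT NULL,
    op        text        NOT NULL,   -- 'I', 'U', or 'D'
    row_data  jsonb       NOT NULL,   -- captured row image
    queued_at timestamptz NOT NULL DEFAULT now()
);

-- Generic capture function: record which table, which operation,
-- and the affected row (OLD for deletes, NEW otherwise).
CREATE FUNCTION capture_change() RETURNS trigger AS $$
BEGIN
    INSERT INTO replication_queue (tbl, op, row_data)
    VALUES (TG_TABLE_NAME,
            left(TG_OP, 1),
            to_jsonb(CASE WHEN TG_OP = 'DELETE' THEN OLD ELSE NEW END));
    RETURN NULL;  -- AFTER ROW trigger; return value is ignored
END;
$$ LANGUAGE plpgsql;

-- Attach it to each replicated table (here a hypothetical "orders").
CREATE TRIGGER capture_orders_changes
    AFTER INSERT OR UPDATE OR DELETE ON orders
    FOR EACH ROW EXECUTE PROCEDURE capture_change();
```

A separate daemon then polls `replication_queue`, applies the rows on the target, and deletes what it has shipped — which is also where the "doubling all your writes" cost discussed later comes from.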
Depending on which particular package you pick, you can replicate only some databases, only some tables, only some fields; you can filter changes based on rules on the primary before sending them over. In Bucardo you can actually replace the whole capture with a SELECT statement and get really detailed: for example, in a PCI environment you could have it apply a masking function before replicating, so you're masking the credit card number — or, under HIPAA, the healthcare information. Lots of very powerful stuff here. More good stuff: you can build really exotic topologies — meshes and grids and all sorts of wild things. You can consolidate multiple databases down into a single database for data warehousing — yay, data warehouses. Bucardo, alone among the ones we're talking about, does multi-master replication, where you can replicate between the same two tables on two different machines. You can use these tools between different Postgres versions, so you can use them for zero-downtime upgrading — "zero" should probably be in quotes, since obviously there's cutover time. But you can set up a secondary, replicate, replicate, replicate, and then at the right moment quiesce the application, re-home it, and bring it back up. You're done — you don't even have to restart the target database server. Very nice. The downsides: these tools are really tedious to set up. There are lots of gotchas, lots of weird stuff; things just stop working for some mysterious reason. Every table that has to be replicated needs a primary key, or at least a de facto one, because since the tool is capturing a row and pushing it over to the other server, it needs some way of saying which row — which is what a primary key does. The initial copies can take a long time: if you have a 12-terabyte database to do the initial copy on, that's not going to be quick. And if you're combining this with
WAL-based replication as your failover strategy, you're going to have to sort out how everything fits together, because in most of these architectures the secondary can't be a replication source — and how is that all going to work? The answer is: I'm not 100% sure, to this day. You're also basically doubling all your writes — all these triggers firing all the time, all these log tables — and that will have a performance impact. And that automatic DDL distribution? Not here. That's on you: you have to figure out how to apply schema changes to both the source and the target. Of the packages, Slony tends to be the highest performance, but it requires a C-language extension. Londiste requires PL/Python. Bucardo can work entirely outside the subscriber system — it needs nothing special on the subscriber — which is very nice if you're targeting, say, RDS, where you don't get to choose which extensions are available. Bucardo also supports multi-master operation and primary-key updates, which the others don't. If you're on a version that supports modern logical-decoding-based replication, which we're about to talk about, use that instead. These packages work — they're not hugely buggy or anything — but they were written at a time when we didn't have a better way of doing logical replication, and we have better solutions now. They're still very useful for major-version upgrades when the old version is before 9.4. By the way, there are no versions of Postgres earlier than 9.4 that are still supported: 9.3 is now EOL, and 9.4 goes EOL early next year — so think about your upgrade strategy. They're also
sometimes required in specialized environments where you don't have access to logical replication of the WAL stream. On RDS, traditionally — up until relatively recently — you didn't have access to in-core logical replication. You do now, so even on RDS these tools are a little less necessary. Okay. In version 10 — well, actually much earlier, in version 9.4 — we finally got the foundation for in-core logical replication, in the form of a new framework called logical decoding. Streaming replication is a packaged system: it does everything for you; you set the configuration, push the button, and it works. Logical decoding isn't quite like that. It's a framework for implementing logical replication and other things — it really required 9.6 to get going, although it does work on 9.4. The way logical decoding works is that the WAL stream is presented to a decoding plug-in, and the plug-in can transform it back toward logical operations: instead of "set field 3 of this ctid of this relation to this value," you get "update this row to that." To a first approximation — it doesn't literally reconstruct SQL strings — but it does get you much closer to the logical level from the binary level. It works through a construct inside Postgres called a replication slot: a named database object that captures the WAL stream. You can think of it as an output port on the database that the WAL is fed into. Once you create the slot, the framework delivers the decoded WAL stream to the slot's plug-in, which can do whatever it wants with it. The plug-in reports back to the framework when it has processed the WAL, so the segments can be recycled. The horrible truth about replication slots is that the slot keeps track of the WAL position of its consumer — in the case of logical replication, the plug-in. So if the consumer stops
consuming, the framework retains the WAL information so the consumer can catch up — which means the WAL segments stop being recycled, because at any moment that consumer could reattach and start consuming again. So you now have a whole new way of running yourself out of disk space. Monitor your disk space. One of these replication plug-ins — these magical things I keep talking about — is just a piece of C code installed on the primary server, like any extension. It receives a stream of decoded WAL records and can do whatever it wants with them: log them, audit them, feed them to an external data system. I think this particular one is no longer supported — there's a newer one, I'll have to update this — but you can do logical replication into Kafka and just feed all your changes onto a Kafka bus, which is kind of cool. Postgres ships with a test plug-in that provides example decoding and logging, but it's not intended for production use. I keep saying it's not for production use, and then someone comes up to me and says, "oh, I just migrated my 12-terabyte database using the test_decoding plug-in." People are amazing. But what you really want — what we're all in the room for — is logical replication between Postgres servers, thank you very much. So: in Postgres 10 and later we have built-in logical replication, and on Postgres 9.4 and later there's the pglogical extension, which is not in core and not part of the distribution, but is a third-party extension from 2ndQuadrant. The high-level view of how both work is that each takes the stream of decoded changes and applies them at the SQL level. That means constraints are enforced, rows are locked, triggers can fire, MVCC happens — more or less as if a stream of SQL statements were coming in. A database can be both a publisher of changes and a subscriber to changes, which is very nice: you can have wacky topologies where changes flow all over the place, and a single table can be both a source and a target.
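To see logical decoding in action without any external package, the bundled test_decoding plug-in mentioned above can be exercised directly from SQL. A sketch, assuming `wal_level = logical` on the server and a hypothetical `orders` table:

```sql
-- Create a slot attached to the example plug-in that ships with Postgres.
SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');

-- Make a change, then peek at the decoded stream without consuming it.
INSERT INTO orders (id, total) VALUES (1, 42.00);
SELECT * FROM pg_logical_slot_peek_changes('demo_slot', NULL, NULL);

-- The slot pins WAL until its consumer confirms progress, so watch it:
SELECT slot_name, active, restart_lsn FROM pg_replication_slots;

-- ...and drop it when you're done, or WAL will accumulate forever
-- (the "whole new way of running out of disk" from above).
SELECT pg_drop_replication_slot('demo_slot');
```

The peek/get distinction matters: `pg_logical_slot_get_changes` advances the slot, `..._peek_changes` does not.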
However, a single table cannot replicate bidirectionally in either technology — you can't ping-pong changes between two tables. Generally you set things up about the same way in both: you use pg_dump --schema-only to copy the schema over — just the schema, no data. When they connect, both do an initial bulk copy of the existing data, followed by replicating changes going forward, so you get the initial snapshot and every change after that. DDL changes are not propagated by either technology. pglogical provides a function that lets you do it — not automatically, but it gives you a function; in-core replication leaves it entirely up to you. In both cases we again need to identify individual rows for logical replication to work, so every table needs some kind of row identity. The ideal is a primary key or unique index. pglogical requires either a primary key or a single unique index on the table; in-core replication can fall back to using the entire row value to identify rows — and of course, if you have duplicate rows, that's going to be a problem. The keys can have multiple columns, as long as they're declared as a primary key or unique constraint. Sequences are a unique problem in logical replication. The problem is that sequences are not decoded, so they're not replicated: the row values drawn from them are, but the current value of the sequence itself is not. pglogical has a background process that replicates sequences in batch. And if you're consolidating a bunch of sources into a single table, this can become a problem with key conflicts, which means you need distinct key ranges per source database, or non-sequence keys like UUIDs or Snowflake IDs or one of those guys — I like UUIDs.
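The in-core version of the setup just described really is only a couple of statements. A sketch — the publication, subscription, table names, and connection string are all invented, and the publisher must be running with `wal_level = logical`:

```sql
-- On the publisher:
CREATE PUBLICATION app_pub FOR TABLE orders, customers;

-- On the subscriber, after pg_dump --schema-only has copied the schema
-- over (remember: no DDL propagation, and no data in the dump --
-- the subscription does its own initial bulk copy):
CREATE SUBSCRIPTION app_sub
    CONNECTION 'host=db-primary.example.com dbname=app user=replicator'
    PUBLICATION app_pub;

-- A published table with no primary key or unique index needs an
-- explicit row identity before UPDATE or DELETE can be published:
ALTER TABLE event_log REPLICA IDENTITY FULL;
```

This is also where the "operated by DDL vs. operated by functions" contrast shows up: in-core replication uses statements like these, while pglogical drives the equivalent steps through function calls.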
Because of this, logical replicas are generally not suitable for failover: the sequence values won't be up to date, and you can get all sorts of wacky problems. At present — actually version 11 has changed this, but this is a version-10-era presentation, sorry about that — in-core logical replication does not replicate TRUNCATE at all, and pglogical replicates TRUNCATE but not TRUNCATE CASCADE, so if you do truncates that's going to be an issue. You can only replicate from a real table to a real table: materialized views, views, foreign tables, partition root tables — things that kind of look like tables but aren't — can't be sources or targets. If you're using Postgres 10 declarative partitioning, the root table is not a real table, so it can't participate in logical replication as either source or destination. Old-style partitioning, if you're using it, should still be possible with ENABLE ALWAYS triggers — I haven't actually built something to test that yet, but it should work. Logical decoding uses the WAL, so it can't replicate temporary or unlogged tables, because those don't appear in the WAL. COPY operations are broken into individual inserts, because that's how they appear in the write-ahead log — so if you COPY, oh, 10 million rows, you're going to get 10 million inserts on the target database. Between the two, pglogical has a few features that in-core replication does not at present: more flexible conflict handling for when an insert causes a unique-constraint violation, row and column filtering, sequence replication, a few other things. pglogical requires an extension to be built and installed, which is not part of the core distribution, and pglogical is operated by functions, while in-core replication is operated by DDL statements. In both, only a primary node can be a publisher or a subscriber. For the subscriber that's obvious, because
you're changing the database, and a binary replica can't be changed. And unfortunately you can't create a replication slot on a secondary right now, so you can't use a secondary as a source either. If a primary has logical subscribers and you fail over to its streaming secondary, the current logical replication state is not carried over to the secondary, which means you can't guarantee you're exactly in sync with where the primary thought it was on the logical replication stream. You can have synchronization problems here, so be prepared for that: for example, there could be changes applied on the streaming secondary that were never pushed to the logical replication subscribers, so the secondary could be ahead of a logical replica, with a window of changes that never gets replicated. I should draw a line there. Who's running on RDS? Fewer people than I would have expected — often every hand goes up. If you're on 9.6.11 — that specific minor version — or higher, RDS supports pglogical, and on 10 it supports in-core logical replication, finally. You can use both RDS and non-RDS targets, so you can replicate out of RDS with it, which is nice. You can't supply your own logical decoding plug-in, though, so that thing about replicating to Kafka — can't do it on RDS, sorry. Okay, just to close up, there are a few exotic animals kicking around out there in the replication zoo. Who's ever used pgpool-II? And who enjoyed using it? I feel so bad, because people put so much work into that project, and it is so quirky — but it does what it does. It has a feature, which I believe is deprecated now and may even be scheduled for removal, where you can take the incoming query stream and split it between two primary servers, so all operations are applied to both. Now, if we take a moment of silence, we can contemplate all the different ways that could break. So please don't use this feature; we have lots better
options now — it's a very old pgpool feature that predates streaming replication. Amazon DMS: on Postgres this is based on logical replication. It's a general database-to-database data migration service that supports a very large range of source and target database products, and it's primarily designed for migration between different database systems — migrating from, say, MariaDB into Postgres, or wherever. It doesn't support some fairly important types, like timestamptz, so it's not really useful for Postgres-to-Postgres migration — and Amazon, in fact, says don't do that with it. So it only looks appealing; it's really not suited for that. 2ndQuadrant, a commercial Postgres support company, has a product called BDR, for bi-directional replication. It shares many similarities with pglogical because they share much of the same underlying code. It's currently a closed-source proprietary product, though it has been indicated it will be open-sourced in the future, so we're looking forward to that. It can do bi-directional, multi-master replication, and it does it in a mesh topology: if you have 12 nodes, they all replicate to each other. There are also tons of commercial solutions out there for database migration and replication — you'll be buried in Google results. They're pretty much all trigger-based, at least in the Postgres world, because they're generally most useful for between-product migration, or for things like feeding your Postgres database into a commercial data warehouse solution. None of them are bad products or anything like that, but generally they aren't aimed at general-purpose Postgres-to-Postgres replication. So, knowing all that, what should we do?
For failover, use streaming replication — that's what it's good at. When you have a failover secondary that's going to take over if the primary goes down, that's what streaming replication is for. For read-only queries, where you're doing load balancing and can accept a server that can't be modified in any way, use streaming replicas that aren't dedicated to failover — and be aware that a high query load will result in replication lag. If you need logical replication, use in-core logical replication unless you need a pglogical feature, because it's in core, supported by the community, and easy to use. And that's it — thank you very much. [Q&A] So, in the case of logical replication, you can imagine what happens: a row comes in, but there's already a row that conflicts with it in some way, and the question is what to do. The answer in in-core logical replication is: we stop. It throws an error, the replication stream stops, and it's up to you to sort it out. pglogical can do that, or it can always throw away the incoming change, or always write the new record, or decide based on whether it thinks the old row or the new row is newer, using an algorithm described in the pglogical documentation. So it has a little more flexibility in that regard. I'd be a little hesitant to build a system where that was expected behavior, but maybe someone has a solution that requires it. Any other questions? Well — about the data that is transmitted — I'll say the following with about 90% confidence: the binary data as it goes over the wire is different; however, if you attached, for example, pg_receivewal on the other side and wrote it out, that would be binary-identical. Heaven forfend you do
anything that expects that as a precondition, though. I hope we all understand this is for our technical amusement — well, it would certainly be suitable for backup, yes. It won't be corrupt in the sense that you can't recover from it; I just mean you're gambling on a behavior that could change at any second. Okay. Yes — pragmatically trigger-based replication is the only option there, because you can't create a secondary on RDS that replicates to something that isn't an RDS machine. Well — actually, now, if you're moving to 10, or to 9.6.11 or higher, on RDS, the target can be a logical replication target, so if your source database supports logical replication out, you're okay. How old is old? 9.4: on 9.4 and later you have the pglogical option; on 9.3 and earlier, trigger-based replication is all you've got. If you want to keep a fail-back window open, what we'd probably use for that is Bucardo: set it up one way, then tear it down and set it up backwards during the switch-over window. When you switch over, you let it run, and you can fail back if you need to. Did what I just described make any sense? I would run it in only one direction at a time — Bucardo will sort out bidirectional, but it creates a lot of unnecessary overhead on the target. I'm sorry — well, everything has been evaluated by the time the logical decoding plug-in sees it. So if you use current_date, if you use now(), for example, now() will have been expanded into whatever value now is, and that timestamp is what gets presented to the plug-in. It won't be
sent as the now() function to be passed over and evaluated on the target. pglogical has to be installed on both sides, yes. There wouldn't be any special setup beyond that — you create a subscriber on the target machine and a publisher on the source machine, but there's only one subscriber, which would be the data warehouse, right. Okay — thank you very much. [Next talk] Hey, let's get started. (An odd hush falls over the room.) All right, good afternoon. My name is Jonathan Katz, and I'm here to talk about operating Postgres at scale with Kubernetes. Sorry, I'm getting a little feedback — I'll talk quieter. This is actually my first SCALE, and I'm very happy to be here. Joe Conway has been trying to get me out here for years, and finally we found the right excuse, so I'm excited — especially because I just came from New York, where it is literally half as warm as it is here, and you guys are complaining about it being cold. Really? So, other than being a New Yorker, I currently work at a company called Crunchy Data, as the Director of Communications. Previously I was in engineering leadership roles, and that's actually how I got involved in the Postgres community: I'm a Postgres user, and I love Postgres so much I just want to talk about using it in all the different ways. The way this talk came about was actually how I learned to use Postgres more from an operational perspective, as opposed to my usual "hey, I'll write a SQL query that's 100 lines long" — that I'm very comfortable with — versus deploying Postgres. So that's a little bit about me. A quick word about the company: Crunchy Data — we love Postgres. Sorry, this slide was originally laid out for 16:9, so I had to adjust it quickly, but basically Crunchy started out dealing with how to deploy Postgres securely, then
focused on how to do that in the cloud and we love open source like we know all we do is open source we contribute a lot upstream and yeah it's cool and we like hippos and usually the hippos like cut off and it looks kind of cool but again we're we're gonna levitate here because it's a slide layout but of course what we're here about is not to hear about my life we're here to talk about postgres and postgres in containers so quick show of hands how many people are running postgres locally in a development environment with a container cool how many people have at least tried it out cool how about in production are you doing something with kubernetes with postgres cool sort of yeah okay yeah well it's kind of newer how about in general are you doing anything with kubernetes in production yeah so this talk is basically going to be how do you get postgres up and running with kubernetes and you know basically in a production environment and like inherently that means you're doing something at scale because you know not to jump too far ahead into the talk you really are trying to once you're operating at scale that's really when you want to embrace them like kubernetes some sort of you know orchestration platform um so this is going to be through the lens of yeah yeah well it is what it is uh so this is going to be looking at you know running you know postgres you know through containers you know with and how to do it with kubernetes and sort of look at the trend of what's happening when i first started giving this talk it was more about you know why would you even do this and we'll talk a little bit about that but lately i find you know the question shouldn't be more how and like that's really what we're good dive into is like how do we successfully do this but we just thought with why why would you run you know postgres in a containerized process because it's cool you know that's that's that's the trendy buzzword right now well maybe maybe but actually there are 
several advantages to it. With my developer hat on, I'd like to think it makes it so much easier to set Postgres up and distribute packages for developer environments. In my old world — because I did love Postgres so much — I would try to get all these custom extensions into our production application, and for some of my developers who were less Postgres-focused, they could be kind of tricky to set up. Then we'd upgrade to another minor version and it would break the entire environment. We did have a Vagrant box set up, and we were able to sort it all out, but there would still be some pain. With containers, suddenly you can just have your image package it all together and — whoop, here you go, all your extensions, you're good to go, like nothing ever happened. So that's nice.

You can also separate out secondary concerns. One of my paranoias about running additional applications besides Postgres is that they are secondary concerns: I'd like to optimize my Postgres server for just Postgres — all this machine is is Postgres — and try to take advantage of every I/O operation available, every CPU cycle available. But there are certain things that are very important, like monitoring, and even more important, backups, and being able to separate out those concerns is certainly helpful from an administration perspective.

Also, to play buzzword bingo: we want to be elastic. We want to scale, we want to provision new replicas and distribute our workload across our entire cluster. We also want to be able to destroy replicas, because at some point we're wasting CPU and I don't need to fan out my — sorry, my read workloads — anymore. So there are certainly advantages, especially if we're able to have these things packaged and spin them up and down as needed. But it also introduces new challenges. As I like to say, databases are still databases, and you will always be bogged down by one thing: how fast is your disk? From there you can work backwards, but this is where database administration is about understanding the underlying storage environment you're running on. I can fan out my read workloads to a hundred different replicas, but if I'm running on basic cloud storage, it's going to be slow no matter what I do.

There's also a lot that goes on inside of Postgres. You need to tune that postgresql.conf file for your particular workload — there are a lot of settings in there, and you need to understand what makes sense: if I have a read-heavy workload, I might tune it differently than a write-heavy workload. Always keep fsync on, though, because — well, if you care about your data, keep fsync on. Just a side note.

What about management? Certain things are easy when you're managing one cluster, or three clusters, or maybe even ten. But think about managing hundreds and thousands. That's where — and I know, because I'm a developer, so I "do DevOps," though some people don't — that's where all those different automation strategies are important. You'll find this in any database environment, so we can definitely take lessons from running Postgres in virtual machines, and even from running Postgres on bare metal, and apply them to running Postgres in containers.

So that's the introductory spiel — let's have some fun. What we're going to do is deploy something that's going to look like a production Postgres environment. What are some tools that you need to run Postgres in a production environment? Well, for one, you need Postgres — that's very important. And what we did at Crunchy was open source a bunch of container images that are important for
doing exactly that: running Postgres in production. We package PostGIS with a Postgres image. How many people are familiar with PostGIS? Okay — for those of you who are not, the technical term is that PostGIS is a geospatial database extender. Basically, that means you can have maps in your Postgres, and you can do a lot of really cool things with maps. For instance, Paul Ramsey — who is, I guess, one of, if not the, lead contributors to PostGIS — shows how to find every single Starbucks in close proximity with one simple query with PostGIS.

Connection pooling: this is very important, because particularly with Postgres's current model of handling connections, which is one process per connection, you might want to pool your connections together. PgBouncer and Pgpool are tools that do exactly that.

Backups: how many people take backups of their database? Which is the nicer way of asking how many people don't take backups of their database — you don't need to raise your hand, it's fine. But you should always take backups of your database in some way, shape, or form, and pgBackRest is one tool that does exactly that. Monitoring: pgMonitor, which we'll see shortly. And pgAdmin 4, which is a user interface for Postgres — I like using the command line, but for people who don't, pgAdmin 4 is a very nice user interface.

So — I've been talking a lot, and we're so close to getting started with the demos. The idea is that with these containers right here, you can basically create this environment: a highly available, disaster-recovery-safe environment that you can monitor, where you can take your backups, where you can fan out your reads, and the like — which is pretty nice. So how about we create that right now? Would that be cool? Yeah. So — sorry, 16:9 slides — basically what we're going to do is do this with Docker to start. First, we're going to provision a Postgres instance on our local machine — and by "our" I mean mine. I have a little recipe; I will make the slides available after. Okay, so we'll do it in here. Because there's lots of stuff to get to, I'm going to hand-wave some of this.

First, some important things to know when working with Docker, because we're going to need all of our containers to talk to each other: I'm creating a local bridge network, because by default the Docker network does not let the containers discover each other. That makes sense in a production environment, where you might want to lock down your networking rules; in a development environment, quite honestly, I don't care, because I just want to access it — which of course makes development easy, but then you've got to remember to lock it down in production. Using our documentation, I created a database. Secure password? Yep — because otherwise I would not remember it. And lo and behold, it should work — yep, it's up and running. I'm going to call this "whale," with my secure password, and we're there.

If I create a table — so I like to use the new method: since Postgres 10 we have something called identity columns, which are the new serial, for those of you who are familiar with that. That's the SQL-standard way of creating a sequence, which is what we all use for our surrogate keys. So I create the table. Out here, what do we have — beluga? No — actually, I've seen belugas; you can find them on the east coast in Canada. There are blue whales out here, there are humpbacks. Pretty good. And sure enough — so let's play a fun game: what happens if I stop my container? Well, let's make sure I'm not lying... and I start it again. You know, this is the whole thing with containers — the idea is that you
bring up your container, you do a bunch of work, and then you destroy it, and all that work is gone. So is our data still going to be there? How many think the data is still going to be there? Yeah — it is. Yes it is, because I set up a volume, and that's something very important to keep in mind, particularly when we start looking at this at scale. The traditional containerized workload is the idea of a web app, something stateless: I make a request, I do something, I go to the database, I return a response, and when that response is over, I don't care — it's done; I'm a web server, I did my job, I'm trying to return things as quickly as possible. But databases are different, because you want your data to persist after your containerized process goes down. Otherwise — well, unless you're in a test environment, then maybe you don't, but for everyone else — you're going to have a really bad day, and we don't want that.

So let's continue having good days: let's set up pgAdmin 4. How many people use pgAdmin 4? Cool. I believe we put it on port 5050 — oh cool, and it's responsive, that's nice; typically I'm running it at a wider resolution. I think the password is "securepassword." pgAdmin 4 keeps its own metadata — basically for storing access into pgAdmin 4 itself — which, fun fact, is in a SQLite database. I mean, SQLite has its purposes; I'm not throwing shade. One of the nice things about the way we configured our containers is that I was able to set the hostname to "postgres," so that within our Docker bridge network it knows how to find "postgres." Now, normally I throw in a little trick here: usually I'm running Postgres locally on my machine on port 5432, so I bind to port 5433, and then when I do this example I put in port 5433 and it fails — because internally, in the Docker network, it's actually still running on port 5432. I didn't do that this time, because I just happened to not have Postgres running, but I do like to throw that little caveat out, because it can easily bite you.

Maybe we're not responsive — okay, let's see if this works. Yeah, we can zoom out. There we go. No, it's still the modal... there we go, zoom out — yay. This is why live demos are fun: anything goes, even when you rehearse. You should try to rehearse before you do it, but again, anything goes. That's not what I want — well, clearly I don't use pgAdmin 4 that much; as I said, I'm a command-line junkie. Essentially what I want to do is run the query and show that it works — we can still return all the data from the whales table, which we can — and you have all the other niceties of pgAdmin 4.

Now, the next step in setting up this environment is monitoring. How many people actually monitor what they do in their development environment? That's good. But why — if you don't, why don't you monitor things in your development environment? Is it hard to set up? Do you not think about it? I mean, it's fine — but especially if you want to benchmark something, if you really want to see how your query is performing over time, or you're trying to get a gauge of that, it is important to monitor, because it's not just about how quickly your query is performing.

So with three steps we're going to set up a monitoring environment. First, we've got to create a metric collector. The metric collector is basically going to run important queries that pull the key information out of Postgres that we might want to look at — things like open connections, how many transactions are being committed, and, if I have replicas, how far behind my replicas are from my primary, particularly if I'm running
asynchronous replication, and things like that. Then we take those metrics and dump them into Prometheus. Prometheus is — one way to describe it — essentially an event-collection database, a time-series database if you will; I don't know if I like that description, but it's good at dealing with streaming, evented data. And then we can visualize it with Grafana, which is a very nice, customizable graphing tool. Let me remember what my password is... so we set up Grafana here. I think it's this — no, I think it's this — yeah, third time's the charm. So sure enough, we have our database, and it's kind of boring because it's sitting idle right now.

So let's do some stuff. I had some benchmarks written out — where did they go — here we go. I'm going to use pgbench, which is a tool for benchmarking. So we're creating some data... all right, and see — this is actually a boring example, that went too quick, but we see that our data — oh, there we go, I have the refresh rate too low. What do we see? We're monitoring — that's kind of cool. We see our database grew in size, you can see how our transactions are performing. That was initializing the data; now we actually want to run some transactions against the database. Sorry — again, I'm using my cheat sheet. Let's say we're going to run two parallel jobs, with 16 connections, each trying to do — that's kind of a low number — let's say 2,000 transactions. Let's see what happens. I'm not as concerned about the performance; I more just want to see what's happening. We see we have a bunch of transactions happening; we don't have any replication lag, because we're not replicating right now.

Here's another important thing when you're monitoring Postgres: your cache hit ratio. Is my data in memory? If my data is in memory, I can probably retrieve it very quickly; if it's not in memory, I have to go to disk, and that's going to be slow. And you see, when you have a mixed read/write workload, your cache hit ratio might not be a hundred percent; if you have a heavy read workload, you want it as close to a hundred percent as possible, because then your reads will be very fast. Anyway, this is all a roundabout way of saying that suddenly we can get this nice production setup in our development environment.

But that's not the title of the talk. The title of the talk is not "how do you set this up in development" — it's how we run this at scale, with Kubernetes, in production. So let's switch gears a little bit. When do you want to use Kubernetes for Postgres? That is a very good question. As I said at the beginning of the talk, there are many more Kubernetes-as-a-service tools out there right now, but it does take a little bit to get set up, and it takes a little bit to manage and operate everything in the Kubernetes world. So you really want to start considering it when you have the type of workload that suggests it. From my perspective, that's a matter of how many databases you're managing. You can probably manage ten databases on your own, maybe even twenty — I mean, it's a lot of work, you want to have all the right monitors and things set up, but you can do it. But as you start increasing the number of databases you have to manage, you absolutely need automation, and Kubernetes is a platform that helps you do that. At the same time, in the beginning Kubernetes was not designed with stateful applications in mind. A lot more features have been added to Kubernetes to help with this, but you need to understand (a) what they are and (b) how to appropriately deploy them. And this comes back to the
central question. In Kubernetes you have pods, and pods are the atomic unit — they can be a container or a collection of containers — and if your pod goes down, theoretically all your data goes down, unless you've made something called a persistent volume claim. Essentially, you've said: hey, I want to ensure that I have a volume mounted on this container, and everything that gets read or written to this mount persists. You want to make sure that stuff stays there, so that when you re-initialize — when you create a new pod — it can read from that mount point and all your data is still there. That is the most important thing: your data still existing.

So what we did at Crunchy is develop something called the Crunchy Postgres Operator. How many people are familiar with operators? Cool. Operators have actually been around for a while; they were developed by a company called CoreOS. Last year CoreOS was bought by Red Hat, and Red Hat has been pushing the operator technology more and more, in large part because it's designed to deal with stateful workloads. Really, the key word is "operator": there are certain operational concepts that you need to understand for each individual stateful service, and they can vary — the things you might need to do to successfully run Postgres in production might be different from MySQL; the concepts might be similar, but the exact way you tweak the knobs will be different. The idea behind the operator is that you can basically write a program to tweak those knobs correctly for each stateful service you're running.

What the Crunchy Postgres Operator does is exactly that, specifically for Postgres: looking at things like elasticity — being able to provision clusters, scale them out, scale them down, destroy them — and creating a highly available environment. Especially since users like myself can be very demanding, and we want to make sure all of our services are available at all times, being able to create an environment that actually is highly available is incredibly important. Setting up disaster recovery; being able to provision exactly the resource profile — CPU, RAM, storage — that I might need. And the thing is, a lot of us know how to do all of these steps, but imagine having to do all these steps 250 times over. That's the whole idea behind automation: I can do all fifteen steps to successfully provision a replica — once.

So how does it work underneath the covers? It utilizes Kubernetes deployments, which are essentially a way of managing applications — basically, keeping them up and running. The nice thing about deployments is that they allow for flexibility in certain areas. Storage classes: a storage class basically defines how much storage I'm allowed to utilize and what level of disk I get — how fast the disks are. Flexibility in operating environments: basically, which node do I want to run on? For instance, my main production workload I might want to run on the superfast on-premise hardware that I bought for it, and some of it I might want to put in a cloud environment — the idea is that we can get that kind of flexibility. Also flexibility in the database version runtime: I might have certain applications that can only work with Postgres 9.5 for now — I can't get them to 10 or 11 just yet — so being able to manage, call it, a heterogeneous environment is certainly helpful.

So why? I mean, the long story short on the why is that this is no
different than using a tool like Chef or Puppet — the idea is that we're just applying this to the container world, and that's really the long story short. We can create a consistent view of how we manage Postgres, and we have the tools available to scale it as well. You see this in your favorite DevOps tools; basically, we're just applying it to a containerized world. And again, the idea is that we can create this production environment — like the one that we saw with Docker — but times a thousand. That's really the key: having a consistent view, to create a very safe environment for running our databases, and, by the way, being able to create this environment with a few simple commands.

So why don't we actually set up this environment with a few simple commands? First, let's provision a cluster. This is the command-line interface into the Postgres Operator, and as you see, it's very conversational: pgo create cluster, with autofail — that's the flag to basically set up automatic high availability — using pgBackRest, our disaster recovery tool; add some monitoring. Well, I said replica count one — we can actually up that to two so we get both replicas. And we'll call it scale17x. So let's run that. First, let me make sure I port-forward. I'm running version 3.5.1. Oh no, I called it scale17 — so, scale17, a little less. Okay, all right. So if we look at the cluster output, we see: all right, we are provisioning a pod. It takes a little bit of time — and by the way, I'm doing this on GKE, and I'm using the absolute lowest GKE setting, so everything here takes a little bit of time, but that's okay; your mileage may vary. Let's see, are we up and running? Almost there.

Actually, while we're getting up and running, let's take a little look at the create command. As you see from the create command, there are a lot of options to it, and that's by design. You can choose exactly which version of Postgres you want to run. You can label it — and we'll see why labels are useful; there's a label that basically says which hardware I want to send it to. Your replicas can actually run on a different storage class than your primaries, which can be useful: at an old company we definitely saved our best hardware for the primary cluster, and our not-as-good hardware for the replica cluster.

Let's see if we're up and running yet. Okay, so everything's up and running, and if we read through this — well, we've almost set up pgBackRest — we can see how we provisioned things: our database has one gigabyte of storage, and we're using the standard memory and CPU profile. Yeah, let's run some queries. Actually, first, before we run some queries, let's create a user, with a secure password. Cool, so we added a user — and we're going to need to be able to get into that database. "17x" is killing me. It's called "user" — the default one's called... I should have done different commands. There we go — yay, we have a database. We don't have anything in it yet. All right, easy. Let's have some fun.

First, there are a few things we want to know about this database, like: what is its utilization? This is something that is very important, because I might need to reprovision my database once my disk starts to run out, which certainly happens in the cloud world. I can check that with one command — and I see, all right, I've only used two percent of it. That's not much. But now let's do some of that pgbench stuff — which, again, I might actually have this command handy.
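For reference, the pgbench run here might look roughly like the following sketch — the host, user, and database names are stand-ins I've made up for illustration, not the exact ones from the demo, and the client/thread/transaction counts mirror the numbers mentioned earlier in the talk:

```shell
# initialize pgbench's standard tables at scale factor 10
# (pgbench_accounts gets 100,000 rows per unit of scale)
pgbench -h localhost -p 5432 -U pguser -i -s 10 pgdb

# run the benchmark: 16 client connections, 2 worker threads,
# 2,000 transactions per client
pgbench -h localhost -p 5432 -U pguser -c 16 -j 2 -t 2000 pgdb
```

This needs a reachable Postgres instance; against a cluster provisioned by the operator you would point it at the forwarded service port instead.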
All right — so again, as I said, I'm using the stock GKE setting, so — okay, that wasn't so bad. But this means our utilization should have changed: yeah, we're up to a whole four percent. And that's pretty cool, because we can monitor that over time. The other thing, too: if we want to test uptime, just for completeness' sake — okay, everything's running, all my users can connect, etc.

But basically you could say, "Well, Jonathan, this seems pretty boring — I can already do this." And yeah — so let's start looking at things that scale. One thing that really helps you scale the management of your databases is Kubernetes labels. Think of them like tags on a blog: you can basically use them to organize content around everything you want to manage. If I think about it, perhaps I have a project related to my payment application, or a project related to my CRM, or whatever it is — I can use that to appropriately label my databases for scale purposes. So here, let's say I'm at the current conference, SCALE: I add a label to it, and I create a couple of new clusters for the future conferences, scale18x and scale19x — and this time I get the "x" in. What's cool is that this gives me different ways of applying commands across the project. For instance, let's say I only want to see all the clusters for our current conference. Now let's say I just want the future ones — cool, here's 18x and 19x.

Okay, but we're just visualizing things; that's kind of boring. Here's where it starts getting cool: suddenly I can manage users across all these clusters. Let's say I just provisioned 50 clusters and I need to give myself access to them. Well, with one command — basically the same exact command, using the selector flag — I can say: all right, add jonathan to all the future conference databases. And just like that, I've been added — and to prove it, jcat is there. But then they realize we don't want Jonathan involved with these future conferences — well, we can delete them. Are we sure? Yes, absolutely, we're sure. And look, he's gone. That's pretty cool. And again, this concept is not unfamiliar from the DevOps world, the deployment-automation world, but we've now applied it to the containers world, particularly in this Kubernetes environment.

So how about HA and horizontal scaling? Well, it is elastic — and again, I know I'm going to need to change it — so if I say pgo scale, add another replica... and I think — my typo. Let me also show some other things too; there are a lot of different commands. In fact, let me digress a second: you can basically see everything that you can do — you can of course backup, restore, failover, scale, upgrade. Again, mass-applying upgrades is another big thing: say you're running Postgres 11.1, and 11.2 came out a couple of weeks ago, and you want to upgrade and get all those bug fixes — well, you can do it across all of your 1,000 clusters with one command. Now, you might not want to do that, for a lot of different reasons, but you can, and that's the beauty of all of this.

So I think all of our replicas should be up now — yeah, they're up, cool. Here we go, both of our replicas are there. Now let's say we're having problems with the primary — whatever the reason; it could be a network issue, which we can detect with the automatic HA setup — but for now, let's force the failover. How can we do that? Well, first we can query it — nope, of course we're not going to find any clusters close to "17x" — but first of all, understand which replicas we have available and what their timeline is: basically, how far have they caught up to the primary in terms of having all the changes applied to them. Then, once we've determined which one we want to fail over to — let's say we're going to do this one — et voilà, we've started the failover. Failover takes a little bit of time; in fact, we can see this in a few different ways. If you remember our pod name, we've actually already failed over to the new primary, and we set it up so we can reprovision a replica — because of course, in an HA environment, you want a certain number of replicas running at all times, typically at least two. And in a lot of different HA environments, you might have two synchronous replicas and an asynchronous replica set up, or three synchronous replicas where you get quorum when you hit two — there are all these different ways of setting it up. Let's see if we can connect to our new database — we'll keep going while we wait for that.

Another important thing is setting up your backup policies. It's important to schedule backups for a lot of different reasons. Typically, when I scheduled backups in the past, I'd use cron, because cron works — it's tried and true. But in the operator world, we can apply backup policies across all your different databases. In this case I'm just doing it to the scale database, but I can say: look, at midnight every night take a full backup, and then every six hours take a differential backup — basically a snapshot of all the changes that have occurred between midnight and 6 a.m., 6 a.m.
and noon etc etc and then you know we could see what that schedule actually looks like and then of course this and then of course disaster recovery um you know you could take you know you can of course take backups ad hoc and you can go in and then suddenly you know drop your most important table like your user table and then you need to do a point in time restore to come back to the time that uh you had all of your users so you can do that too and typically this is you know this might not be something you do at scale this is something that you're going to apply to a single database when you're doing a recovery but you know you have you know essentially a unified interface to do that let's look ahead a little bit so containers no longer new you know kubernetes is matured as an orchestration technology and we see you know a lot being standardized around it and you know people are certainly more comfortable running you know not only running their production services with containers um you know they're adopting kubernetes as the platform of choice to do that what's one of the newer things is this you know this push around operators um basically because operators can capture the nuances of running staple services and if you look at it from a higher level you're essentially giving yourself a caudagnostic as a service when you run an operator you know if i put kubernetes running you know even in different zones in you know in a google cloud environment or i can have a kubernetes running in google and then kubernetes running in amazon you know i basically created my own you know i can basically run my own blank as a service by doing that that's really cool i'm like you know the open source enthusiast to me loves that because it's you know that's that's one thing i always enjoyed is being able to take whatever my workload was and move it anywhere um you know there's always you know people do talk about you know containers and databases like you know what about storage and 
On the storage question: if you think about it, when you virtualize something, sure, there's a bit of a performance penalty when you're pushing things to and pulling things from I/O. But if you're running things in the cloud you're already doing that; you're just doing it with virtual machines. So again, your mileage may vary; test it out. I don't necessarily know what makes the best sense for your environment. In my old environment I ran my databases on bare metal for as long as I could, until I got a virtual environment set up where everything was SSD and flash-based storage, and it was killing the old bare-metal machine I had, so from then on I ran everything in virtual machines.

Databases are still databases, too. We looked at all these different provisioning techniques, but you still need to tune your postgresql.conf file for the workload you're running, and you need to understand what that workload actually is in your application. Just because you can scale things up and down and provision like mad doesn't mean it's a panacea; you still need to understand your specific application and what you're doing.

In December the Federation v2 API was released for Kubernetes, and the idea is that you can be running different Kubernetes clusters and have them communicate with each other. One question I get asked is: can you take the Postgres operator and federate it? The straight answer would be no; we're waiting for that API to be stable in Kubernetes. And this actually creates more options.
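For a sense of what that looked like, a federated resource under the early kubefed v2 API was roughly shaped like this. This is a sketch of an alpha-stage API; the group, version, and fields are illustrative only, and none of this is something the Postgres operator supports:

```yaml
# rough sketch of a kubefed v2 federated resource (alpha API; fields illustrative)
apiVersion: types.kubefed.io/v1beta1
kind: FederatedDeployment
metadata:
  name: pg-operator
  namespace: pgo
spec:
  template:
    # an ordinary Deployment spec goes here
    spec:
      replicas: 1
  placement:
    clusters:            # which member clusters receive this resource
    - name: cloud-a-cluster
    - name: cloud-b-cluster
```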
Yes, if you're running an operator in one cloud provider and you have Kubernetes in another cloud provider, there's going to be some latency. But with federation you can be running multiple Kubernetes clusters in different environments and have them communicate, so that the operator in cloud A is able to manage all the clusters in cloud A, the operator in cloud B manages all the clusters in cloud B, and they talk to each other. Looking ahead, that's why it's on this slide.

Monitoring is interesting in this case too, because what do we monitor now? We can monitor our database, but then we need to monitor our Kubernetes, monitor our operator, monitor our environment; suddenly there are a lot of monitors to pay attention to. That's something interesting to think about: how much monitoring is enough? Is there too much? Not enough? What does it all mean? Another interesting thing is looking at your resource profiles. Coming back to that: I know I allocate a certain number of CPUs and a certain amount of RAM for running my database workload, but then I also need to understand how to configure that in my postgresql.conf. So that's another thing to think about.

Anyway, here's the overall vision: with containers and Postgres you can easily set up development environments that mimic your production environment, you can run your own production database-as-a-service, and, as you've seen, through deployment automation we can have the same exact tools in a containerized world. And with that I'll break for questions.

No, no, I'm sorry, I should have said that: it's all open source. This is entirely open source. Sorry, I usually make sure I get that in.
But yes, this is all open source. I was kind of wondering if people were leaving because they thought this was proprietary. No, like I said, what we do is all open source, so it's out there now; you can download it, play with it, and so on. I will put the slides on SlideShare and then tweet them out, and we can upload slides to the SCALE website.

Okay, so that's handled by the operator, because the operator will essentially switch your timeline and switch how your WAL is streaming out. Yeah, it's a different PVC; there's a different PVC set up for it, and the WAL just starts going there. Okay, one more question.

So it depends; give me an example of the type of operation you're thinking about. Right, so that's how you would do it. One thing you can do with the containers, for instance, is the postgresql.conf file. You didn't see me do any operations with that, but you can mount the postgresql.conf you want to manage and then edit it from there. If you need to have a look, you can always connect to the pod: you could set up a port forward, say you want to SSH into it, get a session on the pod, and do what you need to do. But keep in mind that as soon as that pod goes down, any tweaks you made there are gone. So, long story short, it can get a little complicated in that regard. But you can always get into the pod; because it's running as a service, you can set up whatever port forwards you need to go into the pod, and you can handle it that way. I'm sorry? Yes, yes; there's a giant CRD file where you can basically add whatever CRDs you need. Cool. Well, thank you.
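For reference, the postgresql.conf-via-ConfigMap approach described in that answer could look something like the following sketch. The names are illustrative, and the shared_buffers value is deliberately chosen to stay in line with the container's memory request, per the resource-profile point from earlier:

```yaml
# hypothetical sketch: manage postgresql.conf through a ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: pg-custom-config
data:
  postgresql.conf: |
    # keep these in agreement with the pod's resource requests,
    # e.g. roughly 25% of a 4Gi memory allocation for shared_buffers
    shared_buffers = 1GB
    max_connections = 200
---
# excerpt of a pod spec that mounts it (illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: postgres
spec:
  containers:
  - name: postgres
    image: postgres:11
    # point the server at the mounted config file
    args: ["-c", "config_file=/pgconf/postgresql.conf"]
    resources:
      requests:
        cpu: "2"
        memory: 4Gi
    volumeMounts:
    - name: pg-config
      mountPath: /pgconf
  volumes:
  - name: pg-config
    configMap:
      name: pg-custom-config
```

Because the file lives in a ConfigMap rather than inside the pod's filesystem, edits survive pod restarts, which is exactly the problem with tweaking files directly inside a running pod.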