I'm Srihari. I work at Nilenso, a software cooperative based in Bangalore, and I really like working there. Today I'm here to share my experiences building and instrumenting a Postgres cluster ourselves, and I'm going to do that by telling you stories: stories from the time we worked on building this Postgres cluster. Obviously things didn't work, and things failed miserably. These are those stories, and I'll tell you what we learned from them. In each story I'll tell you what really happened and what went bad, what we did to get things up and running again quickly, what the root cause was, how we dug into it, and how we made sure it doesn't happen again.

But before I tell you these stories, it's essential to establish some context around why we built things the way we did. The business is essentially that of an experimentation platform. It's a microservice, a service that other services talk to, so its consumers are machines, and machines can call us very fast, and they do. What's important here is that the experimentation platform was used to drive decisions based on data: teams would decide which product to use amongst two vendors, or which algorithm to use amongst many, based on the numbers the experimentation platform gave them, and those numbers had to be accurate to the seventh decimal place, which was kind of bizarre but true, and we had to ensure our system gave them that accuracy. As an experimentation platform it needed to do two things. The first is to tell people which bucket they belong to in an experiment, very abstractly; that's the real-time service, the API calls. The other is to report across the buckets of an experiment and say that this variant did so much better than the other. That's very abstract, and it's a lot more complex than that, but I'm going to leave it there for now.

I'm not going to go into a lot of detail here, but I do want to justify that for us data integrity was primary. PostgreSQL is amazing at storage, and we needed a transactional, relational database that had storage as one of its primary concerns; PostgreSQL was it. Most of our requests to the database need to complete in under five milliseconds or so, and PostgreSQL could give us that. And like I said, the reporting needs were such that we had to think about a medium-sized data warehouse, a few TBs, and ensure that we could report over it.

This is what our numbers look like right now. We have an SLA that says 99.9% of requests have to be served under 10 milliseconds, and we have a transaction rate of around four and a half thousand per second. That's not in the big-data realm or anything, but I'd say it's very sizable for a real-time transactional database that's about 600 GB, where our biggest table, which we report on and write to quite often, is just above a billion rows. All our databases run on EC2, on i2.2xlarge instances, backed by provisioned IOPS.

With that, here's my first story, from around February 2014. We were re-architecting the system a bit, trying to understand what we could do to make it more resilient, so we sat down and looked at different solutions to enable high availability for Postgres.
Here's the thing: we needed a read replica, because we knew we needed live reports, which meant the reporting machine needed to get the data at just about the same time our transactional database got it, which meant synchronous replication. At the time, RDS did not have support for Postgres read replicas, so that was straight out of the picture. There was another constraint: we couldn't be down for more than five seconds. By "down" I mean the service couldn't be down for more than five seconds, and the service was bottlenecked at our DB, so the DB couldn't be down for more than five seconds either. That meant we needed automatic failover. RDS now has read replicas and multi-AZ failover and such, but it still doesn't cut it, because five seconds is too small even for RDS. So we needed that, and we knew that in Postgres land there must be something that could help us here.

We looked into the tools that were around at the time. One of the more popular ones was pgpool, and it did a lot of things, which scared us a bit. Being new to this, we wanted a firm grip on everything that was happening, and good control, so that if anything went wrong we'd be able to dig in and fix it. pgpool did not give us that: it was so much abstraction that it imposed a black box on us, and that was not acceptable. Bucardo did something quite different, multi-master and asynchronous replication, which wasn't what we needed, so that was also out of the picture. But repmgr seemed like a win. It did one thing, manage replication and do automatic failovers, and it seemed to do it well, and it's written by the 2ndQuadrant team, which is very well respected in the Postgres community. So we looked at it and we were able to make it work.

This is what our cluster looked like at that point. We have repmgr daemons running on each machine, there's passwordless SSH access between all of them so they can talk to each other, we have a master DB and synchronous replication to the standby, and repmgr maintains its own table inside the database: essentially its list of nodes, their priorities, and their current statuses. Just to give you an understanding of how repmgr is used, it basically has a few commands that say "clone a standby", "promote this standby to a master", or "do a switchover"; it's that simple. And this is what our setup looks like: we have an application that's load balanced, all talking to one database, and one of the applications is the reporting service, which talks to one of the standbys, the reporting DB.
All of this was fine, but we had overlooked one thing: while repmgr handles the failover within the database cluster itself, it does not tell the application that a failover has happened. That was something we needed to do ourselves, and it turns out it wasn't hard at all. Consider a failover: your master DB went down and a standby became the master, but your application still keeps talking to the master that went down. repmgr has a hook: it says, here's the promote command, and I'm going to run a script when I promote a standby to a master, and you can do whatever you want within that. So that was simple. All we had to do was send an API call to the application machines saying, here's your new database configuration, point yourself at the new database. It seemed a bit raw, but it was fine at the moment.

We added another line of defense, because we thought it was necessary. The second line of defense was that the application knew about the cluster (this was a conscious choice), and the application could go to each machine and ask it, "are you a standby or a master?", and pull that information as well. So if the cluster configuration had changed and the push from repmgr to the application had failed, this would still work. That was a nice second line of defense, and with this we were confident enough to go to production. We ran a bunch of tests locally, integration and performance, and then we went to production, and Philippe, our head of engineering, said, "do a failover and show me that it works." The first time, it did not. Nor did it the time after that. But we knew enough about repmgr that we could go into the code, understand what was happening, and fix it, which was really nice. After those two tries we got it right the third time, and the fourth time, and after that. Things were looking really good.

Now, in hindsight that does seem a bit raw, and there's a more straightforward way to do it: a virtual IP mechanism, facilitated by something like ucarp, or even your cloud provider's virtual IP mechanism. You tap into that, and then only the VIP mechanism knows about the cluster and the failover; your app can remain ignorant of it and just say, here's the IP, I'll talk to this IP. That has its own flaws as well, and I'll get to them, but it seems like a cleaner approach.

And then this happened one day: we got a notification on our phones saying, hey, your AWS machine went down. We were like, shit, this is our master database, we should do something about it. But we went and looked at our consoles, and it turned out we were fine. A failover had happened, an actual disaster had occurred, and the application was talking to the right database, and everything was smooth. That was our first success after setting all of that up.

So what have we learned? We've learned that repmgr works: it does one thing, it does it well, and the code is surprisingly readable, just like Postgres, written by sort of the same people. We can use push/pull strategies or a virtual IP mechanism to handle the communication between the application and the cluster. And we should be aware of the fact that AWS can drop our boxes, and maybe even test that.
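To make the pull side of that push/pull strategy concrete, here is a minimal sketch of how an application could discover the current master by asking each node whether it is in recovery. This is not our production code; the hostnames, credentials, and the use of psycopg2 are assumptions for the example.

```python
# Minimal sketch of the "pull" line of defense: ask every node in the cluster
# whether it is in recovery, and treat the one that isn't as the master.
# Hostnames, credentials and error handling are placeholders.
import psycopg2

CLUSTER_NODES = ["db1.internal", "db2.internal", "db3.internal"]  # hypothetical

def find_master(nodes=CLUSTER_NODES):
    masters = []
    for host in nodes:
        try:
            conn = psycopg2.connect(host=host, dbname="postgres",
                                    user="app", connect_timeout=2)
            with conn.cursor() as cur:
                # pg_is_in_recovery() is true on standbys, false on the master
                cur.execute("SELECT pg_is_in_recovery()")
                in_recovery = cur.fetchone()[0]
            conn.close()
            if not in_recovery:
                masters.append(host)
        except psycopg2.OperationalError:
            continue  # node is down or unreachable; skip it
    if len(masters) != 1:
        raise RuntimeError(f"expected exactly one master, found {masters!r}")
    return masters[0]

if __name__ == "__main__":
    print("current master:", find_master())
```

The application can fall back to a check like this whenever the configuration it was pushed looks stale.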
You can also test your failovers rigorously. What we did was have a Vagrant setup and a few tests that we would run locally, randomly shutting down a node, making sure we covered all kinds of failure scenarios. We had tests to run these things locally; we didn't go so far as to put this on CI, but I don't think that's impossible to do.

With that, I'll tell you another story. One night Philippe, the head of engineering again, pokes me on HipChat and says: hey, wake up, your service is doing really badly, it's failing its SLA, and you need to fix it immediately. So I wake up and look at it, and sure enough, where we were supposed to be serving the 99.9th percentile under 10 milliseconds, we were at 45. That was a catastrophic failure, and we had to do something to fix it. With something like that, instinctively I imagined it would be related to a resource failure of some sort, and sure enough, it was the disk, but it was at 80%. Now, I knew we had let it grow to 80%, and that was a problem we were already trying to fix, but we didn't expect it to fail at 80%; we thought 90 maybe, and even then we were only thinking about Postgres. But this wasn't the time to think about that; we needed to get things up and running. I got on a call with Philippe and he said: okay, we can lose some data on the runtime side, but we need to get the service up and running, and we need reports, so the data we report on can't be lost.

This was actually not very difficult to handle at all, because we had a cluster. There are four databases that have the same data. Everything is at 80%, because all the disks are the same size, but they all have the same data, so the three standbys are all live backups. If I could just take one standby out of the cluster, I could do whatever I wanted with the rest of the cluster and the data would still be safe on that standby. So that's what I did, and it's actually quite easy: you stop the database, tell repmgr that you're removing it from the cluster, remove the recovery.conf that tells it to talk to the master, and then start it by itself as a standalone database. Now the standby node is out of the cluster; the cluster is still at 80%, but this node is isolated and our data is safe there. And now we're free to delete data from the cluster. I can truncate a table or something, requests go through, and I can get reports from the isolated node. At this point we are up and functioning: we've lost some data on our runtime database, but I can serve reports from the isolated copy manually, and the SLA is met.

But back to that question: I thought it would fail at 90, so why 80? Turns out it's ZFS. Somewhere buried in the ZFS best practices is a line that says don't go above 80%, it's probably bad for you, and when you dig into it, it talks about fragmentation issues and so on. That was a lesson learned for us: 80% is critical, and we never touch it now. But there was another problem. I said we were trying to fix the disk usage; we had been trying to delete data from Postgres all along as we approached 80%. We thought, I can delete hundreds of thousands of rows overnight and that should free up some space, but it hadn't. And then it hit us right in the face: Postgres implements MVCC, multiversion concurrency control, and everything was obvious after that.
What that implies is that when you delete or update a row, Postgres essentially flips a flag on the row that marks it invisible to transactions that follow, but the row is still on the disk, and it's still in memory. The way it handles this is with a GC-like autovacuum process that later sweeps through and removes the dead rows, and even when it does, it only reclaims the space within Postgres; the space is almost never given back to the OS. Here's a rough state diagram of how this works: an insert becomes a live tuple; a delete makes it a dead tuple; the vacuum process pushes that space into the free space map; and if a new row comes in, it reuses it. For the most part there's a cycle there, and Postgres keeps its space within itself; only in the rare cases where there's a VACUUM FULL, or a table gets rewritten, will some space be reclaimed by the OS. We needed to be aware of that. Here's a snapshot of some of the dashboards we built after this incident: that's the DB size, and you can see the last two graphs are sawtooth waves, very similar to what you'd see with a JVM GC, and that's important to monitor.

So what have we learned? We've learned that standbys are great live backups, and that you should never go above 80% on ZFS. Tune your autovacuum workers aggressively: for big tables the standard autovacuum configuration is very poor, so you have to actually test it, make sure you're using the right amount of disk I/O and so on, and that's a long process you must do in an environment that's very close to your production environment. And to ensure you're getting all the data you want at all times, monitor your disk usage, your dead tuples, your GC-like sawtooth, and the autovacuum process itself.
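As a rough illustration of that kind of monitoring, here is a minimal sketch that pulls the database size, dead tuple counts, and last autovacuum times from pg_stat_user_tables. The connection details and the psycopg2 usage are assumptions for the example, not our actual dashboards.

```python
# Minimal monitoring sketch: dead tuples, last autovacuum run, and DB size.
# The connection details are placeholders.
import psycopg2

def bloat_report(dbname="appdb", host="reporting-db.internal"):  # hypothetical
    conn = psycopg2.connect(host=host, dbname=dbname, user="monitor")
    with conn.cursor() as cur:
        cur.execute("SELECT pg_size_pretty(pg_database_size(current_database()))")
        print("database size:", cur.fetchone()[0])

        # Tables with the most dead tuples, and when autovacuum last touched them
        cur.execute("""
            SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
            FROM pg_stat_user_tables
            ORDER BY n_dead_tup DESC
            LIMIT 10
        """)
        for relname, live, dead, last_av in cur.fetchall():
            print(f"{relname}: live={live} dead={dead} last_autovacuum={last_av}")
    conn.close()

if __name__ == "__main__":
    bloat_report()
```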
That brings us to another story. At this point our service was about a year in, or no, even less, about seven or eight months, and it worked, so people wanted to use it more. More people put up more experiments and they wanted more reports, and what we needed was a better reporting box. We only had four cores or so on our reporting box and we needed to make it bigger. So we had to do what would normally be considered a routine operation: add another node to your cluster. That should be simple; we had done it in the past when we had to add another standby, and everything had been smooth so far, and it was even this time. I started off the clone, and it said: I have started receiving WAL, and I'm transferring at 40 Mbps. Looking at jnettop or whatever, I could see data coming in at 40 Mbps and the disk getting filled up, and at that rate I thought that the next morning when I woke up, maybe five or six hours later, it should be fine; I should have my new reporting box all caught up.

The next morning I woke up to this: "requested WAL segment has already been removed". I didn't understand this at the moment. I understood WAL, WAL means write-ahead logs, and I knew this had to do with replication, but I didn't really piece it all together. So let's try to understand it a bit. When you start a clone, repmgr by default runs a pg_basebackup. This is like a snapshot at a point in time: if you run pg_dump, it takes a snapshot of your database at one point in time and gives you a dump from there. pg_basebackup does something very similar, but what we have to keep in mind is that the database continues to get data while the backup is being created. The data that comes into the database while the new clone is happening is written to the WAL, and this is how Postgres replication really works: it's just WAL getting shipped from the pg_xlog directory of one machine to the other. The way this normally happens in a clone is that after the pg_basebackup completes, WAL recovery starts; that's where the new standby starts streaming the WAL data from the master, and the database can't start up until it has fully caught up with the master.

What we had done that previous night was somehow overflow that WAL. There's a limit to how much data you keep in pg_xlog, and we had set it, using the wal_keep_segments configuration, to about eight gigs' worth, because that was more or less the amount of data we got in a day. It turned out that that night we had gotten more than eight gigs of data, so the limit wasn't high enough, it had overflowed, and the standby was looking for data on the master that didn't exist any more. But what we had there was an almost 600 gig database that was almost caught up with the master, with only about 10 gigs missing, and repmgr has an rsync-only option, which meant I could just sync the files. I thought, okay, the remaining 10 gigs should be easy to transfer over. I waited for two hours, three, and it didn't happen; I could still see 100% CPU and nothing on the network. Digging a little deeper, what I saw was that it was rsyncing the entire data directory. It can't do just the WAL, and it was running a checksum on every file; it has to checksum every single file of all the 600 gigs before it can tell what data it needs to transfer. And then I looked in the repmgr docs somewhere, and they say this might not necessarily be faster than doing a fresh base backup. So that's what I did: I made the 8 gigs 80 gigs, so it could hold about ten days of data, and restarted the clone, and given a light-traffic day everything would work just fine, and it did.
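For context on that sizing: wal_keep_segments is expressed in 16 MB segments, so eight gigs is roughly 512 segments. Here is a minimal sketch of measuring how much WAL the master actually generates over an interval, to sanity-check a number like that. It uses the 9.x function names we were on (pg_current_xlog_location, pg_xlog_location_diff); newer Postgres versions rename them, and the host and credentials are assumptions.

```python
# Rough sketch: measure WAL generated per hour on the master, to sanity-check
# wal_keep_segments (16 MB per segment on the versions we ran).
# Host/credentials are placeholders; on Postgres 10+ the functions are
# pg_current_wal_lsn() / pg_wal_lsn_diff() instead.
import time
import psycopg2

def wal_bytes_per_hour(host="master-db.internal", sample_seconds=600):  # hypothetical
    conn = psycopg2.connect(host=host, dbname="postgres", user="monitor")
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute("SELECT pg_current_xlog_location()")
        start = cur.fetchone()[0]
        time.sleep(sample_seconds)
        cur.execute("SELECT pg_xlog_location_diff(pg_current_xlog_location(), %s)",
                    (start,))
        written = cur.fetchone()[0]  # bytes of WAL written during the sample
    conn.close()
    return written * 3600 / sample_seconds

if __name__ == "__main__":
    per_hour = wal_bytes_per_hour()
    segments_per_day = per_hour * 24 / (16 * 1024 * 1024)
    print(f"~{per_hour / 1024**3:.2f} GB WAL/hour, ~{segments_per_day:.0f} segments/day")
```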
But how do we ensure that this doesn't happen again? It turns out this is fairly standard practice: there is a concept of WAL archiving. What you do is, as you receive WAL, you ship it to another box altogether, call it the WAL archive, and Postgres has a way for you to say "this is how I get WAL data from an archive". If it can't get the data from the master, it'll try to get it from the archive. It's as simple as that. There's another way to do this, which I found more recently: you can just pass a flag to pg_basebackup and it will stream the WAL data in parallel as it comes in. This implies that you lose one more slot where you could add a standby, but in the scheme of things that's rather cheap, and it has been there since Postgres 9.2, so it's a really neat option. There are downsides to this as well, and I can talk about them later. And yet another way, one that we use often, is file system backups. We use ZFS in production, and we take snapshots and send them incrementally to a backup machine. This is very fast and cheap, and there's no reason why anyone shouldn't do it. ZFS send tends to be a little faster than a pg_basebackup: where a pg_basebackup completes in six hours, this finishes in five and a half or so. So it's a little better, and it also transfers the WAL directory along with the master's file system, so you get a little more data when you clone. After you have the snapshot of the master's file system on your new standby, you can just let it recover, and that's fairly clean, as long as you assume your file system snapshot itself is clean, which means you take your file system snapshot after doing a Postgres checkpoint.

So what have we learned? We've learned that we should think about WAL recovery; WAL recovery is part of the clone, and we can protect against WAL recovery issues using WAL archives, rsync (it's slow, but it's still an option), and file system backups. The things you want to monitor while adding a new standby are network throughput and DB load, so that you know you're not getting too many requests and running out of WAL, and in general the disk I/O.

That brings us again to another story. This was one of those stories where the issue was not that difficult to understand, but was more difficult to fix. We had a bunch of 500s on our reporting service: people were trying to get reports, we were not able to give them, and we had to run reports manually and hand them over. That was bad. Looking in the logs, this is the error Postgres throws: it says "conflict with recovery". We kind of knew about this, and we kind of knew how to protect against it, but not really; we hadn't encountered the issue as deeply as we did then. So, they're running too many queries; let's try to understand this a bit. The master database gets small transactions, basically scattered reads and writes, but the reporting database gets long queries, reports that run for ten minutes or so. While a report is running, the WAL recovery is also happening at the same time. Consider this scenario: say it's April, you're getting reports for April, and you're running a report query for this month, while the master is also receiving requests that change the data for April. Now, it's not that Postgres can't handle reads alongside writes; that's what MVCC is about, you can read in parallel as writes happen, and that's fine. But Postgres replication also replicates maintenance operations like vacuum, so the WAL stream also contains the information necessary to remove rows from under the query. That is not acceptable, so what the standby does is pause the replication, pause the WAL redo, until the query finishes, and it says: I will let you run the query for a maximum of so much time. That max_standby_delay setting you see is dependent on the source, so you have one for archive and one for streaming, and you can say how long you're willing to wait for each. When a query goes beyond that, it says: no, you can't run any more, I can't let the standby fall further behind, I will cancel you. That's the error we get when it does that: "canceling statement due to conflict with recovery". And then it goes into the opposite state, where your reporting service is shut off and the WAL redo continues, and if you try to connect to your database then, it will refuse the connection and say that the database system is in recovery, you need to wait. So this was what was happening, and we needed to fix it really quickly, and also fix it forever. The immediate fix was to increase the delay.
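The knobs involved there are max_standby_streaming_delay and max_standby_archive_delay. Here is a minimal sketch of checking those settings on the standby, along with the per-database conflict counters Postgres keeps in pg_stat_database_conflicts; the connection details are assumptions for the example.

```python
# Sketch: look at the standby's conflict-related settings and how many queries
# have actually been cancelled due to recovery conflicts. Connection details
# are placeholders.
import psycopg2

def recovery_conflict_report(host="reporting-db.internal"):  # hypothetical
    conn = psycopg2.connect(host=host, dbname="postgres", user="monitor")
    with conn.cursor() as cur:
        cur.execute("""
            SELECT name, setting, unit
            FROM pg_settings
            WHERE name IN ('max_standby_streaming_delay',
                           'max_standby_archive_delay',
                           'hot_standby_feedback')
        """)
        for name, setting, unit in cur.fetchall():
            print(f"{name} = {setting}{unit or ''}")

        # Cancellations by cause, per database; confl_snapshot is the counter
        # that grows in the vacuum-removed-rows scenario described above
        cur.execute("""
            SELECT datname, confl_snapshot, confl_lock, confl_bufferpin
            FROM pg_stat_database_conflicts
        """)
        for row in cur.fetchall():
            print(row)
    conn.close()

if __name__ == "__main__":
    recovery_conflict_report()
```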
We were on AWS, and we didn't have network issues, at least as far as we could see, so we could afford a little room there, and we said it's okay for a query to run and for the standby to fall behind for ten minutes or so. But what are the ways to fix this properly, so that it never happens again? Postgres has one such mechanism. Assume your reporting service is doing a bunch of queries on your reporting database. You can use the hot_standby_feedback configuration to say that your reporting database should send information about the queries it's executing back to the master. The reporting database can basically say, "I'm running queries on April data, don't vacuum April data", and the master says, okay, I'm not going to do that. What that means is that the master won't send WAL that cleans up those rows, so you can run your queries and that works fine, but you're basically not running vacuum on those rows, which means your database will bloat. You don't clean up, and we were already having issues with our purging, so that was not really an option; we had enough bloat already.

So we had to look at another way to solve this, and one of the usual ways is to make the report queries themselves faster. We did a bunch of query optimizations, and I'm not going to go into those, but a few questions came back to us: why aren't we doing partitioning? Why aren't we using a star schema for our reporting database? Isn't that how you do reports normally? And we can't. If we need live reports, we need synchronous replication, and Postgres replication sends you binary data, so you don't really have much room there: you have to use the same schema as you do on the master. Changing the transactional schema was an option we refused to take, because the API service was critical and we could not afford to change it. There are other ways to do this now, and I'll talk about that in just a bit. But while you can't change Postgres replication, you can change what's around the database: the hardware, or the file system. So that's what we did; we made it a bigger box. You can take the same database, getting the same data, and make it faster from underneath. Reporting databases do large, chunky I/O, so you can optimize for that: higher IOPS definitely helps. The other thing was the ZFS record size. On a transactional database you want your file system record size and your database page size to more or less match, which for Postgres is about 8 KB, so we had tuned ZFS to write 8 KB records, and for OLTP that's totally fine. But for reporting, reading 8 KB at a time is super slow when you want to be reading gigabytes of data, so we changed it to about 1 MB and it was immediately about 30 times faster. That really solved most of our problems right there. And of course, if you're running ten-minute-long queries on a single database, each one is going to take up a core, so more cores really help.

So what have we learned from this? We've learned that queries might be cancelled, so handle that; don't throw 500s at your clients. You can apply back pressure; it's an option, but you might not want to do that if it causes too much bloat. You can't use different schemas, maybe. And you can change the file system and hardware around Postgres without affecting replication. And those are some things that you can monitor to ensure that this doesn't happen.
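On the "handle cancelled queries" lesson, here is a minimal sketch of what retrying a report query after a recovery-conflict cancellation might look like, instead of surfacing a 500. It matches on the error message rather than a specific error code, and the query, host, and retry policy are assumptions for the example.

```python
# Sketch: retry a report query if the standby cancels it due to a recovery
# conflict, instead of returning a 500 to the client. Connection details,
# the query, and the retry policy are placeholders.
import time
import psycopg2

def run_report(sql, params=None, retries=3, host="reporting-db.internal"):  # hypothetical
    for attempt in range(1, retries + 1):
        conn = psycopg2.connect(host=host, dbname="appdb", user="report")
        try:
            with conn.cursor() as cur:
                cur.execute(sql, params)
                return cur.fetchall()
        except psycopg2.Error as exc:
            # "canceling statement due to conflict with recovery" is what the
            # standby raises once WAL redo has waited long enough
            if "conflict with recovery" in str(exc) and attempt < retries:
                time.sleep(2 ** attempt)  # back off, then try again
                continue
            raise
        finally:
            conn.close()

# Example usage (hypothetical report query):
# rows = run_report("SELECT bucket, count(*) FROM events WHERE month = %s GROUP BY bucket",
#                   ["2015-04"])
```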
We don't have a perfect cluster yet; I could share more stories with you and talk about them later, but I just want to give you a bunch of scenarios that we prepared for and haven't really encountered in production just yet, because they're still quite interesting.

One scenario is split brain. You might have heard about it; that's when a network partition does something bad. Consider that there's a network partition and your master goes away. A standby becomes your master and you have a proper failover, so your app is talking to the new master and your old master is unreachable. At this point, after the failover has occurred, your repmgr table says that one node has failed and the other two standbys are pointing to the right master, and everything's fine. Now, what happens if the node that went down comes back up? The network partition somehow resolves itself and it's reachable again. At this point you have two masters. What do you do? Your repmgr table says this is a master and this is also a master. The standard response in such cases is to shoot the other node in the head, and some systems automate this; repmgr doesn't. And if you actually look at it, the cluster is totally fine: the application is talking to the right master database and none of the standbys knows about the other master. Everything's fine; it's just that the other master exists, and you need to kill it.

Another such scenario is when your application does not learn about a failover. repmgr does the failover correctly, but what if the push fails? You're making a network call and it got interrupted, or say your promote command had a bug in it and wasn't able to send the status. What happens then? And even with the VIP strategy: repmgr does the failover within itself, but it still needs to communicate that to the virtual IP mechanism somehow; what if that fails? One of the ways to prevent that from hurting you is to have multiple lines of defense, and our push/pull strategy is one of them, but it's possible to think about other such setups where you have an external system monitoring this, ensuring that there are multiple ways for you to know about a failover.

We've actually covered a bunch of these backups, so let's quickly contrast them and see what we can use where. A standby is a live backup, probably the best you can use: it's the same size as the DB, it's immediate, and you have all the data that's in the master there at all times. That's great, but the problem is that it won't protect you against an accidental delete. You can connect your tests to your production database and immediately drop a table or something; that has happened, and a standby will not protect you against that. WAL archives are great because they are replayable: you can replay WAL archives right from the beginning of time to now and be all caught up, which is great, and they also help in resurrecting standbys when they lag back too much. The problem is that they can be very big. WAL is slightly bigger than the records themselves in Postgres, so WAL archives need a lot more space, and if you're trying to bring up an entirely new standby, this is probably not the way to go, because it involves a lot of redo time.
Logical backups are essentially us doing pg_dumps, and these are resource intensive, so we probably don't want to do them on the master. You can do them on the standby, but you can think of them as very large queries, which means that replication will have to pause, and it's possible that doing a pg_dump means your standby lags back far enough that it can't recover from it. But the great thing about a logical backup is that you can take it and make it work even across architectures, so it's great as the backup you use to upgrade your database system; if you're moving from 9.3 to 9.5 or something, this is the way to go. And file system backups: I don't see why anyone wouldn't do them. They're fast and cheap, and if you use ZFS or something, you can do snapshots and roll back, and these are great for testing. We do them all the time to test migrations: whoops, I did something bad, okay, no problem, just roll back and do it again. So they're really useful that way. The problem is that you can run into an integrity risk: if you had taken the file system snapshot not after a checkpoint but in between, when a file was half written or something, then your database won't start up, and that's bad.

And here's one last scenario. You have a database, and say the disk under it degrades, so you're down to 40% disk I/O. This doesn't trigger a failover; it's just really slow, returning things in 45 milliseconds instead of 10. Possibly one of your standbys can take over as master; it has its full disk capacity and it can do that. But how do you detect something like this? How do you ensure that the standby is capable? Would you have to bring it up as a master to check, because you have to write to it? Would you use a circuit breaker to detect this? I'm not really sure. We have thought about this a bit, but I just want to leave it out there as something that you can think about and reason about, because that's essentially what all of this kind of work, building resilient systems, involves. And that's all I have. Thank you.

Audience: Thanks, very interesting talk. I really liked it because I went through some of these phases myself. I want to ask a couple of questions. One is about the file system backups: you said they can get you into integrity issues. Why would that happen?

Srihari: If you didn't take a checkpoint before you took the file system backup, it's possible that Postgres had not fully written something out, and the database wouldn't start back up.

Audience: But you can do a pg_start_backup and then do an rsync, and that works fine, right?

Srihari: Right, yes, you can recover using an rsync after that, but the file system backup by itself didn't work; you'd have to fix it using the rsync.

Audience: No, I'm saying that when you take the whole file system backup, either using rsync or whatever ZFS thing you're doing, if you do a pg_start_backup before you start the process, I'm assuming it will be consistent.

Srihari: Yes, you can do that. As long as you take a checkpoint before you start the backup, before you do the snapshot, you're clean.

Audience: That's what I'm saying; I don't think you'll get an integrity issue if you do that.

Srihari: Yes, but we have taken snapshots without doing checkpoints in the past, and we have encountered these issues.

Audience: Okay.
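To make that exchange concrete, here is a minimal sketch of bracketing a ZFS snapshot with pg_start_backup()/pg_stop_backup(), which forces a checkpoint before the snapshot is taken, in line with the point above about checkpointing before snapshotting. The dataset name, host, and the 9.x exclusive-backup function names are assumptions; newer Postgres versions rename them to pg_backup_start/pg_backup_stop.

```python
# Sketch: take a ZFS snapshot of the data directory from a consistent state by
# wrapping it in pg_start_backup()/pg_stop_backup(). Dataset name, label and
# connection details are placeholders; exclusive mode, run against the master
# on the 9.x versions we used.
import subprocess
import time
import psycopg2

def consistent_snapshot(dataset="tank/pgdata", host="master-db.internal"):  # hypothetical
    stamp = int(time.time())
    conn = psycopg2.connect(host=host, dbname="postgres", user="postgres")
    conn.autocommit = True
    with conn.cursor() as cur:
        # pg_start_backup performs a checkpoint before returning
        cur.execute("SELECT pg_start_backup(%s)", (f"zfs-{stamp}",))
        try:
            subprocess.run(["zfs", "snapshot", f"{dataset}@{stamp}"], check=True)
        finally:
            cur.execute("SELECT pg_stop_backup()")
    conn.close()

if __name__ == "__main__":
    consistent_snapshot()
```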
Audience: The other thing I wanted to ask about is the bloat issue with hot_standby_feedback. I don't quite understand why that would happen.

Srihari: What hot_standby_feedback does is say "don't clean up this data", so that rows aren't removed from under the standby while it's running queries. And if it says don't remove the data, don't vacuum this data, then not running vacuum causes bloat.

Audience: But that's only for some period of time, only for the ten seconds to ten minutes the query runs.

Srihari: That is true, but think about reports running all the time, which is the case here: people keep running reports one after the other, and this can quickly add up. One query runs into the next, there's an April query, and then there's a March one running somewhere there, and it quickly adds up.

Audience: Thanks.

Audience: Thank you, it was wonderful. I have a question. You said that when you were running queries on your reporting DB, adding more cores helped. But in Postgres your queries don't really run on multiple cores, right? If you're running one big single query, it doesn't help. So are you taking into account having different big queries running, which can use multiple cores?

Srihari: Yes. Each query will run on a separate connection, which is a process, so the more cores you have, the more queries you can run. Not one single query.

Audience: So you can't parallelize queries?

Srihari: We can't parallelize query execution at the moment; what we can do is have multiple queries running at the same time. So the more cores you have, the more reports you can run.

Audience: Thanks.

Srihari: Thank you.

Host: Okay, thank you. Please don't forget to fill out your lucky draw.