Today we are going to talk about the high availability solution for PostgreSQL that we've been building at our company for the past two years. About myself: my name is Alexander, I've been working with PostgreSQL since 2001, and I've touched it from different sides, as a DBA, as a system administrator and even as an application developer. For the last four years I've worked for Zalando. Zalando is the biggest European fashion retailer; basically we sell clothes and shoes in 15 European countries, revenue in 2016 was 3.6 billion euros, so it's quite a big number. And of course we run our business mainly on PostgreSQL.

Zalando started in 2008 as a really small startup, just a few people working there, and over the years we've been growing: now we have offices not only in Germany but also in Finland and in Ireland. A few more numbers: because the business is growing really fast, we need a huge number of tech employees to support all of this, so right now we are more than 1,600 technology people working in different cities in Germany and across Europe. We run more than 150 databases in the old-fashioned data center, somebody calls it legacy, somebody has other names for it, and last year we created more than 130 databases on AWS. That decision to move everything to AWS was actually the main driving factor for building a high availability solution.

So why did we choose Postgres? In the beginning Zalando was running on MySQL and Magento, probably somebody is familiar with this software, and at some point in 2010 it became obvious that this solution didn't scale anymore, so the decision was made to change the platform and rewrite everything. As the database they chose Postgres, and at that time it was 9.0 release candidate 1, so brave guys, and great Postgres. The backend was implemented in Java; they rebuilt everything in three or four months and migrated smoothly, without any problems.
When you have just one single database, everything is easy. You treat this database like your own kid: you know all the good and bad behavior of your kid, you know the pain points, you know how to treat it well. But when the number of databases grows, you cannot treat each one as a kid anymore; you probably consider them pets. You still give them names, you probably remember the important ones, but not every detail. And things really change when you move to the cloud, because in the cloud you no longer treat everything as a pet. Every new piece of hardware is not real physical hardware available just for you, you share it with somebody else, and at any moment something may happen. And what do you do with cattle? If it gets sick, you just kill it and bring in a replacement. And because in the cloud you can more or less get what you want, you can have master and replica in the same cluster on a similar setup, so it's important to configure everything the same way everywhere.

Like I said, at any moment something may happen and an instance goes down. What if this instance was running your replica? It's fine, you bring a replacement and rebuild the replica. But if it was the master, you need to fail over. Unfortunately Postgres doesn't provide automatic failover out of the box. It has all the building blocks: it has streaming replication, so you can have a hot standby, and as soon as the master goes down you can promote it manually, but there is no ready-made solution from Postgres itself; there are a bunch of other solutions.

How does it usually happen? You see that the master is not available and decide to promote a replica. But how do you know that the master is not available? You will probably run some health check from the replica. But are you really sure? Maybe it was just a short network glitch, the master was unavailable for a short time and will become available again, and if you promote such a replica, in the end you have a split brain. Not nice. So what comes to mind: probably we should use some third-party node to make the decision. This node monitors the master, maybe monitors the replica, and if it sees that the master is not available, it promotes the replica. But there is still the possibility that the master is actually still there, just separated, partitioned in the network. And what happens if that monitoring node also goes away? Probably we need more than one monitoring node, but then these nodes have to agree among themselves which one is doing the monitoring and which one is the backup. So again we end up with a distributed consistency problem, and distributed consensus is not an easy task to solve. Maybe somebody has already tried to solve it, and it's really hard; there are existing solutions, and it's probably better to use them.

Luckily, about two years ago the company Compose in the United States wrote a blog post about high availability for Postgres and gave a really nice talk. They decided to use etcd to implement automatic failover. The idea was quite good, but the implementation had some problems, and in the end we really needed new functionality, so we forked Governor, the high availability solution from Compose, and now we have Patroni. We support not only etcd, but also ZooKeeper and Consul, so basically any distributed configuration system of your choice. etcd and Consul implement Raft, a distributed consensus protocol, and ZooKeeper implements ZAB, which is somewhat similar to Paxos, but not really Paxos.
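To make the split-brain risk concrete, here is a minimal sketch of the naive approach just described: a watchdog that promotes the local replica as soon as a single health check against the master fails. This is emphatically not how Patroni works; the connection string, user and data directory are made up for the example.

```python
# Naive failover watchdog: illustrates the approach that leads to split brain.
# NOT Patroni code; connection string and data directory are hypothetical.
import subprocess
import psycopg2

MASTER_DSN = "host=10.0.0.1 dbname=postgres user=monitor connect_timeout=3"
REPLICA_PGDATA = "/var/lib/postgresql/data"

def master_is_alive():
    try:
        with psycopg2.connect(MASTER_DSN) as conn:
            with conn.cursor() as cur:
                cur.execute("SELECT 1")
        return True
    except psycopg2.OperationalError:
        # Could be a dead master... or just a short network glitch.
        return False

if not master_is_alive():
    # Promote the local replica. If the old master is merely partitioned
    # away and still accepting writes, we now have two masters: split brain.
    subprocess.check_call(["pg_ctl", "promote", "-D", REPLICA_PGDATA])
```

The missing piece is exactly what the distributed configuration systems below provide: a single, consistent answer to the question "who is the leader right now".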
So what are the key features of such distributed systems? You usually run an odd number of nodes, three, five or seven. All writes happen via a single node, and a write only happens if the majority of the cluster agrees: the leader sends the request "I want to do this transaction" to all other nodes, so if you have five nodes it sends it to the remaining four, and as soon as it gets at least two answers agreeing that the write can happen, it has a majority, three out of five, and the write succeeds. So basically it solves the distributed consensus problem for us.

These systems also have really wonderful features. They give you the possibility to set an expiration on keys or on sessions, and that means we can use them to check whether the master is alive or not: we create a leader key with some specific TTL, for example 30 seconds, and periodically update this key, say every 10 seconds. If the key is not updated within 30 seconds, it expires, any other node sees that the key is gone, and that means we have no master. These systems also provide atomic operations, so you can create or update a key under a certain condition, for example "please create this key only if no such key exists yet", which is important when we run the leader race. And the last important thing, we can watch keys: we are really interested in what is happening with the leader, and we really want to know when the leader goes away, so we watch the leader key, and if it disappears from etcd we get a notification immediately.

And what do we keep in the distributed configuration store? Keys like config, initialize and members: every node keeps some information about itself. The leader key is just a pointer, it holds the name of the node which is the leader. The optime key keeps the leader's current xlog location, and the failover key is used when you want to run a manual switchover rather than an automatic one. And how do the values look? The initialize key contains the cluster system identifier; it's very important when you join the cluster that you have exactly the same system identifier, because otherwise it doesn't make sense, you will not be able to replicate from it. The leader optime is just the absolute xlog position in bytes. The config key is a special one: we can keep any arbitrary configuration of Postgres or of Patroni itself in there, for example max_connections, which must be the same on every cluster node. And then there are the keys which expire if they are not updated: the leader key has a pointer to db node 2, and the member key for db node 2 has some JSON structure inside showing how this node can be accessed by the application, plus the URL of the Patroni API so that the Patroni instances can talk to each other; in this case they have a TTL of 22 seconds.
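These primitives map almost one-to-one onto the etcd v2 HTTP API. As a rough sketch, assuming the requests library, a made-up cluster name "demo" and a local etcd endpoint, this is what creating, refreshing and watching the leader key looks like (error handling kept to a minimum):

```python
# Sketch of the DCS primitives against the etcd v2 HTTP API.
# Cluster name and endpoint are made up for illustration.
import requests

ETCD = "http://127.0.0.1:2379/v2/keys/service/demo"

def acquire_leader(my_name, ttl=30):
    # Atomic "create only if the key does not exist yet" (prevExist=false).
    r = requests.put(ETCD + "/leader",
                     params={"prevExist": "false"},
                     data={"value": my_name, "ttl": ttl})
    return r.status_code in (200, 201)   # 412 means somebody else won the race

def refresh_leader(my_name, ttl=30):
    # Conditional update: only refresh the TTL if we still own the key.
    r = requests.put(ETCD + "/leader",
                     params={"prevValue": my_name},
                     data={"value": my_name, "ttl": ttl})
    return r.status_code == 200

def wait_for_leader_change():
    # Long-polling watch: returns when the leader key changes or expires.
    return requests.get(ETCD + "/leader",
                        params={"wait": "true"}, timeout=60).json()
```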
So how do we bootstrap a new cluster? First of all, when you start Patroni, it goes to etcd and figures out that there is no initialize key, so it tries to create it, and if it creates this key successfully, it runs initdb. If it failed to create this key because some other node was faster, it just waits until the leader key appears in etcd and uses that node to fetch a base backup from.

On every node we periodically run an event loop. The master updates the leader key on every iteration, and if the update fails, it means something is wrong, either with the network or maybe with etcd, it doesn't matter: if the update doesn't happen, we are not allowed to be the leader anymore and we have to demote. The master also periodically updates its xlog position and information about itself, how healthy it is and so on, which can be used for monitoring. And last but not least, it maintains replication slots, because every replica connects using a replication slot, which lets the master keep all the WAL segments that are still needed by replicas. What are the replicas doing? They watch the leader key; if the key is not there, they start a leader race, and if the leader key is there, they check that recovery.conf contains the correct information, and if it doesn't, they write a new recovery.conf and restart. We also support cascading replication, so on every replica we maintain replication slots as well, and of course every node has to update its member key periodically, because otherwise it expires.

And how does the leader race work? When a node sees that there is no leader key in the DCS anymore, it evaluates its xlog position against every other node, including the former leader, and if the node considers itself the healthiest one, it tries to create the leader key, and if that was successful, it promotes. This picture shows what happens: two nodes, A and B, try to create the leader key at the same time, and we specify that the key must be created only if it doesn't already exist, prevExist=false. Node A succeeded and promotes Postgres; node B didn't succeed because node A was faster, so it does not promote and will follow the new master.
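Put together, the leader race could be sketched roughly like this. The callables passed in are hypothetical stand-ins: get_wal_position abstracts asking each member for its xlog position (Patroni exposes this through its REST API), acquire_leader is the atomic etcd create from the sketch above, and promote/follow stand in for pg_ctl promote and rewriting recovery.conf.

```python
# Rough sketch of the leader race; the callables are hypothetical stand-ins
# for Patroni internals, not its real function names.

def run_leader_race(me, members, get_wal_position, acquire_leader, promote, follow):
    my_position = get_wal_position(me)       # absolute xlog position in bytes

    # Compare with every other member, including the former leader: it may
    # still be alive but simply unable to reach etcd.
    for other in members:
        if other == me:
            continue
        try:
            if get_wal_position(other) > my_position:
                return False                  # somebody is ahead, let them win
        except Exception:
            continue                          # unreachable nodes cannot win anyway

    # I look like the healthiest node: take the leader lock atomically
    # (create the key only if it does not already exist, prevExist=false).
    if acquire_leader(me):
        promote()                             # pg_ctl promote, new timeline
        return True
    follow()                                  # somebody else was faster
    return False
```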
And I prepared a live demo. I will use Docker Compose to run it, it's a really wonderful tool for Docker orchestration. So here we are, we are running docker-compose, and it started one node with etcd, three nodes with Postgres and Patroni, and one node with HAProxy and confd. As you can see on the screen, this node, db node 1, was faster, that's why it managed to initialize the cluster and run initdb, and the two other nodes just ran pg_basebackup and joined as replicas. This is just the output of docker ps on the screen, and here is what happened on the node running HAProxy: we have two frontends, one for the master running on port 5000 and one for replicas running on port 5001, and these frontends are linked with backends. The backends route traffic only to the node that responds with status 200 to a request for the master health check, and for replicas it's the other way around, a node responds with 200 only if it's running as a replica.

Let's try to connect. Here we are connecting via HAProxy, and it connected to the master. This query simply shows the current transaction ID and some aggregation over pg_stat_replication, so basically we have two replicas connected, asynchronously. Now let's try to kill the master horribly, just run docker kill, and yeah, we got disconnected. The information in etcd is still there, it still knows about the leader, and it will take up to 30 seconds for db node 2 and db node 3 to figure out that the leader has gone. Okay, it's gone, and db node 2 was faster: it made a request to db node 3 and to the former leader, to figure out whether the former leader is maybe still alive but for some reason cannot update the information in etcd, and to compare its xlog location with db node 3. It decided "I'm the healthiest one" and promoted, selecting timeline 2. And what happened on the other node: it wasn't that fast, it tried to create the leader key but failed, that's why it didn't promote; instead it updated recovery.conf, restarted and joined, so now it is running as a replica but replicating from db node 2. Let's have a look at the HAProxy config: db node 1 is not there anymore.

Now we can try to start db node 1 and bring it back. It takes some time because the replication slot doesn't exist yet, but now it's connected, the HAProxy config is updated again and the node is back; patronictl shows the current status and everything is fine. Okay, let's do a failover, patronictl failover, or rather a switchover back to db node 1, just because we want to, and we do support that. It asks where the current master runs, that's db node 2, then where we want to fail over to, we specify db node 1, and when exactly we should do it: we have the possibility to schedule a failover for any point in time, for example if you want to do a controlled failover early in the morning when there is minimal traffic, that's possible. We answer that we want to fail over now, and there is one more question, "are you really sure", yes. It was fast, the master is already running on db node 1, and node 2 and node 3 are updating recovery.conf and restarting.

We also support a smart synchronous mode. In this case we enable synchronous mode via the Patroni API with a PATCH request. Db node 1 decided that db node 2 is going to be the synchronous standby, so it updated postgresql.conf, did pg_ctl reload, and after that exposed this information to etcd. It means that if the master dies, only db node 2 will be able to take over. You can see how the patronictl output changed, it shows that the synchronous standby is db node 2. Let's kill db node 1 once again: the same picture, the same query, but here we see that db node 2 is synchronous. Let's watch, it will again take up to 30 seconds in the worst case. Yes, streaming replication is broken in both cases and db node 2 should take over soon. Yes, it's running as the leader, and db node 3 is already connected, but asynchronously, and just a second later, yeah, it has chosen the new synchronous node. Okay, that was it for the demo.
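The same operations driven through patronictl in the demo can also be done with plain HTTP against Patroni's REST API. The sketch below shows a switchover request and enabling synchronous mode; the host, port and node names match the demo, but the exact field names (leader, candidate, scheduled_at, synchronous_mode) are from my reading of the API and may differ between Patroni versions.

```python
# Driving Patroni over its REST API instead of patronictl.
# Field names reflect the API as I recall it and may differ between versions.
import requests

PATRONI = "http://db-node-1:8008"

# Manual (or scheduled) switchover: tell the cluster where the master runs
# now and which member should take over; scheduled_at is optional.
requests.post(PATRONI + "/failover", json={
    "leader": "db-node-2",
    "candidate": "db-node-1",
    # "scheduled_at": "2017-07-03T05:00:00+02:00",  # e.g. early-morning low traffic
})

# Enable synchronous mode cluster-wide: PATCH the dynamic configuration,
# Patroni stores it in the DCS and every node applies it.
requests.patch(PATRONI + "/config", json={"synchronous_mode": True})
```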
I have a few more slides to show. So what do we support? We support manual switchover and scheduled failover, and we support synchronous mode. We can re-attach the old master with pg_rewind, because if your database is huge, a few hundred gigabytes or maybe terabytes, re-initializing it from scratch takes a huge amount of time, IO and traffic. We support not only pg_basebackup: we can re-initialize a replica using whatever you want, for example WAL-E or some other tool. Linux watchdog support is coming really soon; it's important to kill the master, maybe the entire instance, if Patroni doesn't run, because it may happen that for some reason Patroni crashed while Postgres is still running, and we don't want such a Postgres to keep performing transactions. We have the possibility to pause automatic failover: basically everything works the same, but without automatic failover, so you can still fail over manually, and you can shut down some nodes for maintenance or, for example, for a major upgrade. And we have a wonderful tool, patronictl, to see the current cluster state and perform actions.

Another really wonderful feature is dynamic configuration. We are able to change Patroni and Postgres parameters via REST API calls to Patroni; Patroni saves this information to the DCS, to etcd, and all nodes in the cluster apply these parameters dynamically. It's really necessary to ensure that some parameters are the same on all cluster nodes. In this case we have Patroni parameters like ttl and loop_wait: ttl basically says how long the cluster can live without an update of the leader key, it's 30 seconds by default, and loop_wait is the time between iterations of the high availability loop, 10 seconds by default. There are also parameters like retry_timeout and so on. It's also quite important to keep some Postgres parameters the same on every node; basically that list of parameters is taken from the Postgres documentation. And what's also important, when you change something, Patroni can figure out whether a reload is enough or a restart is needed. It will not restart Postgres automatically, but it will give you a hint that a restart is needed.
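That reload-versus-restart distinction can be reproduced by hand from pg_settings: parameters with context 'postmaster' only take effect after a restart, most others after a reload. A small illustrative check, with a made-up connection string:

```python
# Does changing this parameter need a restart, or is a reload enough?
# This mirrors the pg_settings distinction described above; DSN is made up.
import psycopg2

def change_requires_restart(param_name, dsn="host=localhost dbname=postgres"):
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT context FROM pg_settings WHERE name = %s",
                        (param_name,))
            row = cur.fetchone()
    # 'postmaster' parameters (max_connections, shared_buffers, wal_level, ...)
    # only change on restart; most others apply after a reload (SIGHUP).
    return row is not None and row[0] == "postmaster"

# e.g. change_requires_restart("max_connections") -> True
#      change_requires_restart("work_mem")        -> False
```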
And with Patroni we can build custom solutions for high availability. We run Patroni on AWS with Docker and WAL-E, and we run Patroni in the old data center on bare metal. On AWS we use etcd, in the old data center we run ZooKeeper, and as far as I know there are some companies running Patroni with Consul, so that also works. How does this look on AWS? We run an EC2 instance, on that instance we run our custom Amazon machine image, and this image just starts one Docker container. Maybe it's a little bit stupid, but those were the rules of play in the company and everybody has to follow them; on the other hand, this way we really isolate Postgres on a single EC2 node. Inside Docker we run one instance of Patroni, and Patroni starts and restarts Postgres. There is also WAL-E doing continuous archiving: once per day it pushes a base backup to S3, and every WAL segment is archived to S3 as well. etcd runs separately. And how do we deploy all of this? We use a CloudFormation stack, which is quite convenient: we just specify that we want an auto scaling group with three nodes, one master and two replicas, and we use an elastic load balancer to direct traffic to the master and the replicas. And to conveniently connect to master and replicas we also create two DNS records in Route 53.

As you may understand, the story is not so exciting if there are no hidden walls. For example, if the master cannot update the leader key, what should it do? Probably it should demote, but how fast? Maybe if it retries, the update will succeed. But on the other hand, the longer you keep trying, the more transactions you are probably still running, and as soon as a replica is promoted it will not have those transactions and you have a split-brain situation again. Consul and ZooKeeper also have some configuration quirks: ZooKeeper cannot have a session longer than 40 seconds by default, if I'm not mistaken, and Consul, for example, cannot have a session shorter than 20 seconds. It's possible to change that, but you need to know about it; you really need to read the documentation and learn the systems you are running.

Also, how do we choose a new master? By default every single node tries to participate in the leader race, but maybe you are running one node for analytical purposes and you don't want it to be promoted. You can tag this node with nofailover: you just mark it and it will not participate in the leader race. When you run the leader race, you compare the xlog position of every node with your own, and if you have the highest position it means you are the best candidate. But what if you are the only one and no other nodes are there, and you are restoring not really from a base backup but with WAL-E or Barman, so you are restoring from S3 storage, for example? In etcd we still keep the last leader optime, and we don't allow a node to promote if its xlog position is behind this optime by more than maximum_lag_on_failover. By playing with maximum_lag_on_failover you can implement a disaster recovery solution, which is actually what we did ourselves.

As I already said, when the master crashes it's important to easily connect it back to the cluster, and for that we use pg_rewind. But what if pg_rewind fails, maybe because some WAL segments on the master are already gone? The easiest solution is to just wipe the data directory completely and re-initialize the replica, but if your database is huge you probably don't want to do that, so we also have a switch for that.
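A rough sketch of that re-attach logic, under the assumption that a full re-initialization is only allowed when an operator switch permits it: the pg_rewind and pg_basebackup flags are real, but the data directory, connection string and the allow_reinit switch are made-up names for illustration.

```python
# Re-attach a crashed former master: try pg_rewind first, fall back to a
# full re-initialization only if explicitly allowed. Paths/DSN are made up.
import shutil
import subprocess

PGDATA = "/var/lib/postgresql/data"
NEW_MASTER = "host=db-node-2 port=5432 user=postgres dbname=postgres"

def reattach_old_master(allow_reinit=False):
    try:
        subprocess.check_call([
            "pg_rewind",
            "--target-pgdata", PGDATA,
            "--source-server", NEW_MASTER,
        ])
        return "rewound"
    except subprocess.CalledProcessError:
        # pg_rewind can fail, for example when the WAL it needs is already
        # gone. Wiping a multi-terabyte data directory is expensive, so only
        # do it when the operator said it is acceptable.
        if not allow_reinit:
            raise
        shutil.rmtree(PGDATA)
        subprocess.check_call(["pg_basebackup", "-D", PGDATA,
                               "-d", NEW_MASTER, "-X", "stream"])
        return "reinitialized"
```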
There are some useful links: Spilo, our appliance for running Patroni on AWS with Docker; confd, a wonderful tool which generates a configuration file from a template and information stored in etcd, ZooKeeper or Consul, or even from environment variables; and of course etcd and Raft, very useful links to read. So, questions.

Yes, it's possible to mark a node as not participating in the leader election. It still runs the same high availability loop, it does everything any other node does, all the same actions, and like any other node it exposes information about itself in etcd; it just will not join the leader race. Basically we have two situations: first, if you run synchronous mode, only the node which was synchronous is allowed to promote, and if that node is also down, nobody will promote. Second, every node compares its absolute xlog position with every other node, and whoever is ahead will try to take the leader lock and promote. With asynchronous replication, no, we don't really guard against that.

But, something I didn't mention in this presentation, we have the possibility to trade data durability against high availability. It might happen that, for example, the master simply crashes and Patroni tries to restart it, and if the restart really takes long, maybe you don't want that; instead you want to just promote the node which has the most recent data. We have an option which says how long the master is allowed to take to start up: by default it's five minutes, and if you set this parameter to zero it will not even try to start, it will remove the leader key and give somebody else the chance to take over. But yes, we don't handle data loss if you are running asynchronous replication.

Yes, sure, here is a link to GitHub; it's developed open source, so feel free to install it and try it. The current implementation supports only one synchronous standby, but with 9.6 it should be quite easy to implement a feature to allow more than one. Any other questions? Basically we support all versions starting from 9.3; 9.2 is already kind of outdated, it's not really recommended to use, so if you run such an old version, maybe you should consider upgrading. Sorry, once again? It all depends on your ping between data centers. For example, we run our etcd cluster on AWS, and because in Europe, in the Frankfurt region, there are only two availability zones, in order to be truly highly available we run one node in Ireland; the ping between the Frankfurt and Ireland regions is roughly 20 milliseconds and etcd works perfectly. But basically we run Postgres, so Patroni and Spilo, highly available only within one AWS region. Of course one region has two availability zones, physically two different data centers, but the ping between them is quite small, roughly one millisecond, which is good. It's not a problem to run one node in another region, as long as the ping is reasonable. And actually Patroni itself just needs a connection to etcd: you can run etcd in one region and have one Postgres replica in some other region, that's not a problem, because all leader election happens via etcd. As long as your etcd or ZooKeeper is running reliably, you are safe. So, thank you, thank you for coming.
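As a closing sketch of the durability-versus-availability knob mentioned in that answer: the option described is, as far as I know, master_start_timeout (300 seconds by default). The callables below are hypothetical stand-ins for Patroni internals, not its real function names.

```python
# Sketch of the trade-off controlled (I believe) by master_start_timeout,
# 300 seconds by default; the callables stand in for Patroni internals.
import time

def handle_crashed_master(try_restart, release_leader_key, master_start_timeout=300):
    if master_start_timeout == 0:
        # Favour availability: don't even try to restart the crashed master,
        # drop the leader key so the most up-to-date replica can take over.
        release_leader_key()
        return False

    # Favour durability: keep trying to restart for up to master_start_timeout
    # seconds, so no transactions are lost to an asynchronous replica.
    deadline = time.monotonic() + master_start_timeout
    while time.monotonic() < deadline:
        if try_restart():
            return True
        time.sleep(1)
    release_leader_key()
    return False
```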