I know that we've got a bunch of Postgres users here; several of you are Postgres users. Anybody here not a Postgres user? Oh, there we go. Kinda, OK. How many people here use Docker or other container tech? A couple, OK. Well, one way or the other, let's go ahead and get started.

So welcome. Late-afternoon Sunday sessions are a little lightly attended, particularly when you're up against some of the people I'm up against, but I think you'll find this worth your time. I'm going to be talking about doing high availability with new tools for PostgreSQL. For the one non-Postgres person: this could be adapted to other systems, and probably will be, but right now it works for PostgreSQL.

Rather than just starting into a long description of the architecture, I thought it would be more fun to show it to you at work first. There's been a lot of Postgres HA stuff in the past that involved a lot of hand-wavy description of how things work; instead, I want to show you an actual working system. Now, mind you, it's not a real production system, because I'm running all of the containers on this laptop, which is not what you would do in production — obviously you'd have them all running on separate machines — but it makes for a good demo.

So the first thing I'm going to do here is use Docker Compose to bring up my containers, and let's tail the Docker Compose logs so you can actually see what's going on. There we go. OK, we're getting a lot of stuff. We're getting reports from two of the nodes there, as you can see at the bottom — node three and node one — that they bootstrapped from the master, and feedback that those two are secondaries.

So let's see: do we have a replication cluster here? Here are our containers running right here. We've got three database nodes and one etcd node — I'll explain what that's doing later in the presentation. It looks like node number two is our master, so let me find its address and log into it. And it looks like it has two replicas. So we've got our cluster: one master, two replicas, right here, and we are up and running.

But of course, I'm here to demonstrate failover and high availability, so let us kill off the master. OK, so again, 2.1 — node two — is our current master, so we're going to stop it. And you see a whole bunch of activity there; this is all log output from the Patroni system. And then we get a whole bunch of stuff here. Wait. And we can see the other two nodes restarting. So if we try to connect again, we don't have a connection anymore. Hold on — it's got hold of the cursor and doesn't want to let go of it. Just a moment. Well, it TCP times out. OK. So let us connect to — I think this is going to be number — no, wrong one. Who's the master here? "I am the leader with lock" on node three. So there we are. And you can see that node one is now streaming from node three, which is the new master.

So this is the essence of our system. There's a lot of code that went into making that happen, and now we're going to talk about that. In the meantime, we'll leave our two-node cluster running. Since we do have a small group, feel free to interrupt with questions, although at this point I haven't even started to describe the architecture. So, yeah? All right, yeah. Actually, no, no, hold on — I meant to do that as part of the demo. Let's go ahead and bring that node back up, eh? It would become a slave. Let's do that. Thank you for asking.
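For reference, the demo above boils down to roughly this shell session. The container names are made up for illustration; your Compose project will name them differently:

```console
$ docker-compose up -d            # bring up etcd plus the three database nodes
$ docker-compose logs -f          # tail the logs: watch two nodes bootstrap from the master
$ docker ps                       # list the running containers
$ docker exec -it demo_dbnode2_1 \
    psql -U postgres -c "SELECT client_addr, state FROM pg_stat_replication;"
                                  # run on the master: shows the two streaming replicas
$ docker stop demo_dbnode2_1      # kill the current master and watch the failover
```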
OK, so I just restarted DB node two. And you see we've got a bunch of stuff going on there: it's looking for a replication slot, it can't find one, and so it creates a replication slot. Replication slots are a feature of Postgres 9.4. They're really useful, which is why we create them by default. And now, if we look over at the master, which is still DB node three, it now has two replicas, because DB node two has restarted and rejoined the cluster as a replica. That would work whether we were instantiating a brand-new DB node two or restarting the old one — although restarting the old DB node two only works if there isn't any trailing replication information, unless you're using pg_rewind, which I'm not going to cover in this particular session because it's new. So that's operation in a nutshell.

So now let's actually describe how that worked and why it's needed. Let me start here: Postgres's built-in replication is really cool. I'm really happy with it. It took a number of years to hammer out, but at this point it's easy to set up, it makes all kinds of guarantees about replicating your data, it prevents data corruption, and it prevents a lot of common foot-guns with replication. That is, you have to really try hard to fuck up your systems; the worst that will happen is you'll break replication — replication will stop working. And you can combine it with disaster recovery fairly easily, so you have replication and disaster recovery in one mechanism.

Then you think: hey, all of this is great, but we're kind of missing something, which is — why is there no built-in failover? I mean, MySQL has built-in failover, right? Well, sort of. And this has actually led to a number of things, including something I heard at a recent conference — I forget which conference it was, in the fall. Somebody was talking to a Postgres person at their Postgres booth, and they literally said to them: "Automated failover is too complicated. You don't want it." Well, no, that's not good enough. A lot of us have SLAs to meet. We have always-on applications. And it's not impossible. Automated failover is doable, particularly if you restrict the problem.

Now, part of the problem we run into in the Postgres world is that we try to solve everything for everyone, and coming up with a failover system that will work for absolutely everyone, no matter how they're using their database, or what they're using it for, or what hardware or environment they're running in, is in fact pretty close to impossible. However, we can come up with a failover system for how a lot of people use their Postgres databases these days anyway: a bunch of OLTP web databases running in some kind of cloud or container environment, where we can have a pool of asynchronous replicas and automatically promote one when the master goes down, and where we have the ability to have some kind of a watchdog node. Those are our prerequisites; that's going to be our system. It turns out that this particular set of requirements meets the needs of a lot of people. Not everybody — there are people who need synchronous stuff and guarantees against data loss, and there are people who have to run on large hardware, where the cost of spinning up a new node is prohibitively expensive — but an awful lot of applications fit this spec.
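Circling back to the replication-slot bit at the top of this segment: if you want to see what got created when the node rejoined, a quick check on the master looks roughly like this (again, the container name is made up):

```console
$ docker exec -it demo_dbnode3_1 \
    psql -U postgres -c "SELECT slot_name, active FROM pg_replication_slots;"
                                  # the slots Patroni created for its replicas (9.4+)
$ docker exec -it demo_dbnode3_1 \
    psql -U postgres -c "SELECT application_name, state FROM pg_stat_replication;"
                                  # confirms both replicas are streaming again
```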
So I wrote out that sort of spec a few years ago and then worked with a team of my former coworkers at PostgreSQL Experts to build a system called HandyRep, based on that set of requirements. HandyRep is a master-controller architecture, built with Python, Fabric, and SSH. It's in production in at least one place that I know of; it's been forked a couple of times, so I don't know if it's in production elsewhere. And the idea was — because we're a consulting shop — to build something that would work with any of our clients' Postgres configurations in place, as they were, which often involved dealing with a lot of really screwy stuff in terms of LDAP authentication and special subdomains and all kinds of other things. It was also designed to be pluggable in order to support all of those infrastructure variations.

So we worked on that for about a year and a half and pretty much completed the initial spec. But there were some problems with it as a general solution. You pretty much had to have us install it for you; it was difficult to install. It was really difficult to debug — when we would lose a replica out of the system, it was really hard to figure out why. In the end, it had over 100 different configuration options, possibly more depending on plugins. It scaled kind of poorly: it was great for two-to-four-node, maybe six-node clusters, but not beyond that. And the HandyRep server itself, as a master controller, was kind of a single point of failure. We could have a secondary HandyRep server, but it was manual failover to that, which wasn't much help, and we couldn't have two active ones because then we had potential problems of our own.

So, OK: part of the problem with the HandyRep design was that we were trying to be too general. We were trying to tell our clients, "You don't have to change anything about how you're doing Postgres; we'll retrofit a failover system on top of that." And that is what some people need, but it's not a general downloadable solution, and I really wanted a general downloadable solution.

Well, in the meantime, there's this company called Zalando. Zalando is like the number one European fashion portal — hold on, I'm trying to figure out a way I can put down my coffee without it falling over. OK, Zalando is the number one European fashion portal. They've got about 15 million customers, ship some ridiculous amount of merchandise per week, and have 150 Postgres database nodes in their environment, which have to be up 24/7/365 because they control their own shipping, not just online sales. So allowable downtimes are tiny, and for that reason they needed automated, decentralized high availability. They'd looked at HandyRep, but they felt it didn't fit their needs because, among other things, they needed to support much larger clusters, and the clusters needed to be a lot more autonomous and not dependent on the DBAs to configure them.

So while Zalando was looking at this, and while I was getting dissatisfied with not being able to make HandyRep portable enough, some stuff happened. Now, Zalando tried to do it on their own initially, and they ran into a lot of the common problems that you have with automated failover and high availability, right? False failover, where you fail over when you didn't need to — which is always problematic, because you have to break all the application connections and reconnect.
And if it's happening all the time, that becomes a bad experience for the user. Misfires, where you try to fail over and can't complete it. Race conditions, where you can't figure out who the new master is supposed to be, and you either end up with two masters or none. Those sorts of things. And then they encountered the worst problem with automated failover. Does anybody know what the worst problem with automated failover is? Yeah, exactly: split brain. So OpenX kindly supplied me with this little brain. So we've got split brain — here, half a brain. Oh, and here we go, half a brain. Ta-da. So, split brain. Yep, split brain is our big problem. And from the perspective of people approaching this from a transactional database, where consistency is considered important, split brain is kind of the worst place you can end up, right? It's actually usually better for the system to be down than to have a substantial risk of split brain, because there is no automated recovery from split brain — and under some catastrophic circumstances, no recovery at all.

So what we really needed was a service that could come in and bless all of our little cloud Postgreses — those are little flying elephants, if you can't tell. Bless all our little cloud Postgreses: bless one of them as the master, and if that one goes away, bless another one as the master, and be consistent and immutable and independent about it.

So while we were thinking about all of this, a company called Compose.io — an online hosted cloud database-as-a-service company — was about to be acquired by IBM. And before the acquisition, they open-sourced a bunch of their stuff, including the system they used to provide high availability for Postgres. Not the complete system, but the initial proof of concept that became a system when it got integrated with their architecture, et cetera. They open-sourced it as Compose Governor: "high availability for Postgres, batteries not included." Now, this was just a proof of concept, but the ideas behind it were really good. Part of it was that they used a lot of technology that was, at that point — this is about a year ago... actually, it's not even a year ago. God, six months, seven months ago? Seems like a much longer time. They used a lot of technology that was just emerging: Linux containers, etcd for consensus, and a simple Postgres controller that lived on each node. And so we forked it — Zalando forked it, and I started contributing to it — into a new project, to actually make it production-worthy.

So, that's our background. Now let me explain how it all goes together, so you can actually understand the new system. The first thing to understand is that there are actually three parts to database failover. Part number one is detecting when you need to fail over. Part number two is actually failing over the database. And the third part is failing over the application from one database node to another. Now, in the current Patroni system, part number one — detecting failover — is handled by timestamps within etcd (again, I'll explain etcd in a minute) and API checks on the individual nodes. The clocks have to be consistent within the etcd cluster. And honestly, if your timestamps are wildly out of whack in your database cluster, you're going to have other problems as well; that won't go undetected for long.
The second part is handled by what's called leader election within etcd, in order to decide who to fail over to, and then, of course, Postgres replication failover. The third part — failing over the application — is not yet handled in Patroni, although I'll talk later about how that's handled by external systems.

So, here's how it works. We have our little Postgres node running in a Docker container — here's our elephant in the Docker container, right? Now, that's not just Postgres running in the Docker container. It also has this little Patroni daemon, which is a pilot. The Patroni daemon controls whether Postgres starts and stops, and controls its configuration. The idea being: if Patroni isn't running on that node, neither is Postgres, period. This was one of the big problems I had with HandyRep — I ended up with all this complicated logic of, OK, how do I detect whether Postgres is actually down or whether it's just HandyRep that's down? Well, the answer here is we set it up so that it is not possible for Postgres to be running if the Patroni daemon is down. No, no — you can actually set this up on VMs, and you can even set it up on real hardware. Yeah, containers encapsulate this a lot better, but no — actually, I believe Zalando's doing this on VMs.

So then you've got one of those, and you've got a whole group of them: a whole group of our little Postgres containers, piloted by Patroni in each case. And Patroni is just a little Python program, running — in the container case — as the application of the container. The nice thing about the container approach is that you don't have to take extra measures to make sure that when Patroni shuts down, Postgres shuts down, because in a container setup, if the application of the container is Patroni and that application stops, the container stops, and that's enforced by the container infrastructure. If you're running this on VMs or real machines, then you would actually need to do some extra stuff — with, say, systemd — to make sure that if Patroni exits, the Postgres postmaster exits as well. That's the advantage that running this in containers buys you, even if you're going to have one container per machine.

So then what happens is: OK, we've just started up three co-equal Postgreses. How does this become a replication cluster? Well, these three co-equal Postgreses need one other element, which is an etcd cluster. etcd is a distributed, consistent key-value store — it's actually more of an HTTP information store. I'll talk a little more about it later; for now, let's just understand how it functions. We've got an etcd cluster here, which functions as a single consistent service. When we start up our nodes, they all send messages to etcd requesting to be the master. etcd holds what's called a leader election and decides that one of them wins the master election. It sends messages back and says: OK, node two, you're the master; nodes one and three, you are replicas, and your master is node two. And then we do an automated base backup from node two in order to make nodes one and three replicas. Yeah? Yes, yes it does — etcd does in fact store its data, and it writes it synchronously to disk, so it does recover from being down. Although you might see a whole bunch of failovers if the system doesn't come back up all at once.
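At the etcd level, that "bless one of them as the master" step is just an atomic create-if-absent on a key with a TTL. Here's a sketch using the etcd v2 keys API of that era; the key path and node name are illustrative:

```console
# Try to become the leader: create the key only if it doesn't already exist.
$ curl -s -XPUT "http://etcd:2379/v2/keys/service/demo/leader?prevExist=false" \
       -d value="dbnode2" -d ttl=30
# The first node to get here wins. Everyone else gets back something like:
#   {"errorCode":105,"message":"Key already exists","cause":"/service/demo/leader",...}
# The winner then has to keep re-PUTting the key to refresh the TTL,
# or the lock expires and a new election happens.
```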
Which is often the case, yeah? It could just be two, it could be five, it could be 15 — whatever you want. Yes, no, there's no particular magic number. It's just that three is good for a demo, because it's enough that I can actually demonstrate failing over without using all the RAM on my system.

So anyway, that's establishing the initial replication for a brand-new cluster. Now, if you were retrofitting this onto existing Postgres servers, you would have to take some extra steps. But if we're doing it as a blank canvas, and we're going to load everything via pg_restore, then you just do this, right?

So then the question is: what happens when we lose the master, as I just demoed? Well, all of our nodes are sending messages to the etcd server every 10 seconds in the default configuration — you can configure that interval. Those notices have a time-to-live of 30 seconds. So within 30 seconds, they're going to check etcd and discover that there's no longer a master: hey, no master exists anymore. And then what happens is that one of them — whichever one happens to do it first — is going to try to grab the master key.

Now, during a failover, this happens in two stages. On initial deployment, whoever grabs the master key first gets it. In a failover, we want to be more discriminating. So whichever node gets there first gets a temporary lock on the master key, and then it checks the replay point of all of the other potential failover nodes using the Patroni API — because each of these Patroni daemons, I forgot to mention, not only controls Postgres but also has a RESTful API that's used for some of the operation of Patroni. Over the RESTful API, we can query what the replay point is on each of the different nodes. So if, for example, node one grabbed the temporary lock, checked the replay point on node three, and discovered that node three was further ahead, it would give up the master key, at which point node three would grab it and promote itself to master. So that's the election process: a two-stage election for failover. At that point, etcd sends back a message confirming "you have the master key" or "you don't have the master key," and the node that doesn't have the master key changes its primary connection info — it changes its replication source and starts replicating from the other server. And we've failed over. Any questions about that so far?

That's a very good point. Let me talk about that a little more when I get to split brain, because that's exactly what this goes into. So here — wait, hold on, we've got another brain to split. Who wants half a brain? Have half a brain. Anybody else? There we go. Yep. Because I haven't actually talked about how we prevent split brain. Well, in this case, we're relying on a lot of groundwork done in etcd. For those of you who are not familiar — because most of you did not raise your hands earlier — some of you work with etcd and similar services: Consul, ZooKeeper, yeah. etcd is a distributed-consensus HTTP data store. It stores all of its data as HTTP paths, so it's kind of a document store. It uses the Raft algorithm, which is one algorithm for what's known as distributed consensus.
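To make the two-stage election concrete, here's a minimal Python sketch of the logic just described. This is not Patroni's actual code: the etcd key path is illustrative, and I'm assuming a status endpoint on each node that reports its replay position, which is roughly what the real REST API provides.

```python
import requests

ETCD = "http://etcd:2379"
LEADER_KEY = f"{ETCD}/v2/keys/service/demo/leader"  # illustrative key path

def try_to_become_master(my_name, my_replay_lsn, peers):
    """Two-stage election: grab a temporary lock, then verify no peer is further ahead."""
    # Stage 1: atomically create the leader key if absent (the temporary lock).
    r = requests.put(LEADER_KEY, params={"prevExist": "false"},
                     data={"value": my_name, "ttl": 30})
    if r.status_code != 201:
        return False  # somebody else grabbed the lock first

    # Stage 2: ask every other node, over its REST API, how far it has replayed.
    for peer in peers:
        status = requests.get(f"http://{peer}:8008/patroni").json()
        if status.get("xlog_location", 0) > my_replay_lsn:
            # A peer is further ahead: release the lock so it can win instead.
            requests.delete(LEADER_KEY, params={"prevValue": my_name})
            return False

    return True  # we keep the lock: promote ourselves to master
```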
In CAP terms, the Raft algorithm is designed to give you consistency at the expense of availability: when the cluster can't reach a majority, it stops answering rather than risk giving an inconsistent answer — I'll show you that in a minute. etcd is great for configuration information and metadata. Now, people say: hey, if etcd is able to maintain consistency across a cluster, why not just put our database data in there? Well, here's one of the problems: it's really frigging slow, compared to a transactional database. The number of writes per second you want in etcd is measured in, like, the tens, because it's doing this whole consensus thing on the back end. So you don't want to store real data in there. As a matter of fact, we go to some trouble within Patroni to only write things to etcd that need to be there, versus things that we can poll the API of the individual nodes to get. Initially I had a design where I was constantly updating the replay point in etcd, and it turned out to be not such a good idea — when you're running etcd in a really lightweight container where there's a lot of other stuff, it can be a problem.

Now, there are some alternatives to this. People are familiar with ZooKeeper, which tends to be a little bit larger scale — it's the big Java-based thing. There is support for that in Patroni; you can run Patroni using ZooKeeper. Consul, by HashiCorp, is the other one, and it has the nice property of integrating service discovery as well. There is not currently support for that in Patroni; we don't have the module for it. Yeah? No, no, that goes into the Patroni configuration, which I'll show you in a minute. Initially I actually had a commission to support Consul — I was writing a Consul support module, and then that use case went away, because that particular user switched to ZooKeeper. If somebody else wants Consul support, they're going to have to write the module. It shouldn't be too hard; we've already got examples of how you do both ZooKeeper and etcd.

Anyway, the idea of etcd's distributed consensus is that if we have a network partition, the etcd cluster knows: if it can't establish communication among a majority of the nodes that were originally in the cluster, it responds to information requests with failure messages rather than providing the information. And that's deliberate. That means we prevent split brain due to a net split, because any database nodes that can only connect to the minority stub of the etcd cluster will get back failure messages. Now, what Patroni does with those failure messages is: that database node will restart in read-only mode if it was read-write. If it was already a replica, it will just keep going. Because the problem is, if that was our original master but it's now in an isolated network segment, we do not want it to continue accepting writes. But it's OK for it to stay up and continue accepting reads. I mean, we'll get stale reads, but presumably in a net-split situation somebody is getting pager alerts, and presumably we're going to straighten out the application connections at some stage.

And that's basically how it sets up. For etcd, this means that your cluster is statically sized. Changing the size of the etcd cluster requires a restart of the cluster, I believe — which is a little bit annoying, because obviously during the restart of the cluster you're going to have a flip-over in Postgres and a master election. So give it some thought.
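"Statically sized" means you enumerate the members up front when you bootstrap etcd. A three-member static bootstrap looks roughly like this — the addresses are made up, and this reflects the etcd of that era:

```console
# One member of a statically bootstrapped three-node etcd cluster;
# every member is started with the full peer list.
$ etcd --name etcd1 \
       --initial-advertise-peer-urls http://10.0.0.1:2380 \
       --listen-peer-urls http://10.0.0.1:2380 \
       --listen-client-urls http://10.0.0.1:2379,http://127.0.0.1:2379 \
       --advertise-client-urls http://10.0.0.1:2379 \
       --initial-cluster etcd1=http://10.0.0.1:2380,etcd2=http://10.0.0.2:2380,etcd3=http://10.0.0.3:2380 \
       --initial-cluster-state new
```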
The useful size for an etcd cluster is usually about five nodes, if you're really trying to guard against failure situations; there's not a strong reason to make it larger than that. ZooKeeper, I believe, can actually be dynamically resized. It uses a different algorithm for consensus, but it performs a lot of the same functions. Yeah? Yeah, well, if we're doing asynchronous replication, yes — under any failover circumstance with asynchronous replication, you will have lost some data. If you can't afford any data loss, you need to set up synchronous replication, and within Patroni we have some configuration options for synchronous replication. I'll warn you that I don't think anyone is using them, so test the hell out of that if you're going to go that way right now. In general, for a lot of the web stuff we're talking about, we're willing to accept the loss of a couple of seconds of data versus being down for an hour while we wait for a human being to check things out.

That is, however, another reason not to restart existing nodes automatically. There are several reasons. First, if the node failed in the first place, you don't really want to restart it automatically, because you don't know why it failed until a human looks at it, right? Second, if the node failed, it has untransmitted data; if you restart it and force it to rejoin the cluster, you're going to wipe out that untransmitted data, whereas if a human being restarts it and isolates it from the cluster, they could potentially recover some lost transactions. Yeah.

So let's look again at the setup in detail, now that you actually know what it's supposed to be doing. First of all, I mentioned that each one of those nodes is running a Patroni daemon. So this is the configuration: you configure Patroni on each individual node and pass it a configuration file. This is the configuration that's getting loaded through Docker Compose into each node; the placeholders are for environment variables that are being loaded through Docker Compose.

So let's take a look at a few of these things. "Scope" is somewhat cryptically named: it's the name of the cluster this node belongs to. The idea is that you may have multiple database clusters in a single network that you're running Patroni on. They may even share etcd or ZooKeeper servers, in which case you need namespaces for each of them, and that's supported. The default time-to-live and the default polling interval are right here. And then we've got some other configuration. Like I said, there's a REST API running on each node, and this is where you tell it what IP and port it's going to be listening on, and, within that, what the advertised connection address is going to be. The reason these are two different lines is that if you're doing some sort of network redirection or network masking — particularly for, say, service discovery — your advertised address from outside the container or the VM might be different from how that address is seen internally, and we need to support that. Most of the time these will be the same. Currently the API supports SSL and simple, statically set username/password authentication. We haven't had a strong push for supporting something like LDAP for the API. So again, if that's something you need, fork the project — that's what it's there for.
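So that you can picture it, the top of that per-node YAML file looks roughly like this. The values are illustrative, and the exact key names may have shifted since, so check the project docs:

```yaml
scope: demo                        # cluster name; namespaces this cluster in etcd/ZooKeeper
ttl: 30                            # how long the leader key lives without a refresh
loop_wait: 10                      # the polling interval: how often each node checks in

restapi:
  listen: 0.0.0.0:8008             # where the REST API actually binds
  connect_address: 10.0.1.12:8008  # the advertised address other nodes should use
  # optionally: certfile/keyfile for SSL, plus a static username/password

etcd:
  host: etcd:2379                  # the DCS this node reports to
```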
Then you need to configure your distributed configuration store. The configuration for etcd is really simple: we've got a scope, a time-to-live, and the host. If you have authentication set up in etcd of some kind, that information goes there too. ZooKeeper configurations tend to be a little more complicated; there's an example in the docs of the different elements you need for a ZooKeeper configuration. So that's where you configure which etcd cluster it's connecting to, and it needs to be configured on each individual node.

Then you need to configure a bunch of PostgreSQL things. One of the reasons you need to do this is that if Patroni is initializing your Postgres cluster for you, the configuration you put here is the only configuration there is for PostgreSQL, because Patroni needs to be able to rewrite the configuration in order to restart nodes. If you're retrofitting Patroni onto an existing database cluster, it might be a little more complicated, and you might be able to ignore some of this stuff. But if you're doing it the way I just did — spin up brand-new containers, et cetera — then all of your PostgreSQL configuration comes from here. And this is all your typical PostgreSQL configuration: the listen address; again, the advertised connect address, which might be different; the data directory; maximum lag on failover. That last one is a check we do, again through the API, of how far behind each replica is. In addition to trying to choose the furthest-ahead replica, you can set a threshold that says: hey, if the replica is more than this far behind, don't fail over to it at all. In production you'd probably want to set that to something like a gigabyte or more, depending on what your traffic is.

And then the rest of this is — oh, create replica methods. I'll mention this again later on. By default we use pg_basebackup, because it's the simplest way of spinning up new nodes, but we do support other methods — Postgres point-in-time recovery, WAL-E recovery — in order to deploy new nodes, say if you have a larger database for which doing a base backup is prohibitively slow. One of the things I'm actually working on is a modification to allow you to take the base backup from one of the other replicas, which is not currently supported but will be in the future.

The other thing you have to set up is your host-based access — your Postgres access control file, pg_hba.conf. Again, this needs to be written by Patroni if Patroni is initializing the database cluster. We set a whole bunch of passwords, because, again, we're initializing the cluster and those passwords need to be set. And if you need to pass any parameters to Postgres — like, for example, archiving parameters for replication, et cetera — those need to be passed via Patroni, because Patroni is not actually writing the postgresql.conf; it passes these settings to Postgres as command-line options.

Yeah? I'm trying to remember why we did that. That was a bug I filed early on: when we initially initialize the cluster, we have to launch it in trust mode so that we can create the passwords, and it wasn't getting relaunched, so as a result we were locking ourselves out. I think now you don't actually have to do that.
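And the postgresql section that carries all of that looks roughly like the sketch below. Again, the values are illustrative and the key names are from my reading of the docs at the time, so verify against the current documentation:

```yaml
postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.1.12:5432   # advertised address; may differ behind NAT
  data_dir: /var/lib/postgresql/data
  maximum_lag_on_failover: 1048576  # in bytes: replicas further behind won't be promoted
  create_replica_method:            # how new replicas get imaged, tried in order
    - basebackup                    # default: pg_basebackup from the master
  pg_hba:                           # written into pg_hba.conf on initialization
    - host replication replicator 0.0.0.0/0 md5
    - host all all 0.0.0.0/0 md5
  replication:
    username: replicator
    password: rep-pass
  superuser:
    password: super-secret
  parameters:                       # handed to Postgres on the command line, not postgresql.conf
    wal_level: hot_standby
    max_wal_senders: 5
    max_replication_slots: 5
```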
In this case, because it's a container with no SSH, it's perfectly fine to say "local all trust," because no one can get into it — unless they can hack the container, and if they have root access on the machine they can get in anyway. In a different circumstance, you might not want to do that. So it would be worth testing, and if it's still broken, add your report under that bug and say, hey, it's still not fixed. And part of it also depends on whether you're initializing this via Patroni or not. If you're not initializing it via Patroni, then it's not important to have local trust access, because you will have set those passwords yourself rather than Patroni setting them for you. Do you follow me?

So I was actually working on that, and I discovered a problem with Postgres, which is that we can't pass the include-directory and include-file directives via the command line in Postgres; they have to be in a physical postgresql.conf file, which is kind of a pisser, because they're not in the default file. See my long arguments on -hackers about why conf.d should be default behavior in Postgres — which I lost. So anyway, the idea was that I was looking at a modification for having a conf.d directory. I might put that in by default in the future with Patroni: a conf.d directory. If you're doing containers, you would mount that as a volume; if you're doing it in a VM, you'd put it wherever you wanted, and then you'd have another place to drop in configuration options rather than passing them through the Patroni configuration. The advantage of passing everything through the Patroni config file is that you basically have one master config file to rule them all, which can get checked in under whatever your configuration management is, and then you don't have to worry about having several separate configuration files. Might be an advantage, might be a disadvantage, depending.

So let's go ahead and do that again. OK, let's shut this down. And I actually need to tear down the containers. The reason is that if I don't tear down the etcd container, Compose will just restart it, and etcd writes its data to disk, so it would already have a master marked as initialized. Actually — there are circumstances where Patroni can be unable to restart without human intervention: if etcd is down, and then all the database nodes shut down, and then they come back up and there's no master... well, except that if you don't have a valid etcd cluster, you're not coming up anyway. Anyway, rather than restarting it multiple times, here I want to restart it from scratch — from scratch rather than your failover circumstance.

So again, we get lots of output; you can see all this stuff. That "failed to acquire initialize lock" message — so there are two kinds of master locks, right? There's the "I am the current master" lock, and there's the initialize lock. The initialize lock is held for the entire cluster, and it's only set once. Once it's set, it lives under that cluster's namespace in etcd until it gets deleted. And the reason is that you don't want to accidentally re-initialize a database cluster that has data in it.
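For the curious, you can watch both of those locks sitting in etcd. The key names below are from my reading of the code, so treat the exact layout as illustrative:

```console
$ curl -s "http://etcd:2379/v2/keys/service/demo?recursive=true" | python -m json.tool
# Keys you'd expect under the cluster's namespace:
#   /service/demo/initialize       <- the cluster-wide initialize lock: set once,
#                                     no TTL, prevents accidental re-initialization
#   /service/demo/leader           <- the "I am the current master" lock, with a TTL
#   /service/demo/members/dbnode1  <- per-node status, refreshed every polling interval
```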
You can actually get wedged — I'm trying to remember the specific sequence of events, because I've done it — under circumstances where the Patroni cluster will refuse to start in read-write mode, because it can't find an initialized master but the initialize lock is still set. We decided that was better than having it wipe out all your data automatically. That would be a circumstance where everything went down and then came back up unevenly, and those are hard to protect against 100%.

So we see this going on. I don't understand why, when I start this particular demo, node two always wins. I don't know why — it's actually the second one being started, but for some reason, with the timing, two always wins. So, wait a minute. Two always wins, but doesn't always have the same IP address. Oh, it's three this time. OK. So again, here we see node two has two replicas, nodes three and one. So that's our initialization setup.

And then we want to go ahead and kill two, right? And you see all this traffic of stuff not being able to connect and that sort of thing. Actually, that time it failed over really fast. Part of it depends on the timing: you've got a 10-second polling interval, and it depends on whether you hit that interval immediately or later on. And then we get this. Now, there's no contesting who's furthest ahead, because I'm not running any traffic on this cluster, so they're all at the same replay point. So we fail over, and now we have two nodes. Who's the leader this time? Three — three is the leader. Well, it's either going to be four or five, so let's find out. Am I the leader? No — there we go. So node one is replicating from node three, right there.

And now we actually want to bring node two back up — or bring a new node two up. Actually, hold on, let's wipe out node two, huh? Because in a circumstance where we had a real failure, you wouldn't be bringing back the original node, right? We want a new node two. So now we're going to bring node two back up. Hold on. There we go. Node two has connected, and we now have two replicas this time: two and one are replicating from three.

Anything else you want to do to this cluster? It's all temporary containers, so if we screw it up, I'll just rebuild it. That is a very good point — yeah, let's do that. Yes, you see all of these reconnections, that sort of thing. See, "demoted self because DCS is not available" — that's the message you're getting. The original master is restarting as a read-only node. Because the thing is, with Postgres you can start any node as a replica of a nonexistent master, at which point it becomes a read-only node. And because we support cascading replication, it doesn't require breaking replication to the other two nodes. We're going to keep getting this message, of course, because it's still trying to connect to etcd. Now, if we bring etcd back up — ta-da. OK, let's see who's the master now. Node two actually promoted itself; that's interesting. Oh, right — because the TTL had expired. Yeah, so that will happen: if the etcd cluster goes down, when you bring it back up you may get a failover even though it's not strictly necessary. And that, again, is a timing issue. Because the etcd cluster has been down for longer than the master lock's time-to-live, when it comes back up, the individual node doesn't know the difference between the master being completely down and the information service having been down, so it treats it as a master-election circumstance.
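A quick aside: that "am I the leader?" check I kept doing by hand can also be done directly against each node's REST API. The endpoint and field names here are illustrative of the API at the time:

```console
$ curl -s http://dbnode3:8008/patroni | python -m json.tool
# Expect something like {"role": "master", ...} on the leader
# and {"role": "replica", ...} on everything else.
```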
Now, in this case, because all of our nodes were read-only, we lose no data, because the nodes will have been at the same replay point anyway. That's why this was not regarded as an issue for us to fix — it's not really a problem. Now, if I brought etcd down and then deleted it entirely, we would never come out of read-only mode, because the nodes would keep polling for the list of servers and never find them.

So, we're getting towards the end of our time here, and I've already taken a lot of questions. Let's finish talking about the other stuff. What's included currently in Patroni? Three things. The Patroni agent, which, again, runs on each server and has a RESTful API. There's also a rudimentary command-line tool, under heavy development, called Patroni CLI, which is steadily improving towards its full spec of features but isn't complete yet. Patroni CLI is basically a Python command-line tool that allows you to interrogate the API of the individual nodes to ask them for things and, importantly, to do things that wouldn't happen automatically — like manual failover. Say you want to manually fail over because you're going to apply kernel updates to individual nodes, or because you're putting the node on a VM with more memory, or whatever. Also stopping individual nodes, et cetera — you can do that via Patroni CLI.

Stuff that's not included: I mentioned that the third part of failover is failing over the applications. That is not included in the core Patroni project, and the reason is that it's going to be provided by other things. There's no GUI interface for Patroni, and I don't think anybody has plans to build one. Deployment of containers, VMs, or whatever is your own thing to do, or do it via downstream projects. And there's no built-in monitoring, except that the APIs provide a lot of information, so a monitoring system can interrogate the APIs and get a lot of information for monitoring; we just don't have any templates set up for that.

So, given that Patroni doesn't cover everything, we have a couple of downstream projects. The one that's in production right now is a project called Spilo, from Zalando. Spilo is Patroni plus a whole bunch of Amazon orchestration tooling — they're a very AWS-integrated company, so they use all the AWS tools, the Amazon virtual IPs and the load balancers and that sort of thing, on top of Patroni, to provide a complete system. And that's available, with documentation. If you're actually going to look at implementing Spilo, though, be prepared to devote a significant amount of work time to it. It's a complicated system with a lot of parts, and it's only ever been deployed at one company. I mean, they have done a really good job of trying to document it, but it is very complicated.

The one that I'm working on — not available yet, but check back in a month or so — I'm nicknaming AtomicDB. I'm going to be using the Atomic/Kubernetes stack to supply more of a complete system, again on top of Patroni, with Kubernetes and service discovery providing the routing and application-failover portion of the whole thing. And maybe also provide some OpenShift containers that are Patroni-based, for anybody who uses OpenShift. So check back in a month or two, or just follow my blog, and you will see that as it develops.
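To give you a flavor of the Patroni CLI workflow described above — the exact subcommand names were still settling at the time, so treat these as illustrative:

```console
$ patronictl list demo              # show every member of cluster "demo" and its role
$ patronictl failover demo          # manual failover, e.g. before applying kernel updates
$ patronictl restart demo dbnode1   # restart a single member's Postgres under Patroni
```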
More features: there is pg_rewind support. If you're OK with wiping out data, you can enable pg_rewind, which means that down nodes will be guaranteed to be able to rejoin the cluster — but that may mean wiping out data they contain that's not on any other node, because of the failover. Like I said, we have configurable node imaging via WAL-E and point-in-time recovery. There are instructions on how you would enable synchronous replication support; like I said, I don't think that's really been tested, so you might want to do some testing. And we now have ways to flag specific replicas as non-failover replicas, which is something you'd want to do if, say, you have 20 replicas for load balancing: you generally want three or four designated failover replicas that don't take load, and the rest of the replicas are load-balancing replicas.

Other things in development: cascading replication support — obviously, for geographically distributed stuff, you'd want cascading replication. An integrated proxy — somebody wrote something in Go for another project that I was looking at potentially integrating, because it's a nice little proxy that will query etcd or ZooKeeper to find out who the current master is and reroute stuff. And when bidirectional replication actually becomes a thing that mere mortals can install, we'll want to look at supporting that. In the meantime, if you see features that we don't have and that you want: it's on GitHub, it's mostly Python, you can fork it.

So here's a list of resources, as you can see. Either take a picture of this slide, or I'll be pushing it to my GitHub page, jberkus.github.io, after this presentation, and then you'll have all of those links. It isn't pushed yet, but it will be within the hour.

So, any final questions? Oh, we've actually got five minutes for questions. Go ahead — we can even clobber the cluster if you want to, yeah?

Yeah, so the demo is separate. The thing is, the test suite in the code is just unit testing, which is great to have, but it doesn't test some of the things we want to test, like: does failover work? That's the reason I created the Patroni Compose project, which is what I'm using here. Oh, I actually don't have a link to that — I'll definitely link it off my web page. The Patroni Compose project is a Docker Compose file designed to set up a cluster so that you can run automated tests of things like: does failover work? Can I add a node? Et cetera. That needs to be built out into a full test suite of failover behavior, which doesn't exist yet — that's one of the things we actually kind of need. I'll probably build that out via Kubernetes as well: once I actually get AtomicDB going, we'll have Kubernetes, and that will have its own test suite built into it, because it'll be a little easier to automate than what I've got with Docker Compose, which is a little rudimentary.

So, yeah, you had a question? Yeah, yeah, OK. Well, the thing about pgpool-II is — I don't talk about pgpool-II, because I don't want to trash other people's code in the community. I'll just say I would personally not use pgpool-II for automatic failover. Other questions? OK, well, thank you very much.