All right — geographic failover. I'm Sean Chittenden, and I've been working with Postgres and infrastructure operations for close to 20 years now. I wanted to start with a quick demo to show you what we're going to be talking about, so that everybody understands the context and what's possible.

I have a simple four-node cluster here — oops, this is an artifact from earlier; ignore the bottom one, I apologize. Three nodes are participating in one data center, lab1, and a fourth node is over here all by itself in its own data center, lab2. There's a copy of Postgres running on each node, and they're replicating their data out — you can see that here. We're going to fail the databases over to the lab2 data center.

Hold on a second — sorry, it's hard for me to see over there. Let me just do a dig: dig pgdb.service.consul. We can see there are two hosts answering for this service in this data center, and we'll use the geo-failover version of that query as well — there we go, two nodes, and the IP address down here ends in 141. I'm going to go through one by one and shut these Postgres instances down. Now we see only one instance left in that data center, and — let me go back to this — only one receiver on the replication side. If I start one of these back up, we're back to the healed condition with the node that's left in the local data center.

OK, so DNS isn't magic, but configuring it is the interesting part. This slide is a mockup of what we just demoed. We did it using an open-source tool called Consul. Consul does service discovery, health checking, and a key-value store, and it is data-center aware. (I'll get to the installation in a second — that slide is out of order.)

At HashiCorp we have a number of tools. They started with Vagrant, and along the way came Packer and Nomad; the pipeline runs from the development side of the house all the way through to operations. Consul is the one we're focusing on, and it's very much oriented toward the operations side of things.

There are a couple of terms I'm going to reference throughout this talk. First is an agent: an agent is anything that runs the Consul process, either server or client. A client is simply a process that runs on a node and answers and forwards RPC calls. A server holds the actual data itself. Under the hood, Consul uses a consensus protocol called Raft — a distributed consensus protocol we'll get into briefly. All nodes gossip with each other; we'll explain quickly how that works. And each node has a concept of a data center for locality, but we'll see that locality actually has to do with network distance, not some arbitrary definition.

So Consul is a highly opinionated framework that provides a DNS and an HTTP interface. It's very scalable, and it's operationally quite simple.
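To make the demo concrete: those lookups are ordinary DNS queries against the local Consul agent. A minimal sketch, assuming the agent's default DNS port of 8600 on localhost — in the demo the queries went through dnsmasq, so a plain dig worked, and the "go-pgdb" prepared-query name here is illustrative and is covered later in the talk:

```sh
# Ask the local Consul agent which nodes are advertising the pgdb service
dig @127.0.0.1 -p 8600 pgdb.service.consul +short

# The geo-failover version of the same lookup resolves through a prepared query
dig @127.0.0.1 -p 8600 go-pgdb.query.consul +short
```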
Under the hood, the architecture is reasonably simple: you have a set of clients that all talk to a small number of servers. The servers elect a leader among themselves, and information between the servers is replicated. All agents gossip with each other, so even though the diagram shows a single line between client and server, there's a gossip pool spanning every node in the cluster.

To decompose problems — or just because of the speed of light — you can set up multiple data centers. Servers know how to talk to servers in other data centers. The LAN gossip pool stays within a data center, but there is a WAN gossip pool between the servers of different data centers. As a result, when a client in data center one needs to resolve something in data center two, there's a lookup path: the client performs an RPC request to any local server; if the question can be answered locally, that server forwards the request to the leader if necessary, and the leader resolves it. If the question targets a remote data center, the local server forwards the RPC to any server in the remote data center, which forwards it to its own leader and returns the result back.

When I talk about clients and agents, there are a number of network ports and services that Consul provides; this little diagram shows where each of the ports lives. The LAN gossip port in particular is an interesting element: by default Consul uses a protocol called Serf under the hood, also written by HashiCorp, which does liveness detection to figure out whether a host is alive or dead. It uses UDP by default, but since UDP is (or can be) a lossy protocol, it falls back to TCP to provide stronger guarantees when necessary and performs anti-entropy to make sure the cluster's view of its state is actually accurate. Clients listen on the HTTP and DNS interfaces and relay those requests to the backend servers over the server RPC port, TCP 8300. You can see that the WAN gossip protocol is only used between servers, not client agents. The failure-detection side of Consul makes sure that when something dies or becomes unhealthy, it's detected very quickly.

So now we can actually get to the installation. It's typically just a configuration file with only a handful of things you need to put in it, and it's generally the same config for every agent in the cluster. You specify a data directory for local state, where you want it to listen (a network address or a socket), how you want to handle ACLs, and a gossip encryption key. Those are the highlights, with the exception of the DNS configuration: in order to scale out and not have every single request forwarded to the leader, you can configure the DNS interface to allow stale reads, which means follower servers are allowed to answer the query and may return slightly stale information.
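A minimal sketch of what that agent configuration might look like — the data center name, addresses, and gossip key here are placeholders, not values from the talk:

```json
{
  "datacenter": "lab1",
  "data_dir": "/var/lib/consul",
  "client_addr": "127.0.0.1",
  "bind_addr": "10.0.0.11",
  "encrypt": "REPLACE-WITH-BASE64-GOSSIP-KEY",
  "acl_datacenter": "lab1",
  "acl_default_policy": "deny",
  "dns_config": {
    "allow_stale": true,
    "max_stale": "5s"
  }
}
```

Essentially the same file (minus per-host addresses) can be handed to every agent in the cluster, which is what keeps the installation operationally simple.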
Actually running Consul is very simple as well: you say consul agent, pass it a config file and a configuration directory for supplemental services, point the logs somewhere — in this case standard out — and you're off to the races.

Consul provides a service directory and a node directory. The node directory is maintained through the Serf gossip protocol we saw earlier; I'll come back to the consul info side of things in a minute. Under the hood, Consul servers replicate information using Raft — let me break real fast and pull that up so you can see. Raft is a fault-tolerant distributed consensus protocol that provides a replicated state machine across the servers. In this visualization, server four is acting as the leader. We can make a write request against it and watch that information replicate out; let's make another write request. This form of distributed state is different from what Postgres uses internally — in some ways it's similar, but this protocol is built around consensus, so we can actually stop the leader and force a timeout, and the remaining nodes will elect a new leader on their own. Say there's a network glitch: a new election happens. Bring the old node back online, make a new write, resume, and we see everything catch up. What this means is that, unlike Postgres, you can kill your server instances if you want to; there's a slight service interruption, but it gets reconciled very quickly. (One of the perils of doing demos with a broken wrist.)

All right, consul info, since I've switched tabs here. consul info gives you a bunch of information about the state of — in this case — a server agent. It tells you who the leader is and the state of the Raft log, the log we just saw being replicated. One interesting stat here shows you the degree of latency involved: all of the nodes are pinging each other on a sub-second basis, so the tolerance on liveness is pretty tight. There's also some information about gossip — less interesting, but again it's the heartbeat and the fact that the servers are communicating very aggressively that I wanted to point out.

So, service discovery: the primary interface to this inside of Consul is DNS, as we saw. There's also an HTTP interface with additional information — it has the actual addresses and the port number for each of the services. And, as I'll show in a second, Consul supports SRV records, so if you have applications that know how to perform SRV lookups, you can resolve both the IP address and the port dynamically on the client. Very convenient.

Creating an actual service is very easy. Remember how in the initial config we specified a config directory? A very common way of deploying this is to splat out a JSON file that describes the service and the health check for that service. In this case we have a service named pgdb; it carries a tag called slave — we'll see how that gets used in a second — and a port, 5432. Then there's an array of checks; in this case it runs check_postgres, whatever that happens to be, and the check needs to return exit code 0 for healthy, 1 for warning, and 2 (or anything else) for an error. It's deliberately Nagios-compatible.
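A minimal sketch of what that service definition file might look like, dropped into the config directory — the service name, tag, and port are from the slide; the exact check command and interval are assumptions:

```json
{
  "service": {
    "name": "pgdb",
    "tags": ["slave"],
    "port": 5432,
    "checks": [
      {
        "script": "/usr/local/bin/check_postgres --host=127.0.0.1 --action=connection",
        "interval": "10s"
      }
    ]
  }
}
```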
So with that in place: dig slave.pgdb.service.consul. What you just looked up is all of the database instances that have the slave tag applied to them — you get the A record back, or all of the A records. Similarly you can do an SRV lookup, like I was saying, which also carries the port number.

The nice thing about DNS is that it's a zero-touch config: you can deploy Consul, integrate it with your DNS infrastructure, and all of your applications can start leveraging Consul. Out of the box, it just works. And because Consul serves low DNS TTLs, the information stays reasonably accurate — the staleness window is normally acceptably short. If you need richer interfaces, like I said earlier, there's an HTTP interface for custom integrations, and there are a number of tools in the Consul ecosystem, including HTTP load balancers, that leverage this service-discovery information.

Host checks and service checks are both available. Again, the protocol is very simple: these are exit codes, zero is passing, one is a warning — a warning will continue to be included in the DNS records; only failing checks get pulled out of the service-discovery catalog. The checks themselves all run on the edge, which means — actually, hold on, I've got this coming up; I'll come back to that point. Here we've got a host-level check to check memory. Again, very simple, and it runs aggressively; it's okay to have these run every 10 seconds or whatever. There's also built-in support for other check types: a Docker check type, an HTTP check type, and the shell-script check type we saw earlier.

Now, in a Nagios-type infrastructure, you typically have a central health-checking service. That health checker fans out and talks to all of your web servers or database servers: hey, are you healthy? The service responds back: yes I am. You trust that, great, and you go on to the next one: hey, how about you? And it either says no, or it times out — maybe there's a network partition, who knows. The important part is that all of this information flows back to a single point of failure. You have just created a massive SPOF out of your health-checking service, and as your infrastructure grows, that's potentially thousands and thousands of checks per second streaming back to a single node. If anybody has scaled out Nagios before, you're very familiar with what this problem looks like.

Consul does things differently: it runs all the checks at the edge and only advertises when there's a state change, which means the Consul servers only receive tens of requests — only when something changes, not every time there's health information. Remember, the gossip protocol is sitting there running continually; it's pinging other nodes on the network every 100 milliseconds or so, so you have a very, very good understanding of what's alive and what's not.
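The host-level memory check mentioned above is the same idea, just not tied to a service — another small JSON file in the config directory. A rough sketch, with the script path and interval assumed:

```json
{
  "check": {
    "name": "memory-utilization",
    "script": "/usr/local/bin/check_mem.sh",
    "interval": "10s"
  }
}
```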
So if a node disappears off the network — say this web node here disappears — you don't know whether the box died or something on the network died. Doesn't matter; you don't know. With Consul, you can figure out pretty quickly that the box is no longer there, which means all of the services on that box are probably also in a non-healthy state, because it's not reachable on the network. So all you have to do is react when something changes, which lowers the load by orders of magnitude. The checks happen on the edge and state changes are pushed — that's the important part.

That works because of the way the gossip protocol handles fan-out, which is a constant background process and a consistent load on your network. Internally, each agent — regardless of the size of your cluster — picks five nodes at random, sends out a UDP ping, gets something back, says okay, you're alive, and moves on; every node does this on its own, every 100 milliseconds. From that it builds a very good network map internally, which Consul actually uses to calculate the network distance between every pair of nodes. Let me show that — consul rtt (I have a typo in my config, I'm not going to change it right now). consul rtt returns an estimate of the round-trip time between this node and any other node, or between any two arbitrary nodes: if you're on node C, you can ask what the round-trip time is between node A and node B. And this also works across remote data centers.

So internally we have state information: here it says I was unable to talk to a particular box — great. My service is up; I have one check that's failing, so I'm going to go remediate that real fast, and that information propagates quite quickly. You can also get the actual output from your check command, so this gives you a really nice distributed dashboard for what you have. And like I said, it's multi-data-center: I can go over to lab2, my other data center — there's only one node there, and I can see it. Internally there's a service managed by Consul called consul; you can never register a service named consul yourself, because that name is special and privileged.

There's also a key-value store, which is useful for providing distributed runtime information about what you want the state of your cluster to be. I apologize for the color scheme here. In this case we're setting the key foo to the value bar. I can go back and get that key — get foo — and you see it comes back with a base64-encoded value, plus some additional metadata: flags, and the transaction index that created it or last modified it. Those are important values; I'll come back to why in just a second.
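As a rough sketch, the same put and get can be done against the agent's HTTP API (port 8500 by default); the key and value are the ones from the slide:

```sh
# Write the key: foo = bar
curl -s -X PUT -d 'bar' http://127.0.0.1:8500/v1/kv/foo

# Read it back; Value comes back base64-encoded ("YmFy" decodes to "bar"),
# alongside the Flags and CreateIndex/ModifyIndex metadata mentioned above
curl -s http://127.0.0.1:8500/v1/kv/foo
```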
Consul does provide an ACL infrastructure. By default, when you configure a deny-by-default setup, you need to specify an anonymous ACL. This is really important because in the DNS world you can't exactly pass a token — tokens are how queries and requests are authenticated in Consul, but in the DNS interface there is no such thing as a query parameter you can pass along. So you need to set up some form of ACL, and I recommend the anonymous token in most environments: that gives anonymous requests the ability to query whichever services, whichever portions of the key-value space, and whichever prepared queries you want to expose.

So prepared queries — what are they? Suppose you've got an environment with multiple data centers and multiple services, and you'd like to be able to talk to a service — whatever's alive, whatever's closest. A prepared query provides you with a policy framework to specify exactly that. Consul reacts on a machine timescale: it's pinging every 100 milliseconds on the gossip layer and running service checks every second or so. No human is going to be involved in the response when something is marked critical. Assume it's going to go critical; maybe it heals, maybe it's just a flap, who knows — doesn't matter. You want something to be available. A prepared query lets you specify, declaratively, what you want the behavior of the system to be.

So unlike earlier, where clients were using pgdb.service.consul, the demo I did at the top of the talk used query.consul — that's the lookup path for resolving a prepared query over DNS. These are very simple things, all specified as JSON. In this case we created a new prepared query called go-pgdb-slave. It looks up, inside each of the data centers, a service named pgdb. You specify that you want to fail over to the nearest three data centers — stop looking if you can't find anything in the nearest three — and you can also constrain it to explicitly named data centers. And you say I only want instances that carry the slave tag. What this means is you can perform that query — go-pgdb-slave.query.consul — and it resolves to whatever is closest. As a database administrator, this gives you a safety net: clients always talk to whatever healthy instance is nearby.

This is great, but I don't want to have to specify one of these prepared queries for every single database instance I have. Maybe I'm in a microservices world and I've got a thousand database timelines running around inside my environment — one database per microservice, who knows. Manually administering a prepared query for every internal customer's database becomes really problematic. That's less than friendly, so we can do better.

Enter the prepared query template, and this is where things get interesting. In this case we have a prepared query template named go-db — the name is reasonably arbitrary. What we do here is hook a regular expression into the DNS interface, effectively: it anchors onto the lookup, so if we make a query for go-db-customer-master, the regular expression runs against that name, and the lookups for the service and the tags are performed dynamically based on what came in through the DNS interface. This is really powerful, because it means that with this one query template you can, as an administrator, deploy any number of databases with any number of different tags.
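A minimal sketch of registering such a template against the HTTP API — the query and service naming mirror the slide, but the exact regular expression and field values are assumptions:

```sh
# Register a prepared query template (POST /v1/query). Any DNS lookup of the
# form go-db-<service>-<tag>.query.consul will match it and be expanded.
curl -s -X POST http://127.0.0.1:8500/v1/query -d '{
  "Name": "go-db-",
  "Template": {
    "Type": "name_prefix_match",
    "Regexp": "^go-db-(.+?)-(master|slave)$"
  },
  "Service": {
    "Service": "${match(1)}",
    "Tags": ["${match(2)}"],
    "OnlyPassing": true,
    "Failover": { "NearestN": 3 }
  }
}'

# Resolve the customer database's master through the template
dig @127.0.0.1 -p 8600 go-db-customer-master.query.consul +short
```

The catch-all variant mentioned in a moment is the same idea with an empty Name prefix and "${name.full}" as the service, so any lookup maps straight through to the service of the same name.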
At that point it doesn't matter how many databases you have. You've now got this safety net installed in your organization, so that as you do administration — you bring down a slave, you bring up another one someplace else — you never have to go back and change a developer's code. It's all entirely within your control, as long as you have some base infrastructure policy set up and specified in advance. So if I've got, say, the customer slave database and I want to change the name of the tag, I don't have to go back and change every client; I just figure out what rules I want to apply to my organization, specify them declaratively, and push that information out. You can do this for masters, you can do this for slaves — it just works. And if you want something that doesn't use the go-db prefix at all — a blanket catch-all for everything, including web services and anything else in the consul domain, so that any service can fail over to a different data center — you can do that with a catch-all template; it simply maps every incoming name to the service of the same name.

As I mentioned earlier, under the hood we build out a network map, using network tomography, of the distance between all nodes inside a given LAN gossip pool. This map is per data center: each data center develops and evolves its own map over time. You don't have to do anything; it happens passively in the background, and it gives you whatever's local. For all intents and purposes this information is current within about 60 seconds; we assume there's going to be a certain amount of jitter in the network, and we try to isolate you from having to know it's happening.

So Consul — what is it? Service discovery, a KV store, health checks, and it's data-center aware. And there's a good ecosystem of tools surrounding Consul that provide additional functionality. I did finish early because I wanted to go back to the demo and poke around at it if there were questions — the ability to fail over has a lot of nuance for each organization, and different people have different pain points they're interested in asking about. Questions? Anyone? Go ahead.

That's correct — yeah, actually that's a good question and a good point. The question was: what about Consul instances that are dual-homed, with a public IP address and a private IP address? One of the things that happens there is that Consul, by default, only binds to your RFC 1918 addresses, period. If you have two interfaces on the box you can specify one explicitly, but by default it only performs gossip on the private IP addresses — so it only maps the network distance between the private addresses. If I have a public IP address, it doesn't care; it's not going to use it. There are configuration options for how you'd want to advertise across WANs, VPNs, and NATing firewalls — I'm not going to touch that here. Other questions? Oh, I did want to get to this.
The integration with dnsmasq, in this case, is very simple. Each host runs dnsmasq listening on 127.0.0.1, and I forward everything in the consul top-level domain to the local Consul agent on the same box — that's the DNS server I pointed out earlier, listening on port 8600. I also set up reverse lookups so I can resolve the actual node names, and I forward everything else off to, in this case, OpenDNS's name servers. A name under node.consul gives me a lookup for every node; this is actually what I used when I set up the Postgres replication — I just specified the node name, vm1, as the host. I can look nodes up locally, or look up information across data centers — a name under node.lab2.consul — and it performs a relay from server to server, Consul cluster to Consul cluster. Go ahead.

That's just using ordinary binary streaming replication — yes, that's exactly what it is. And I used a Nagios check for the health check: I put this demo together using check_postgres, running a custom query that asks, hey, are you in recovery, yes or no? If you are in recovery, then you're healthy as a slave, because that tells me two things: I can talk to the box, and it's responding to SQL. If the node gets promoted, that check fails and the node gets pulled out of the slave pool.

So the question was: what happens when the master fails? If I do dig +short master.pgdb.service.consul and the master has failed, the master tag simply gets retracted — there's no service endpoint advertising the master tag anymore, so that DNS query is going to return NXDOMAIN for a period of time. Go ahead, next.

You can — there's nothing preventing you from adding a check to do that. In this config snippet you can see checks is an array of check objects, so I could add another check that monitors whether my replication lag is more than, say, two minutes behind, or whatever you want it to be depending on whether it's a local network or a WAN. The first check that fails will pull the entire service out. So even though you could have lots of checks generating telemetry, this is not a good place to collect telemetry right now. Instead, you want to perform just the basics of health: is this service alive, yes or no? If it's alive, great; if it's merely struggling under load, you still want it to report healthy here. Next question.

Yes — there are actually some interesting implications here for 2ndQuadrant's BDR work, where you potentially have multiple masters, because that fits very, very well with this model: you don't have to worry about the promotion dance from slave to master and back. To repeat the question, though: the example given was that they set this up in their environment with a check on the number of connections, and depending on whether the connection count was zero or non-zero they assumed a failover had happened — and it caused a spurious failover. Fair warning: you don't want to be overly aggressive here. You really do want this to be, like I said, very binary — healthy or not. Other questions?
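For reference, the "healthy as a slave" check described above boils down to something like this — a minimal sketch with connection details assumed, whereas the demo itself used check_postgres's custom-query support rather than a hand-rolled script:

```sh
#!/bin/sh
# Pass (exit 0) only if Postgres answers SQL *and* reports it is in recovery,
# i.e. it is acting as a streaming-replication slave. Anything else is critical (exit 2).
result=$(psql -qtAX -h 127.0.0.1 -U postgres -c 'SELECT pg_is_in_recovery();' 2>/dev/null) || exit 2
[ "$result" = "t" ] && exit 0
exit 2
```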
This is supposed to be very simple to set up — something you can drop in and it basically just works. I didn't actually get into that part; I only wanted to point out that in this case I configured the replication by specifying the host name using the node address, that was all.

Yeah, that's a really good point, because I've actually deployed that in the past: Consul has a concept of a lock, where you can take a cluster-wide named mutex, so that you can prevent multiple database instances from being promoted to become the leader (a rough sketch of what that looks like is below). Very, very useful, and, I will say, very nuanced — you have to understand how your application and your environment are going to behave when you use it. The state transition from slave to master and master to slave is difficult, but you do want to orchestrate and coordinate it somehow.

Anything else — are we good? I'm happy to take questions afterwards if there are others. Thank you.
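For reference, the lock primitive mentioned in that last answer looks roughly like this — the KV prefix and the promotion script are hypothetical, not from the talk:

```sh
# Only one node cluster-wide holds the lock under this KV prefix at a time; the
# child command (here, a hypothetical promotion script) runs only while the lock
# is held, so two instances can't promote themselves simultaneously.
consul lock service/pgdb/master-election /usr/local/bin/promote_to_master.sh
```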