Particularly because you're only getting about half a demo, given the technical problems we're having, I want to go ahead and link that GitHub repo, which has a bunch of my examples of how to deploy clustered Postgres on Kubernetes. It will continue to get updated. For me, this started around two, two and a half years ago. What I was looking at was: now that Postgres replication is rock solid, I want fully automated, high-availability Postgres. There's no reason that shouldn't be a thing, and not just a thing for consultants, but something anybody can install. That led me into working with this thing called Docker, which was new at the time, because Docker made a lot of promises about building the new cloud, and containerization involves some encapsulation concepts that I thought would be helpful for building a new high-availability Postgres. But I immediately ran into a problem: I go to the first DockerCon and everybody is talking about statelessness. And not only are they talking about it like, "oh, we only do stateless containers right now because it's all we know how to do, and we'll get to stateful containers later." No. They made it a philosophy. And that felt to me like this photo, which I took in the Brussels main train station: an automated system that takes us two thirds of the way, which is about as good as not having an automated system at all, because you've still got to do the other third. So that led me to working with something called Kubernetes.
I even have my Kubernetes shirt on here. The Kubernetes folks, having come from Google ops and that sort of thing, were a lot more interested in looking at production workloads and not just stateless development environments. That eventually led to my resigning my job as a Postgres consultant and moving on to work on this thing at Red Hat called Project Atomic, which is what I do now. Project Atomic is a set of container and immutable-infrastructure underpinnings for what we believe is going to be the new way we build the cloud, as well as a bunch of other things. Getting involved in Kubernetes, one of the things I immediately got involved in was a new effort at the time called PetSet. It was called PetSet; it got renamed to StatefulSet when it actually became a feature. The idea of PetSet was that we wanted specific constructs in Kubernetes that would handle stateful infrastructure, for things like Postgres, because we realized that fundamentally there were ways you had to handle stateful services differently from how you handle stateless services; they need different provisions. This became part of Kubernetes in 1.4. It's still considered beta; it's getting pretty stable, and there are some little bits missing, but they're not critical at this point. So let me back up, because you might ask: if what you wanted was high-availability Postgres, how does that lead to containers? Well, a couple of things. One is that I discovered that an awful lot of what DBAs have to do on Postgres consists of installation of bits (extensions, everything else) combined with configuration.
Even if you use configuration management systems, you just spend a lot of time tweaking the configuration management system. At a lot of these big installs, you've got three Puppet masters on staff because somebody has to be on call. I wanted to get rid of that by having a deployment method that took care of more of the configuration, and that's one of the things containers help with. The second thing is the idea of service encapsulation: if you're deploying a Postgres container, your interaction with that container is through port 5432, and that's it. Not "I'm going to fix this by shelling into the system and tweaking something," because that later becomes a downtime event when someone else works on the system who didn't know you shelled in and tweaked something. Now, everybody pays lip service to the idea of preventing hotfixes, but everybody does them. Having a system that made hotfixes a lot more challenging, that made it easier not to do a hotfix, would help. Also, being able to move database containers through a dev pipeline the same way we would application containers is attractive. But the really big reason, the reason why I cared about any of this (all of that was pretty minor and could be done in other ways), was something called orchestration. The idea of orchestration is that an orchestration framework gives you the underpinnings of a distributed system, all the distributed-system logic, and then you can plug your particular application into it.
Among the things that matter: I worked on a thing a couple of years ago with my co-workers called HandyRep, a complete master-controller system written in Python to do high-availability Postgres. A lot of its code went to things like auto-vivification. I need to make sure Postgres is running; I need to manage routing between the various Postgres nodes; I need to make sure failover happens automatically; and I need to be able to see the state of the whole cluster somewhere I can monitor it, et cetera. Orchestration frameworks actually give you a lot of that. They give you stuff that you otherwise end up implementing yourself in, and I will tell you from experience, literally thousands of lines of code, and it's code that is very hard to test adequately, because testing it adequately involves creating failures in your network. The other advantage of orchestration is that we're starting to get a lot of new Postgres clustering stuff: various forms of binary replication and the new logical replication, but then even more interesting things like CitusDB, the sharded stuff; Postgres Pro is working on a multi-master synchronous Postgres cluster for certain uses; there's BDR, Postgres-XL, some other things. One of the other things we need to do is actually test these new clustered Postgres systems, and it's very hard to test them when a person has to spend an hour and a half setting one up in order to break it. Having a way to automatically deploy these sorts of clustered Postgres systems means we're going to get them to production quality a lot faster.
But the big, big thing, the encapsulating idea of orchestration, is to manage PostgreSQL as a cluster ecosystem: that is, to manage a Postgres cluster as a single thing instead of as a bunch of individual servers that just happen to be connected by replication links. Now, there are some drawbacks. For example, deploying Postgres containers means you need to package Postgres in containers, and there's a lot of extra work involved in that. Then you'll discover that a lot of your standard admin tools might not work with a containerized deployment, or may require a lot of adjustment to work with one. For example, monitoring by tailing the logs becomes more complicated in a containerized environment; it requires extra tools. There are some trailing known performance issues around things like memory management and namespaces that trip some people up. Also, one of the big sales pitches (we had somebody yesterday morning giving the whole Docker rundown, and part of their pitch was being able to pack a whole bunch of containers onto one server): for databases, not so much. We're already using all of the hardware resources; we're not getting any efficiency gain by putting Postgres in containers. The big, big drawback is that all of the technology I'm talking about is new. It's all heavily under development, so stuff does break a lot, and sometimes APIs change. Certainly in Kubernetes they try not to do that, but the operative word here is "try." What that means is that what I'm going to talk about and show you is appropriate for you now if you're already working on building new infrastructure for high-availability Postgres services that will go into production six or nine or twelve months from now.
This is certainly something you can immediately play with, and it's good for development if you're working on one of those clustered Postgres systems, but it is not something to put into production today. There are still some issues. Now, there are exceptions. One of the big workers in this space has been the company Zalando, and they actually have put a bunch of this stuff into production, mostly because the alternatives were even less reliable. They've been able to put in a lot of work to route around the parts that are still under construction, because they didn't have other good alternatives, and that's the reason a lot of us are working on this: there aren't good alternatives. But if you don't have a pressing need like that, then this is something to start playing with now so that next year, when a lot of it is much more stable, you can deploy it. I don't know how much I'm going to be able to show you, because of Internet problems. I do a lot of testing, and this is actually my microcluster; I'm going to have it in the Red Hat booth. It's a stack of Intel builder boards. I started building it to test some of this stuff, because I know from lots of DBA experience that no matter how many times you kill a VM as a way of testing failures, it just doesn't behave the same as an actual hardware failure. So I built this so that I could actually yank the power from things. And I've actually been deploying on it. I'm still working on a bunch of the stuff; we've made some changes to StatefulSet and that sort of thing. The one production implementation you can look at is Zalando's, called Spilo, which is based on Patroni, which is what I'm going to show you. I'll show you the link for that. So, let me see if I can actually log in. Okay, that is still a problem. Ah, there we go. Thank you.
Sorry, I just needed to get the internal address; that was all. What I'm logged into here, which we were just looking at, is a seven-node cluster that I set up running Kubernetes on top of Fedora Atomic. I'm actually using plain Kubernetes instead of Red Hat OpenShift because I need some of the features in the most recent Kubernetes; like I said, everything's under development. On this seven-node cluster I've got a couple of things running. One of the things I deployed ahead of time is a seven-node CitusDB cluster. The reason I did that is that I wanted to load it up with some data, which it spent all of last night doing; I didn't want to do that in front of you, because then you wouldn't get to see the rest of the demo. Other things we can deploy faster. So, let me show you some stuff here. This is the atomicdb repo. Again, this is designed to be a learning example; a production example would need a bunch more stuff, but I tried to keep it as simple as possible so you would actually see the parts that essentially need to be deployed. What I have here is some different deployments in various states of completion. The two that are ready to go are Patroni and Citus. What Patroni is, is a system that automatically manages failover and routing in a single-master Postgres system. Its lineage goes back to Compose.io's Governor; I then worked on it with the Zalando team, and they did most of the work in making it a robust system in Python. It's been out for a while. If you've gone to any of the European Postgres events, you'll have seen Zalando do some presentations on the same code.
There are a few things that get set up with this. Is that readable? Not great color contrast. A lot of this is just going through the setup of some Kubernetes things. The first thing we set up for Patroni, and the thing that was really critical that we got out of this whole new stack, is a distributed consensus store. Because one thing I'd say about the new container stack is that the containers themselves are actually the least important part; what has made the new stack worthwhile, from my perspective, is the tools, and the most important of those are what are known as distributed consensus stores. For years and years we had this thing called Paxos, an academic theory that got implemented for a few high-end telecom systems; nobody else did it. Then one team decided to make a simpler version of Paxos called Raft, and they deployed it. The initial prototype was written in JavaScript, to prove that you could actually do consensus in a way that normal people could hack on, and that could be packaged up as part of another application. Since then things have really exploded, because the important thing about a consensus store is that it gives you multi-node information that will remain consistent even in the event of a network split. For those of you who know about the CAP theorem, we're doing CA here: consistency and availability, but not partition tolerance, because if a node gets cut off it stops being part of the consensus, and if enough nodes get cut off, the consensus store stops responding. The answer is just to have enough nodes. The consensus store we're using is something called etcd. Or not. There we go. Okay. Let's see if this is still responding. Where am I here? Okay.
So, it's something called etcd, which is the simplest of the popular consensus stores. It probably has the smallest feature set, which is one of the reasons it's simpler, but for cases like this, where all I need is a reliable way to determine which Postgres node is the master, it works fine. I'm deploying a three-node etcd cluster on my Kubernetes cluster so that I actually have a way to determine that for Postgres. Okay, that's up and running; that's the first part of it. Now, the Patroni team is experimenting with, and we're actually testing, an embedded consensus mechanism within Patroni itself that would run on the Postgres nodes. For that matter, MMSync, which I haven't finished getting a prototype of, has that inside Postgres. There's a trade-off as to whether you want it external or internal. So now that we've got our consensus cluster up, the second thing is to deploy a bunch of Postgres nodes, and I'm going to let that go while I explain a few other things. One of the other things this led to, with Patroni and a lot of other efforts, was looking at how we handle applications that need to be stateful in a containerized cloud in a reliable way. One of the things I realized, at the same time that a bunch of other people realized it (so it goes by different names), is that it's not enough to have a smart central orchestration system. As a matter of fact, that really doesn't work, because the problem with a centralized orchestration system doing 100% of the work is that you always have this phone-home problem, this game-of-telephone problem: you know the database was up the last time you polled it, but you don't know what's happened in between, and so you can never make a good decision.
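The way a consensus store answers "which node is the master?" is typically a leader key with a TTL that the current master must keep refreshing. Here is a toy, single-process sketch of that idea; real etcd provides the same operation atomically across nodes, and the key path and node names below are illustrative, not from the repo:

```python
import time

class ToyConsensusStore:
    """A toy stand-in for a consensus store like etcd: a key-value
    store whose keys expire after a TTL. Real etcd gives you this with
    consistency guarantees across nodes; this is one process only."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def acquire_leader(self, key, node_name, ttl):
        """Take the leader key if it is absent or expired; the current
        master calls this periodically to refresh its claim. Returns
        whoever holds the key afterwards."""
        entry = self._data.get(key)
        now = time.monotonic()
        if entry is None or entry[1] <= now:
            self._data[key] = (node_name, now + ttl)
        return self._data[key][0]

store = ToyConsensusStore()
# First node to ask becomes master; a second node sees the existing master.
assert store.acquire_leader("/service/pg/leader", "pg-0", ttl=30) == "pg-0"
assert store.acquire_leader("/service/pg/leader", "pg-1", ttl=30) == "pg-0"
```

If the master stops refreshing (crash, network split), the key expires and another node's acquire succeeds, which is exactly the failover trigger.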
There was a wonderful blog post written on this years ago, called "Why Gangsters Can't Scale," about the fact that sometimes you actually do need synchronous communication. The Joyent guys called this the autopilot pattern; I called it "bots, not cattle"; it has other names too. The idea is that each Postgres node needs to have a governor, a simple state machine that will do the right thing based on cluster conditions. In the case of the Patroni deployment we're doing right now, that governor is Patroni, which at its core is a simple state machine: What cluster am I supposed to be part of? Okay, I'm going to connect to the consensus store. Is there already a master in that cluster? If not, then I'll become the master. If there is a master, then I'll try to start replication. If the master has gone away, then we'll have a leader election to see who becomes the new master. And that works because it's happening at the database-node level, so you don't have the loss of communication that you have with a centralized node. What are we doing here? Oh, I know why it's not coming up: I skipped a step. More Kubernetes things. As soon as my console starts responding again, this is the step I skipped, which is to give it some hashed passwords. It's waiting for the passwords, and it will patiently wait forever; I will impatiently wait for a very short time. From the Kubernetes perspective, this is actually pretty simple. It uses, again, this provision in Kubernetes called StatefulSet, and every StatefulSet has two required parts. One is a service. In Kubernetes, services are how you configure the software-defined network: discovery DNS, and how you want services to be accessible to other services and possibly to the Internet.
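The Patroni-style state machine described above can be sketched as a single decision function. This is my simplification of the logic, not Patroni's actual code; the action names are invented for illustration:

```python
def governor_decision(leader, me, postgres_running):
    """One tick of a Patroni-style governor state machine. `leader` is
    whatever name the consensus store currently holds for the leader
    key, or None if the key is absent or expired."""
    if not postgres_running:
        return "start_postgres"
    if leader is None:
        return "run_leader_election"   # no master: try to become it
    if leader == me:
        return "keep_mastering"        # I hold the leader key
    return "replicate_from_leader"     # follow the current master

assert governor_decision(None, "pg-1", True) == "run_leader_election"
assert governor_decision("pg-0", "pg-1", True) == "replicate_from_leader"
assert governor_decision("pg-1", "pg-1", True) == "keep_mastering"
```

Because every node runs this loop locally against the consensus store, there is no central controller whose stale view of the cluster can make a bad decision.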
Then you set up the StatefulSet itself, where I tell it: I'm deploying Patroni, I've got three replicas, and I'm getting them from this Patroni container that I've rolled up; then I set a whole bunch of configuration variables. Now, Patroni actually exposes two ports. It exposes 5432, which is obviously Postgres, and it exposes a REST API that serves two purposes. One is that it lets you send control signals over that API to individual nodes if you need them to do things like shut down cleanly instead of just killing off the container. The other is that it's part of the leader-election process, because we don't want a generic leader election: for Postgres asynchronous replication, we need to know which replica is furthest ahead, and we do that by polling the API on the individual nodes. So let's see. Now, one of the things I talked about is the state machine, so let me show you the second example, the Citus PetSet. CitusDB is sharded Postgres, right? You have one or more query nodes, and you have a bunch of nodes that hold shards. From that perspective it's actually a much simpler setup for the governor. If you want to look at what's involved in creating one of these simple state machines, the Citus PetSet example is a much better one, because Patroni at this point is, I don't know, 10 or 15 thousand lines of code. A lot of that is compatibility with individual systems, because Patroni supports multiple consensus stores and pg_rewind and a whole bunch of other things. The Citus one is a much simpler example if you're doing this on your own, because it's a single file that just says: hey, we're deploying the cluster, we're going to go ahead and set up our Citus cluster.
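The "which replica is furthest ahead" check during leader election can be sketched as a tiny function. Assume we have already polled each candidate's REST API for its WAL position; the positions are simplified to plain integers here, whereas real code compares Postgres LSNs:

```python
def pick_new_master(replica_status):
    """Given {node_name: wal_position} gathered by polling each
    replica's REST API, pick the furthest-ahead replica as the
    candidate for promotion. Positions are illustrative integers;
    real WAL LSNs need proper parsing and comparison."""
    return max(replica_status, key=replica_status.get)

# pg-2 has replayed more WAL, so it should win the election:
status = {"pg-1": 1024, "pg-2": 4096}
assert pick_new_master(status) == "pg-2"
```

Promoting the furthest-ahead replica minimizes the data lost when failing over under asynchronous replication.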
Basically, one of the things StatefulSet does is sequentially number all of the nodes we're deploying. Because of that, in Citus I can just arbitrarily decide that node zero is going to be the query node. And because I know from this file how many nodes there are going to be, I can automatically set up the shards so that the user doesn't have to. The user still has to shard their individual tables, but the shard nodes are connected to the master from the get-go, and it just makes the deployment a whole lot simpler. I also do things in here like messaging the Kubernetes system to tell it about the relationships between these nodes, which then allows us to do other things. Can I actually show any of this? Let's see. I cannot. So, from there we have the setup where we set up our individual information; again, this is really just passing configuration variables to the system and setting up storage, because with containerized infrastructure you have to tell it what the storage is. But one of the things I can do is make use of the Kubernetes discovery system, which is not limited to URLs: I can create an arbitrary label on a node. In this case, when node zero comes up, it assigns itself the Citus role of "query." Then I can create a service that says: when somebody wants to connect to Citus, connect to whichever node has the label "query." That means at run time I can actually change which node is the query node, or, in the case of Citus, more likely bring up a replacement query node, because you don't turn a shard node into a query node in Citus. Let's see. Yeah, now stuff should actually start deploying. We'll see. I do have the Citus cluster up and running, so maybe I'll show you that in a minute.
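The ordinal-based role assignment can be sketched like this; the pod naming convention (StatefulSets append a sequential ordinal to the pod name) is real Kubernetes behavior, but the specific names and the two-role split are my illustration of the repo's approach, not code from it:

```python
def citus_role(pod_name):
    """Derive a node's role from its StatefulSet pod name. StatefulSets
    number pods sequentially (citus-0, citus-1, ...), so by convention
    we can decide that ordinal 0 is the query node and every other
    ordinal holds shards."""
    ordinal = int(pod_name.rsplit("-", 1)[1])
    return "query" if ordinal == 0 else "shard"

assert citus_role("citus-0") == "query"
assert citus_role("citus-3") == "shard"
```

Each pod can run this at startup and then label itself accordingly, so the "query" service selector finds the right node.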
Now, there are some other extra considerations you have to pay attention to when you're running containerized Postgres. One of the big ones is the activity log. You can't just send the activity log to rotating files; you need to put them on an external volume, and then you need to do something with that external volume. That's one option. The more standard way to deal with containerized infrastructure is to log to standard out and then have some kind of log manager pick things up from standard out. However, there is one big drawback to logging from Postgres to standard out. Do people know what it is? No CSV log. CSV logging only works through the log rotator, which is something I would love to see get fixed in Postgres. You also need a bootstrap script, this governor thing, because you don't just start up Postgres. I don't know if my struggling made it clear, but when you define the container (let me show you the Dockerfile here), you're not starting Postgres as the container process. Every container has a process that is what that container is supposed to be running, and you're not starting Postgres as your container process. Instead, you're calling this entrypoint script. In my case, the entrypoint script is a thin shell wrapper around a Python script. So this is actually our PID 1, and that's critically important in these sorts of systems, because you never, ever want the possibility that Postgres is running and the governor is not running; that's where you get all kinds of problems. With the governor (Patroni, or just my entrypoint script for Citus) being the process for the container, I know that if that governor program shuts down, then that Postgres node is not running, and I don't have to have lots of code looking for that problem.
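The entrypoint arrangement described above can be sketched as a toy supervisor. This is an illustration of the pattern, not the repo's actual entrypoint; the child command stands in for the real postgres invocation:

```python
import subprocess
import sys

def run_as_governor(child_cmd):
    """Entrypoint-style sketch: the governor is the container's main
    process (PID 1 in a real container) and Postgres runs strictly as
    its child, so the governor can never exit while leaving Postgres
    running unsupervised."""
    child = subprocess.Popen(child_cmd)
    try:
        return child.wait()     # the container lives as long as the child
    finally:
        if child.poll() is None:
            child.terminate()   # on any governor exit, take the child down
            child.wait()

# A stand-in child process instead of a real postgres binary:
code = run_as_governor([sys.executable, "-c", "print('postgres up')"])
assert code == 0
```

The container's exit status is the child's exit status, which is what lets Kubernetes restart the pod when either the governor or Postgres dies.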
Another thing that comes up is storage. In a containerized environment, you don't just store things in the local file system, because the local file system is part of the container: if the container goes away, those files go away, on top of which performance on layered file systems is terrible. And I've realized that there is no one-size-fits-all answer here. If you've got small databases, like a whole bunch of multi-tenant databases supporting people's web applications, you don't actually want to use network storage for those. What you want to do instead is use local storage on the individual machines or virtual machines and then make lots of replicas, because it's easier and faster to manage Postgres replication than it is to manage network storage. However, if what you have is a 16-terabyte data-warehousing database, the lots-of-replicas strategy is not so good for you, because, among other things, it gets really expensive; plus, if you're on Amazon, they oddly enough don't have any nodes that come with 16 terabytes of storage. In that case, you're going to want to use network storage, what in the Kubernetes world we call persistent volumes. So, there are some other things that need to happen for production infrastructure that are not in my demo. And then, let's actually see... Yes! We have a system. So, we'll have a little wrap-around, but you can see here, in the Patroni three-node single-master cluster, we have a master and two replicas. I'm going to manually kill the pod, because it's a lot faster than waiting for it to notice the machine timeout. Besides killing off that machine, we'll also trigger the Citus cluster. Okay, that'll take just a second to flip over. Actually, here's the fun thing: we locked up again. So, there are some other things that we need.
Better configurations for different kinds of persistent volumes, for one. As I mentioned, everything is still in development, and the buggy issues for StatefulSet in particular are around persistent volumes, because the way it's supposed to work is that you get a dynamically allocated volume share assigned to that particular container, and if we restart the container somewhere else, it reassigns that volume share on network storage. The problem is that right now that works for Google storage and EBS and pretty much nothing else, so that needs a little work. I also want to build up some examples that use HashiCorp Vault, which is how you would actually want to handle the database passwords, as opposed to using the Kubernetes built-in password thing, because those are just hashes; it's not encryption, it's not secure. You also need a bunch of what we call sidecar containers. For example, if you want to take a periodic pg_dump, the way you actually do that is with a container that runs pg_dump. Who has a good example of this? CrunchyData. They have a high-availability Postgres system built on an earlier generation of Kubernetes, before StatefulSet, but they do have a lot of these sidecar containers, et cetera. Then I'm working on the failover stuff for Citus, because in Citus we have shards, we have duplicate shards, and we have the query node, and I want automated failover for all of those, but it gets a little complicated when we have these different roles; part of that gets into a discussion of some changes to StatefulSet. I'm also working on building a few other types of Postgres deployments: Patroni with the built-in consensus, the multi-master clustering, the new governor (which is like Patroni, only written in Go), and probably Postgres-XL, just because it ought to be possible to deploy it. Let's see.
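As an illustration of the sidecar idea, a backup sidecar ultimately just runs pg_dump on a schedule. A sketch of building that invocation follows; the host name, database, and /backups path are hypothetical, and a real sidecar would also ship the dump off-node and rotate old copies:

```python
import time

def pg_dump_command(host, db, when=None):
    """Build the pg_dump invocation a backup sidecar container would
    run periodically. All names here are illustrative assumptions,
    not taken from the repo."""
    stamp = time.strftime("%Y%m%d-%H%M%S", when if when else time.gmtime())
    return ["pg_dump", "-h", host, "-f", f"/backups/{db}-{stamp}.sql", db]

# Deterministic example using the epoch timestamp:
cmd = pg_dump_command("patroni", "appdb", time.gmtime(0))
assert cmd == ["pg_dump", "-h", "patroni",
               "-f", "/backups/appdb-19700101-000000.sql", "appdb"]
```

Running this in its own container keeps the backup tooling out of the database image, which is the point of the sidecar pattern.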
Oh, following the other one. Let's see if we can actually look at the changeover point. No, we lost the changeover point, sorry about that, but if we look: there, we have automatically failed over the master. And yes, this actually does work when you're sending data and writes. We are talking about asynchronous replication, so there's going to be some data loss, but we do a lot to minimize it. Let's see. Oh, here we go. So let's look at our Citus cluster here. Well, let me finish up the slides, and then we'll take a look at the Citus cluster. There are a few things we're waiting for in Kubernetes development, issues and that sort of thing, to really make this an out-of-the-box experience for deploying clustered Postgres, and of course everybody's welcome to help. I've got a whole bunch of links in the presentation, but let me go ahead. So we are actually in a Citus cluster here. Okay, why is this... okay, maybe not. I was going to show you running some queries, but for some reason Google is being unkind to me and not bringing up the page of queries that I want. There we are. Here we go. What I've done here is a demo that was put together by the Citus folks a while ago: slurping down a whole bunch of data from Wikipedia, specifically the edit stream. Wikipedia has a public edit stream of everybody doing edits on Wikipedia. We pulled the edit stream down; I ran it for something like 14 hours last night, so about a million edits over 12 hours of Wikipedia. And that is a distributed query running across all six of the shard nodes. I've actually got 12 shards distributed among six nodes, because they have duplication of shards. And there are a couple of example queries, for instance searching for edits by bots, because there are a lot of bots manipulating Wikipedia.
And another one of the fun examples that he has here is trying to estimate the number of changes made by bots as opposed to changes made by real people. And it's nice and speed. There's about, I guess, there's about at this point 18 gigabytes of change data in this database. And I am running this on T2 medium nodes. So none of the individual nodes have, they all have 4 gigabytes of memory. But because we've started it, they can all return it out of memory. And you can imagine how that would work in a realistic size of instance in a bigger data set. So it can be nice and speedy. But the important thing for my presentation was actually just showing you that the idea is what stops a lot of people with systems like CitusDB is they're like, oh, this looks really cool, but how do I deploy it? And with the example that I have on the AtomicDB site and that sort of thing, you can actually, if you've got access to a Kubernetes system the easiest way to get immediate access to Kubernetes system is, well, right now is going to be GKE with Google. Once we get OpenShift v3, once we get the next release of OpenShift v3 out, it'll also, it'll be OpenShift v3. The OpenShift online is just, it's an aversion of Kubernetes before a stateful set. So my example is we'll work on OpenShift online right now, but wait three months. And you can just go ahead and actually deploy your, you know, Citus system to go ahead and try it out. So thanks a lot. So we have questions. Wait, somebody want to, somebody may mind running this around so we can actually capture the questions. So you mentioned that there's some data loss due to failover. Am I right to say that's just during the milliseconds when the flip occurs? So this is just, it's because Petroni is using Postgre's asynchronous binary replication. So the asynchronous part of asynchronous replication means that we don't wait for confirmation of the replicas. So a commit that just, that just happened on the master, right? 
You know, 10 milliseconds ago — it won't necessarily be on the replicas yet. And that's the trade-off, right? Because Postgres supports synchronous replication, and if we used a synchronous replication system — like I said, I'm playing with Postgres Pro's multimaster, which is a multi-node synchronous cluster — the latency on an individual write is quite high, because you've got multiple network ops, right? You have to confirm it on the master, confirm it on the replica, and the message has to get back to the master and then back to the client. So that adds to your latency and cuts into your write throughput, but it guarantees that there is no data loss on a failover. That's the trade-off. Similar to the Patroni system is Joyent's Manatee, and there's another one in that line, and those do use synchronous replication, because they're more concerned with preventing data loss than they are with performance. In the case of Patroni, we were building it for systems where a large part of what we were looking to do was load-balance reads across large numbers of replicas, and so for that, preventing data loss via synchronous replication was less critical. — There were two different things where you were talking about production readiness. One was that this isn't ready, and then I think somewhere else — I was just kind of trying to capture what you were talking about — so I guess I'm trying to figure out what you see needs to be done to make this production-ready? — So there are two different things. Yes, there are two different things. One is, from the sort of underpinnings standpoint, the Kubernetes community is still working on some stuff with StatefulSet. So we're still waiting for some bug fixes, particularly around persistent volumes, and some other issues — and, for a lot of people, just some more road testing by early adopters. So that's that side of things.
The second thing is, if we're looking at just my atomicdb repo, that's not a production implementation, because you need a lot more stuff, right? You need backup containers. You probably do want some kind of a connection proxy, which I haven't included in there. You want a bunch of other things in order to make it a production infrastructure. And there are some other examples of that if you actually look at — oh, here, let me flip back up to my slides. So I've actually provided links for a lot of these things. If you look at the Crunchy Data implementation — not the core stuff; the core failover is pre-StatefulSet, and so we can do better stuff now — but all of their sidecar containers and that sort of thing. Zalando adds a whole bunch of their production stuff on top of Patroni in this thing called Spilo. That includes a lot of the extra tools that you need in terms of monitoring and in terms of administration. And then they have an example of deploying that through this thing called Helm charts, in the Kubernetes charts incubator. It's just that that's not something that's easily readable, and I actually wanted an example that made it easier for people to understand Patroni and how it's deployed on Kubernetes, and so that's why I've got that example. More questions? — Is this deployment model, do you think, best suited for an application that's also running containerized in the cluster with it, or do you think a traditional non-containerized app would also be appropriate for this? — I would say yes, actually, because the advantages of deploying the application layers on Kubernetes are actually much greater than the advantages of deploying the database on it. So, I mean, likely if you've been considering this, you already have applications that are containerized.
Because particularly if you're separating out the layers, and the applications aren't storing any data, you don't have to deal with a lot of these issues around StatefulSet and that sort of thing, right? You just go ahead and deploy it into the container cloud. So yeah, and that's the idea, because the other important thing is your ability to use the discovery features of the orchestration platform to connect the application to the database. I didn't get this far because of all of the technical issues, but in terms of failover, one of the other things that works automatically is having separate master and read-load-balanced connections. So I can just use that master label that I showed you from the failover and connect to whoever the current master is by using the master label. And that way the application doesn't need to have any understanding of the failover system; it just needs to use the discovery URL to connect. And on a small scale, I can rely on Kubernetes' own internal round-robin load balancing to load-balance the reads. So yeah, that would be a big part of the advantage of actually having the application there — not only on the same sort of system, but in the same orchestration cluster, so that you can make use of the discovery. Any more questions? I think I need to give time for the next person to set up, but I'm happy to take questions out in the hall and maybe even finish my demo for anybody who wants to follow me out there — assuming that I can get the Wi-Fi to work, which is a big assumption. Thank you. Okay, everyone. Good morning. Thank you for attending my talk. I'm excited to be back here at SCALE. I think it's probably my fifth time coming here. I get a lot out of this conference; particularly, there are a lot of really good general talks on some topics that I need to learn more about.
So when you sort of start looking at the schedule tomorrow, particularly when you see if there's something that looks like a good — something where, hey, I was curious what that was — that's the great thing about these conferences: you can come in for 45 minutes and get a good overview of a lot of things. What we'll be talking about today is a topic that's fairly narrow. In fact, it's a talk that I don't give very much, because I think it scares a lot of people. It doesn't scare you folks, so that's exciting. What we're going to talk about today is Postgres shared memory. I'm going to explain why it's important, I'm going to help you understand how it's used, and I'm hoping you're going to get a better understanding of what's happening inside of Postgres and how it works. First I wanted to check: is the volume good in the back? Can everyone hear me? Good, thank you very much. So we're going to talk about six things today, and what I'm going to try to give you is sort of a window into how Postgres works: how it stores data, how it uses shared memory, how it creates new sessions, how it does locking. These are all, I think, very interesting topics about how relational systems in general work. In fact, some of these topics are the reasons that I got involved with Postgres years ago. If you're interested in seeing these slides, right here at the bottom there's a URL, and this presentation, along with about 30 other presentations, is at that URL. If you type my name into Google or into the search, you'll find it — it's a pretty unique name; even if you get it partially right, it'll find me. And there are also a lot of videos of me giving presentations. So if you see a presentation, you may even be able to watch a video of me giving it. People have given me good feedback that that seems to work out well.
But today we're going to be talking about, effectively, how Postgres uses shared memory, and again some storage issues, and how Postgres does some things internally. I think it's fascinating — fascinating not only for Postgres, but because most relational systems use a similar type of setup to work. And that's kind of what drew me to Postgres over 20 years ago. I'd like to take questions from the audience, so if anybody has a question, we can quickly cover that. Let's get started. First off, let's talk about the Postgres data directory. Postgres actually stores all of its data in the file system. And that kind of throws people off; they're like, huh, what, why, what? There are some reasons for that that I'm going to go into, but effectively all of the Postgres data is stored in a specific directory. Now, it's usually not called /data — it's called maybe /opt/postgres96/data or something like that — but there is a particular data directory, and that's where all of the Postgres data goes. Postgres does not use raw devices, for example. So that's kind of interesting; a lot of people are thrown off by that. I get a question a lot: why is Postgres not using raw devices? Raw devices are nice — you can get a little bit more throughput from them — but management becomes very difficult. Anybody who's ever used raw devices knows there's a lot of pre-allocation necessary. I remember having to allocate certain sectors to the database, and then none of the file system tools — the operating system tools — work on this raw thing very well. So the Postgres community basically just moved away from them. In fact, a lot of relational databases are getting away from raw devices, because in the old days you needed raw devices — some of the older file systems, like the System V Release 3 file system, were kind of crusty.
They got easily fragmented and didn't really have high throughput. But some of the newer file systems — ext4, XFS, and ZFS — are very good, and really there's not a whole lot additional we could get out of creating a file system ourselves. So let's dive into this data directory and see exactly what's going on. Inside the data directory is a directory called base. It's kind of a funny name — except it's a database, right? I know. Remember, Postgres came from the University of California at Berkeley; you know, all these students kind of came up with a name: wouldn't it be cute if we called it base? And in fact, that's what it's called. There are some other directories underneath the data directory, but base is the one I'm going to look at primarily. And inside the base directory you're going to see a number of subdirectories — actually, numbers. And each subdirectory actually represents a database. So Postgres has had multi-tenancy, if you're used to that term, since 1980-whatever; we've always had the ability to create multiple databases in a single cluster. In this example here we actually have five databases defined in this cluster. And the numbers can be looked up — there's a way of looking up which directory number represents each database. And you can do cool things: if you just do a du on a directory, that's your storage for that database, right? So you can start to see that by using the file system, management and administration become much simpler, because everything's there. If you shut down the database, that's all your data. And any tablespaces you might use — that's all your data. There are no special places we put it; there are no special kernel calls. It's all just sitting in the file system. Inside the particular directory we're looking at — which happens to be the production database — we have another number; in this case it's the customer table. Which, again, we can look up to figure out which particular number that is for which table.
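The directory scheme described above can be sketched in a few lines of C. Note that the data directory path and the two numbers here are hypothetical examples, not values from the talk; in a real cluster you would look them up in `pg_database` and `pg_class`.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Sketch of how the on-disk path for a table is assembled:
 *   <data directory>/base/<database OID>/<table file number>
 * The data directory and the OIDs used below are hypothetical. */
static void table_path(char *buf, size_t buflen,
                       const char *datadir,
                       unsigned int db_oid, unsigned int filenode)
{
    snprintf(buf, buflen, "%s/base/%u/%u", datadir, db_oid, filenode);
}
```

So a database whose directory number is 16384, containing a table whose file number is 16402, would live at `base/16384/16402` under the data directory, and `du` on `base/16384` reports that database's total storage.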
And pretty much everything in there is either a table or an index, or some helper file — whether it's the visibility map or the free space map — which helps us manage these indexes and tables, which we internally call the heap. Now, you might think, oh, this is a lot of detail — but we're going to go much farther. So inside this table file is a set of 8K pages; Postgres uses full 8K pages. So effectively this particular customer table has four 8K pages — it's a 32K table. Okay, we don't have that many customers yet; we're new. But you get the idea: everything is stored in this 8K page size, and the tables grow by adding additional 8K pages. Inside the 8K page you have a defined structure: you have a page header, then you have item pointers, which are at the front of the page, and then at the back of the page you have, basically, tuples — which is another term for a row. Okay, so in this particular page we have three rows, and you can see that the item pointers point directly to the tuples: you have three item pointers, you have three tuples, and each item pointer points to its tuple. The reason these item pointers are useful is that the index points to the item pointer, not to the tuple, so we can clean up and move around these tuples within the page without having to change the index at all.
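The page layout just described — a header at the front, item pointers growing forward, tuples growing back from the end, with indexes only ever referencing a slot — can be sketched with toy structs. These are simplified stand-ins I made up for illustration, not Postgres's real `PageHeaderData` and `ItemIdData`, which carry more fields.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 8192

typedef struct {
    uint16_t lower;          /* first free byte after the item-pointer array */
    uint16_t upper;          /* start of the tuple area at the back */
} ToyPageHeader;

typedef struct {
    uint16_t off;            /* byte offset of the tuple within the page */
    uint16_t len;            /* tuple length in bytes */
} ToyItemPtr;

typedef struct { char data[PAGE_SIZE]; } ToyPage;

static void page_init(ToyPage *p)
{
    ToyPageHeader *h = (ToyPageHeader *) p->data;
    h->lower = sizeof(ToyPageHeader);   /* item pointers grow from the front */
    h->upper = PAGE_SIZE;               /* tuples grow back from the end */
}

/* Copy a tuple into the back of the page, add an item pointer at the
 * front, and return the slot number (what an index entry would reference). */
static int page_add_tuple(ToyPage *p, const void *tup, uint16_t len)
{
    ToyPageHeader *h = (ToyPageHeader *) p->data;
    ToyItemPtr *ip = (ToyItemPtr *) (p->data + sizeof(ToyPageHeader));
    int slot = (int) ((h->lower - sizeof(ToyPageHeader)) / sizeof(ToyItemPtr));

    h->upper -= len;
    memcpy(p->data + h->upper, tup, len);
    ip[slot].off = h->upper;
    ip[slot].len = len;
    h->lower += sizeof(ToyItemPtr);
    return slot;
}

/* Indexes reference the slot, not the tuple's byte position, so a tuple
 * can be moved (e.g. during compaction) by updating only ip[slot]. */
static const char *page_get_tuple(const ToyPage *p, int slot)
{
    const ToyItemPtr *ip = (const ToyItemPtr *) (p->data + sizeof(ToyPageHeader));
    return p->data + ip[slot].off;
}
```

The indirection is the whole point: compacting the tuple area only rewrites the item pointers inside the page, and every index entry pointing at slot 0 or slot 1 stays valid untouched.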
You kind of get the idea — it's sort of an indirection. By having the item pointers, which are very small — four bytes — you can move the rows around and compact them and so forth as rows are added and deleted, but the item pointer stays in the same place, and the index points to the item pointer. You get the idea. But this is not all we want to talk about; we want to break out one of these rows. So just to summarize where we started: I started by telling you about the data directory; then I told you we have subdirectories for every database; then I told you we have files for every table or index; then I told you the files are made up of 8K pages; and I showed you what a page looks like. Now we're going to go further down, and we're going to look at what a row — or a tuple, in this case — looks like. So I'm pulling one out here, and we're going to look at that particular row: exactly what it looks like on disk, or in memory, because in fact Postgres uses the same structure for both. So there's a row: at the top here we have a header and a bunch of values; in this case the row has six values. Now you might think, oh, okay, that means the table has six columns. Maybe, maybe not. Okay? Because there's a particular aspect down here called the null bitmap. So what I can actually say is: this row has six non-null values. If any of the values are null, we don't bother putting a value for it up here in green; we merely record the null down here in the null bitmap. Very efficient — it can grow and shrink and so forth. So nulls are very simple to store in Postgres: one bit, basically. So if you have a sparse table that has a lot of empty columns in a particular row, that's pretty nice. We have a whole bunch of other information here; this information particularly is related to MVCC processing in Postgres — I do have an MVCC talk that you can look at on my website that explains what this xmin/xmax does and how all of that works — and we have some other
things: how many attributes are in the row; an info mask; and an offset — so there's some overhead there. But let's look up at the green values. These are the six values. And again — we started way out at the data directory, then the subdirectories, then the table file, then the 8K page layout, then the tuple — and now we're looking at a particular tuple, and at what an actual value looks like. So if you have a particular number and you want to insert it into this particular row, the way it comes in is actually as a string, because of course you send a string of SQL — the query — to the server; the number isn't sent as a number, it's sent as a string of digits, effectively. And the way the parser works, it breaks that out and says: you want to insert this particular number, and it's made up of four digits. The string gets passed into a function called int4in — which is, technically, "integer-4 input" — and we effectively store that value right in that column. We call a separate function to go the other way: in this case we have a text value, not an integer, and we want to output it, so we take that particular value and we call a function called textout, and textout outputs to the user whatever happened to be stored there. I know — I think we're at the fourth or fifth level — but this is effectively what happens in Postgres. And just to give you some detail, this is what the tuple header looks like in C. I don't expect everyone here to know C, but you get the idea: we basically define the heap tuple fields up here as a struct, and then down here we have a union in the header, and we have the info mask, and we have the null bits, and here's where we put the tuple fields, which include xmin and xmax — and it all breaks out into this big kind of wacky thing there. So this is basically the way the system works and how data comes in. One interesting thing I want to point out while I'm on this slide is that when we store that number up
there — four digits — we don't store four digits; we store an integer of four bytes. It makes sense: the number might be seven digits, and you couldn't store that in four bytes as characters. So we have to convert it, and effectively the way Postgres does it is that it converts the string into an integer as represented by the CPU, and it just stores it right on the disk, or in memory. I'm going to show you a little later how that works, but effectively we're storing everything as natively as we can for that particular CPU. Fortunately, most systems store integers pretty much the same, although some are little-endian and some are big-endian systems — and effectively, whichever way your CPU likes its integers, that's the way we're going to store it. So I guess my point here, just to close this out, is that Postgres doesn't have a magic format for integers or a magic format for floats. It does have sort of a special format for strings and for things that are not as easily represented in CPUs, but for native data types we just store them however the CPU wants to store them. Okay, so let's talk about something else: let's go into shared memory. Okay — and I'm going to show you some examples of what I just talked about in a minute, in relation to shared memory — but I think this is a great slide; I use it a lot. I do a lot of training, and I have to give this slide every time I talk about shared memory. Why do I have to show this slide? Because what this slide effectively represents is how Postgres, as a process-based architecture, is able to share state among those processes. I'll say that again: Postgres is a process-based architecture; this is how it shares state among those processes. Now you might be like: why are you doing that? Why aren't you using threads? Why don't you put every session in one big process and then manage all the threads? And frankly, that's the way a lot of systems work. I believe that's the way Oracle works; that's the way MySQL works. In fact, I think most
relational systems actually work on the threaded model. But threads have some disadvantages. In fact, if you think back to the 80s and 90s — specifically the 80s — there wasn't a whole lot of use of threads, and in the 90s everyone kind of started to use threads, because process creation was heavy and there were all these disadvantages: CPUs, particularly SPARC CPUs, were better with threads than with processes, and System V Release 4 was very slow with process creation. There were a whole bunch of reasons people started to move — Windows is very slow with process creation compared to threads. So you had this shift to thread-based architectures: Informix in the mid-90s converted, from Informix 5 to Informix 6 and 7, from processes to threads. But Postgres never did that, and I would argue — and again, we can take some questions from the audience — I would argue that that was probably a good thing. Okay, and the reason I think it was a good thing is that a threaded database architecture is effectively optimized for session creation. Threads are a little easier to create than processes, and sharing is easier in a thread than in a process. So you think: okay, I can create my sessions faster with threads and I can share more easily with threads — why wouldn't I use threads? Now, there are some reasons. First off, the difference between process creation and thread creation these days is very minimal on most modern operating systems. For example, on Linux, thread creation and process creation are pretty much the same; there's a very small difference between the two. Effectively, the difference is that the data segment in thread creation is shared, while the data segment in process creation is copy-on-write, for those people who understand those terms. But what process creation gives us is that it allows us to do a lot of things within a PostgreSQL process that are very efficient and do not need to be shared. So parsing, optimizing, executing — almost all of what
you do for an SQL session you probably don't need to share. There is a significant portion you do need to share, and the way PostgreSQL addresses this problem is to create a shared memory segment. So when you start the postmaster — which is that first process right up here — we create that pink area, and then every time you create a new session, we use the fork system call and we create a copy of the postmaster as a child, and that child is a new session. Okay. What's very interesting — and I'm trying to illustrate it here — is that when you fork, your program text is exactly the same; there's no change. You just get a new program counter, and you're sharing the same pages, because we're all running the same binary. Okay, so there's no overhead there. You get a copy of the stack — a private copy; as you grow the stack, you're growing your private copy of it. You get a copy of the data, but as you modify that data, you get a private copy of it. So things like memory allocation are all done privately; we don't have interlocking to manage. If one process goes berserk, it just dies — we exit it — and as long as it exits cleanly, we know everything's cleaned up, because the only shared state is here in the pink. So effectively, in my mind, what we've done in the postmaster is separate the shared part from the not-shared part. It wasn't easy to do, but because we've done that, you have this one shared area which you guard — that's your shared memory segment — and then there's this whole bunch of non-shared memory, the stack and the data, in which you're parsing things, doing optimization and rewriting and then executing. You've got all sorts of stuff going on, and that's all private. Okay, and it actually ends up being very easy to manage, with very good performance, which is kind of nice. Okay, I don't expect relational systems to just give up threads and move to processes now, but I
would argue that this is a very, very clean architecture: very high reliability, very easy for us to develop in, because the shared part is very isolated. Okay? And it gives very good administrative visibility; it's very easy. You can do a ps and see all of the processes. You can run top, and top will show you which sessions are using most of your CPU; or if you want to use iotop, you can actually look at which processes are doing a lot of I/O. You can send a kill signal to one of the processes. Okay, so you have this very clean, easy-to-administer architecture, and I think it actually is a benefit to us — although, again, some people are surprised by it. When you actually think about exactly what you're gaining and losing with threads, and you look at this, it's kind of interesting. So let me take a break for questions again — questions about this, or questions about the data storage, anything there? Yes, sir. — I'm sorry, there's no multi-threading? — No, there is no multi-threading. We do have the ability to do parallelism in Postgres, but we do that through helper processes instead of threads in the session itself. The problem with multi-threading in an architecture like this is that you have to control it — you don't want 100 sessions all creating, like, 10 helpers each; your system's not going to go to a good place. So the way the architecture works — and this is still a work in progress that we keep adding to — is that we have something called background workers, and those background workers act as session helpers. They can do things like sequential scans and sorting and even aggregation. So we actually have helper processes that come alongside the main process and take part of the job — sort of a job-scheduling setup, similar to what you'd do in a web app where you have a work queue and you throw it to a bunch of workers. It's the same kind of process for Postgres. — Same architecture for Windows and Linux? — That's right, and I would argue this is not a good architecture for
Windows; that's a problem, okay? In fact, for people who are doing very short-lived sessions, we'll often tell them to use a connection pooler so you're not starting and stopping a lot of connections rapidly on Windows. On other operating systems it doesn't seem to matter, but on Windows we often tell people, if they're doing a lot of start-stop, use a connection pooler; that way your sessions will stay open and you're not churning a lot of CreateProcess calls. Because we don't have fork on Windows, we actually use CreateProcess and do a whole bunch of maintenance to create a child that runs exactly like the parent. It was kind of hard; it works, but I don't think it's as optimized. Other questions? Someone over here — yes, sir? — I'm sorry: what happens when you update one of these values to null? How do you fix it? — The way Postgres works is that you create a new copy of the row, and that new row will have the null bit set. Okay, so the way MVCC works — and again, I have a presentation about it — is that it actually creates a new copy; the old copy remains available for people to read, the new copy is the new version of the row, and the old version will go away at some point. Other questions? Yes, sir. — What happens if a row is bigger than 8K? — We do have a solution for that; it's called TOAST. Effectively, for any row that's over about 2K — every table has a backup table called a toast table — what we do is take this 2K row, or 5K row, or one-gig row, or five-gig row, which we can support, and we'll take all of the big fields, compress them, and store them in the toast table, and then we'll just put a pointer in the row slot. Even if the row has many small columns, we'll still toast out as much as we can to get it down to about 2K; we'll just keep throwing fields out. And I don't think you can have a table with more than — I think it's 1,600 columns; there's some maximum number of columns — and at a certain point you can't fit enough toast pointers to stay in
the length, and then you're kind of toast. But you get the idea. And the beautiful thing about TOAST — I have a lot of blogs, probably over 400 blog entries on my website, and one of them is about TOAST — the cool thing is what happens when you're doing a sequential scan. Postgres will accept up to one gig in each field — that's what I meant when I said five gigs: we can't have a single five-gig field, but you can have five one-gig fields. So what happens if you're scanning a table and you don't care about one of those particular values? We just hop over the pointer. We don't have to read a gig of data there to get to the next row, which a lot of systems do have to do. So Postgres, in a way, is optimized to prevent you from having to manually deal with that: it just automatically handles it by taking the large values and pushing them into the toast table, and unless you actually query those columns, you're not having to read through that long value to get to the next row. Yeah, it's actually very efficient. Yes, sir. — Okay, so the question is: based on current CPU architecture, is Postgres only able to use one core for one query?
Historically that was true, but in 9.6 we added parallelism. So we have a set number of helper processes — if you have 32 cores, you may create 25 helper processes, maybe; okay, they're called background workers — and then an individual query (and I have a separate presentation about this) can, for example, do a sequential scan and use 8 background workers or 16 background workers, and each one will do an 8th or a 16th of the scan, and it can actually do aggregates on those pieces. For example, if you're scanning a large table, you could do a count with a scan of 1/16th of each part of the table happening automatically; the totals come back to the originator, and then it just adds the 16 numbers, and that's the count. So you have that ability, and we're continuing to improve it. I think in 10 we're going to have parallel index scans; we'll probably be adding parallel index builds; we'll be adding parallel partitioning. There is a wiki page called Parallel Query, and if you go there, you can actually see all of the, like, 12 things we're looking at doing parallelism for. So Robert Haas is working on this — there's a whole team working on it — and I think every new major release we get new parallelism capabilities. But up until 9.6 it was a single core, and that was a major problem — not for OLTP so much, but for OLAP that was a big problem, and I think that's probably the biggest missing piece in getting us to very large OLAP systems. — So the question is: if it stores the integer using a certain layout, big-endian or little-endian, do we record that fact somewhere? — There is a pg_control file which has some information about the architecture; I don't remember if we store that or not. But I will say that we don't care, because when we ask for an integer — I'll show you this in a future slide — we don't really care how it's stored. It's just an integer in a CPU register, and we're going to add it, or we're going to convert it to a string, or whatever. But again, it
doesn't matter — we don't really care what that native format is, in almost every case, because it's just 4 bytes to us. And when we issue an add instruction to the CPU, we don't really care — we may record it somewhere, but we don't really care whether it's big-endian or little-endian, because the CPU is going to do its math the way it knows how to do its math. Yeah, I guess that's a good point: a lot of this stuff is tied to what the CPU wants to do, and that way we don't have a predefined format that we force things into. Yes, sir. — So the question is: what would the issue be moving from a little-endian to a big-endian system — you copy the data directory over? — The answer is Postgres will not start, because effectively the control file that I talked about, pg_control, will look at the data directory and make sure all of the settings match from the old one to the new one; it will just say: I can't run. So it's not a portable format in any way; it's really tied to that CPU. You'd have to do a logical dump of the data to move it. Sure — that's one of the trade-offs. One last question — yes, sir. — So the question is: when you're doing pg_dump, is there an endianness? — No: pg_dump is a logical dump; it's not a physical dump. If you're doing pg_basebackup and you're streaming, there is an endianness to that, because effectively the system you're feeding it into gets a binary copy of the original system, so it has to be a matching architecture or it won't work. It's very tied to that — 32/64-bit too: you're not going to take a 32-bit PG data directory and move it to 64-bit; you're going to have to do a logical dump. Great question, thank you. Let's move forward — we'll come back to that. Here's a very crude diagram of shared memory. This is not exhaustive, but you can kind of see a little bit of what's going on. The shared buffers, which I'm going to talk about in a minute — that's effectively where we do all of the I/O
of Postgres. So all of the input/output — that's our big shared area, because if a process reads data into its private address space, other processes can't see it, and we can't synchronize those changes. So when you're going to read or modify data, you've got to read it into an area that everyone can see, so we can remain coordinated. That's what shared buffers is for. We also have buffer descriptors; I'll talk about vacuum — there is some coordination there between processes; the background worker; the proc array is another area that, again, is shared; here's all the lock information, which is in shared memory; two-phase commit; a whole bunch of other stuff in the shared memory area — and that's this pink area right here. So let's continue drilling down. Remember we had this slide here where we basically drilled down to the row? Let's look at it from another direction: let's look at the shared buffer cache. As I said, the shared buffer cache is probably the biggest part of your shared memory, and it's basically our scratch space for all of the I/O
area for the database. Again, it's shared, not private memory, and it's made up of a series of 8K pages that look exactly like what came from disk. When you read a page from disk, it's exactly the same — there's no translation, no special format. However we stored that integer in memory is exactly how it's written to disk, and we'll read it back the same way later. You can see there's a page: it's 8K, it's got the three item pointers, it's got the three tuples. When you're reading, you're reading from the file system right up into the shared buffer cache, and when you're writing, you're writing that 8K back to the file system in exactly the same format. That's one of the real things I wanted to get across: there's actually not a whole lot complicated here. We're going to store that integer, we're going to store that string; it'll probably sit in shared memory for a while, in shared buffers; eventually, either as part of a checkpoint or via the background writer, it gets written down to disk; and when we need it later, we just copy it back, and it has that exact same format. Each page looks exactly the same — an 8K page — each row has this format, and Postgres is able to read this layout directly, in C code, using the native format of the CPU.

So here's an example: a C function called nocache_getattr. If you look at the Postgres code, it'll look pretty much exactly like this. Effectively what it's doing is looking through a row, hopping from one field to the next. It starts at offset zero and asks: hey, is this field NULL?
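The field-hopping loop this function implements — skip nulls, add each field's length to a running offset, then read the bytes at that offset in the CPU's native layout — can be sketched in plain Ruby. This is a toy model, not Postgres source: it ignores the null bitmap, variable-length fields, and alignment padding, and the field widths are invented for the example. Note the explicit little-endian pack directives — on a big-endian CPU the same values would produce different bytes, which is exactly why a data directory isn't portable across architectures.

```ruby
# Toy sketch of nocache_getattr-style field walking (not Postgres code).
FIELD_LENGTHS = [4, 2, 1]              # an int4, an int2, a 1-byte field
row = [100_000, 7, 42].pack("l<s<c")   # little-endian int4, int2, then int1

offset = 0
values = FIELD_LENGTHS.map do |len|
  bytes = row.byteslice(offset, len)   # "fetch": read len bytes at offset
  offset += len                        # hop to the next field
  case len                             # cast by width, like the fetch macro
  when 4 then bytes.unpack1("l<")
  when 2 then bytes.unpack1("s<")
  when 1 then bytes.unpack1("c")
  end
end
values  # => [100000, 7, 42]
```

The real function also consults the null bitmap before each hop and rounds the offset up to each field's alignment, but the core is the same: an offset that keeps incrementing until you're sitting in front of the field you want.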
If it's NULL, then there's no field there — remember we talked about this: there's no value stored, it's just a bit in the null bitmap. So it says: if this table has nulls and this attribute happens to be null, just continue, because there's nothing there to step over. If the length is minus one, that means it's a variable-length field and we have to do something special. If not, we can just apply the alignment — figure out what the alignment is and how big the field is. If we've reached the attribute we want, we hop out; if not, we add the length and keep going. You can see it's just incrementing a counter: you take the offset and keep hopping forward. Once you get to the value you want, you call fetch_att, which is effectively saying: I am now sitting in front of the field I want to look at — give me the value. It's the tuple pointer plus the offset we kept incrementing. I'm not making this up — here's what fetch_att looks like, although this is the macro version of it, so don't be too concerned. Effectively it's saying: is this a 4-byte integer? If so, just cast the location to an integer pointer and dereference it — really nothing fancy here; there's an integer in those 4 bytes, get it and put it in a register or return it to this particular function. A 2-byte integer? Do the same thing. A 1-byte integer? Then treat it as a single character. And if it's not passed by value at all, then we've got to walk through it and figure out the pointer. I know that was really fast, but I wanted to give you the whole flow: Postgres is looking in that buffer, basically walking the row, and once it hits the value it wants — is it a 4-byte, is it a 2-byte, is it a 1-byte? If it's none of those, then it's a pointer, and we figure out how big that is and pass it around.

There are 1.2 million lines of source code, so I'm showing you a tiny, tiny piece of it, but this is a pretty cool part, and I think it illustrates the architecture of Postgres — and the simplicity of it. Any questions? Everyone's freaked out. Let's move on.

Locking. Locking is something that relational systems have to do all the time. As you can imagine, there's a lot of shared state in a database system, and you're always trying to efficiently make sure you have exclusive access. What I'm going to do is walk you through how Postgres does that. Spoiler warning: there's assembly code coming, so just be aware of that — shade your eyes if it's going to be too much. Effectively, the way Postgres does its lowest-level locking — and this is kind of cheesy — is it takes a special byte in shared memory and uses a test-and-set instruction on it. The thing you should identify here is that it's pink: that means it's not in local memory, it's in that big pink area, the shared memory area. The way the test-and-set instruction works is you take a one and you call test-and-set; if a zero comes out, then you have the lock, and if a one comes out, then someone else has it. So effectively: did we get it? Yes — we can continue on. If we didn't, we failed, and we have to somehow try again at some later point.

This is the sort of scary part I wanted to show you, and I'm sorry it's not bigger — I should have blown it up, I apologize for that. What we're doing here is for x86 — this isn't even the 64-bit call, I think this is the 32-bit version. It's basically saying: hey, is there a one there already? If there is, just forget it — fail right away. If there is a zero in there, then lock
it and try to put a one in there, and if that succeeds, return — and so on. So this is the assembly language that implements that. We have assembly for 64-bit — what would obviously be called AMD64, because that's the architecture that originally implemented it — we have POWER, we have ARM, we have SPARC; for every architecture we supply a native test-and-set assembly sequence. We also have the ability to use compiler-specific things like atomics, which are kind of new, depending on how things go. There are a bunch of different locking types that we use, but I wanted to highlight this one because I think it's pretty cool that you've got a little piece of assembly language, a little piece of CPU-specific code, in all those Postgres binaries.

Okay, just to get a little further along: if the attempt fails, you do what's called a spin lock. A spin lock effectively says: I have failed, so I'm going to go to sleep for a while, and then I'm going to come back — hence the term "spin": you're spinning around until you succeed. So effectively, if we succeed, we go on our merry way; if we fail, we sleep for some specified period of time and then try again. So the simplest primitive is test-and-set; on top of that we build the spin lock; and on top of that we actually have lightweight locks — sorry, this one is called a lightweight lock — where we sleep on a semaphore. If we think we're going to spin too long, you record in a semaphore that you're waiting on something and go to sleep on that semaphore, and when the holding process releases the lock, it wakes you up and you can continue on. So depending on how long we think you're going to wait, either you're going
to a spin-lock state, or you'll go into a lightweight-lock state where you're waiting to be woken up — so you're not wasting CPU waking up every couple of milliseconds to check. We also have a proc structure in shared memory, and a proc lock; it's a sort of slot for every session, with information like what database it's connected to, what locks it holds, what its process ID is, what its snapshot visibility is — a whole bunch of information across all the sessions, which allows us to coordinate very efficiently. This is roughly what the proc array looks like: entries are either empty or used, and the proc array tracks all the busy ones. I'm simplifying, but you get the idea of how that shared state works. And finally, here's that whole pink area again and what goes on in shared memory.

So what I've tried to do is cover the basics of what Postgres is doing: how it stores data in the file system, how it uses shared memory to read and write those 8K blocks from the file system, then a more meta-level idea of how we create processes and how we lock things in shared memory to prevent things being overwritten that shouldn't be, and finally some of the proc structures and the more detailed stuff. That's what I have for today; I have about five minutes for questions — I know we zoomed through that a little quickly. Yes sir.

So the question is: is there a way to determine what type of thing something is waiting on — whether it was doing a spin lock or a lightweight lock? You could technically scrounge through the kernel semaphore state, because of the way it works: each session, each process, has one or two semaphores, and you end up sleeping on your own semaphore. So effectively, if I'm going to sleep on a wait, I'm going to sleep on my own semaphore
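The spin-then-sleep behaviour described earlier can be sketched with a toy Ruby spinlock. This is a sketch, not Postgres code: Mutex#try_lock stands in for the atomic test-and-set on the lock byte, and a failed attempt sleeps briefly before retrying — that's the "spin." (Postgres escalates to sleeping on a semaphore when it expects a longer wait.)

```ruby
# Toy spinlock: try_lock plays the role of test-and-set; on failure,
# back off briefly and retry until the lock is acquired.
lock    = Mutex.new
counter = 0

threads = 4.times.map do
  Thread.new do
    1_000.times do
      sleep 0.0001 until lock.try_lock  # TAS failed: sleep a bit, spin again
      counter += 1                      # critical section
      lock.unlock
    end
  end
end
threads.each(&:join)
counter  # => 4000 (4 threads x 1000 increments, no lost updates)
```

Because every increment happens while holding the lock, no updates are lost even under contention — the same guarantee the shared-memory lock byte gives Postgres backends.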
and when somebody releases that lock, they look at the proc array, see "oh hey, this guy's asleep on my lock," and wake up his semaphore, and he wakes up. The other thing you can do, in 9.6, is use pg_stat_activity: you can actually see what things are waiting on. I did a blog post about it three or four days ago, looking at how the new 9.6 capability in pg_stat_activity lets you see exactly what something is waiting on, and that's really the way you should do it. I realize you can dig through kernel stuff, but this is much easier, because it's very clearly labeled in a column of pg_stat_activity. And if that's not enough — if you want profiling — you can put a little session together that samples: it looks at all the waits, goes to sleep, wakes up again, checks what things are waiting on, and you can build a graph over time that says, hey, 40% of my waits are on this particular lock. Then it gets really complicated — you have to figure out what the lock is and understand whether you can improve it — but it gives you a visibility that you didn't have before 9.6. Other questions? Yes sir.

25% of RAM is the recommendation. One of the reasons it's so low is that Postgres will use the kernel cache as well, so even though we're setting that pink area to 25%, we have access to much more memory via the kernel, and as memory demands on the operating system go up and down, that acts as an elastic cache — it works better across varied workloads. There's some logic that says, well, you're doing some copying from the kernel, so why not make shared buffers really big? You can make it really big, but it makes the system very rigid: if you have unanticipated demands, either from other processes on that operating system or from unusual queries that have a tendency to use a lot of
different memory, then you can really get stuck. So that's the idea: you've got a fixed 25%, and then you've got this elastic kernel cache. It maybe doesn't give Postgres the best possible performance if you could tune it perfectly, but it probably gives superior performance over the full workload — and databases have a varied workload, that's the nature of the thing. Yes sir — that's a great point: pg_buffercache is a contrib module that lets you look at your buffer cache hit rate — how many hits you get versus how many reads you're doing into that shared buffer cache — and it gives you statistics on that. Keith, did you do a blog about that? Was it you? It was really good. He had a post on Planet Postgres last week about how to get your percentage of hits, and there was actually another one last week that said roughly: if it's 100% it's great, if it's 90% it's good, and if it's 80% or lower you should be thinking about some things. So there is some blogging about how to get in there and see whether 25% may not be right for you, but that's usually the starting point. Other questions? Well, it seems I have one minute left, so we'll finish early. I hope you have a great conference — I know I'm going to enjoy the next couple of days. Thank you.

...and the default for this is also 8, and I haven't had to mess with this configuration personally, so it should be fine. Inside your postgresql.conf file you're going to want archive_mode to be on — it defaults to off — and it's really important to note that it requires a database restart in order to take effect. There's also the archive_command setting, and if you choose not to use WAL-E, it can be any command that runs on the server, so you can have it be a script or a different tool. I use WAL-E and it's been working out pretty well for us. There's also a terse option, which basically just tells it to only log the most important
errors to your log file instead of everything — otherwise it would be pretty noisy — so I definitely recommend using that flag. That's the archiving side of the configuration, so that we're pushing WAL files as we get them. One thing to mention: I would recommend not storing your secret access keys and IDs inside your configuration file, and instead using envdir, because Postgres users can check the pg_settings view and potentially access your keys — and there could be users you might not want to see that information. As an alternative, envdir lets you have files inside a directory you specify that hold the key or whatever secrets you need, and it reads from there. It's part of daemontools, it's available in Debian, and you can write a wrapper script if it's not easily available for you.

As you can see, I did not procrastinate on my talk, because this is a screenshot from S3, and these are WAL files — it's great, very beautiful. Once you have that, we can work on restoring the database. To do this I have an automated restoration script, and it follows these steps: it spins up a server, configures Postgres, creates a recovery configuration file, starts Postgres back up, performs a couple of sample queries to make sure the database was set up properly, and then notifies us of its success. Here I'm just starting off — this is a Ruby script — and this part just spins up the server: it's creating the server, there's a little loading bar on the bottom as it's being created, it's about to finish, and it prints out the instance ID as well as the public IP address. So I'm just going to SSH in real quick. There's a log file I'm writing to through the script called recovery.log, and we can see that it set up the configuration files and then started the backup-fetch. We can take a look and see that the backup-fetch process is running right now, pulling the latest backup.

Set up a server: my script
starts up an EC2 instance in AWS with a custom AMI — a custom image that already has a script for setting up Postgres and kicking off the restoration, plus any important environment variables we need, for AWS and for reporting to Datadog, which is the tool I use for analytics. Here I'm just going to quickly run through the script: as you can see, we load our AWS configuration, we start up the EC2 client, and we specify what the instance should look like — this can vary wildly; we make it 4TB to fit everything. We keep going through the configuration, print out that we're creating a server, print a loading bar for as long as the server isn't yet running, and afterwards print the success message: the server was created, here's the instance ID, and we kick off the script on the server. While working on this script, the way it helped me to visualize it was to write down all of the important steps and then translate that to a log file, so the steps correlate pretty closely with what the log outputs: setting up the configuration files, beginning a backup-fetch, reporting when that completes, starting Postgres, starting the restoration, and then reporting to Datadog that everything was successful. You can customize this further and have it output error messages from the other logs you're writing to, but this is pretty simple — just what we need; it could definitely have more errors output to it.

The first thing the script that runs on the server does is configure the Postgres settings: it installs Postgres, the configuration file, and a recovery configuration file. This can be done with a script or with a configuration management or orchestration tool — we use Puppet — but depending on how long it takes to set all this up, it might be faster to just write a quick script that does it for you
which is what we have here. The way you want to set up your recovery configuration file is: you want a restore_command, and a recovery target — either a timeline, a time, or a name. Another thing to note is that the recovery configuration file goes in your data directory. The restore_command here is a WAL-E wal-fetch, and recovery_target_timeline is 'latest', so we're saying we want the latest timeline. But there might be cases where you need to roll back to January 13th, or to a specific version of your deploys, or whatever your recovery target is, so you can also specify recovery_target_time and recovery_target_name. When you start your backup-fetch, this is what the output looks like in the log: it says sub-command backup-fetch, that it's beginning to download, and it tells you its progress as it goes. While the backup-fetch is running, this is what the Postgres log looks like: we have "restored log file" lines, and we can see it going through and restoring all of them from the archive. Once that runs through, you're going to want to make sure everything completed. In the log you'll see text that says "archive recovery complete," and the recovery configuration file is also very conveniently renamed to recovery.done, so you can check for that as well. I just have a method as part of my script that checks that the archive recovery completed on whichever day is the current day. Here in the log we see "archive recovery complete," and a little further down we can see that the database is ready to accept connections.

After this happens, I would recommend running a couple of queries against the database, and something I think is very useful is to check the timestamps of frequently updated tables — tables you know will be written to almost up to the last moment. Here at Procore we have a table called session pages, which is very frequently written to, so I'm just
checking that the created_at timestamp on the last session page matches what we would expect given that day or the recovery target timeline. To check for completion, another thing I'd recommend is looking at the latest transaction log file it downloaded — this is output to /var/log/syslog — and checking that whichever transaction log you see is what you'd expect from your cloud storage solution. After that, I report the results back to Datadog: text that says "recovery successful," the latest timestamp of the most frequently written table, and the latest transaction log. In addition, I have a tag set up so you can easily filter, sort through it, and find it, and this is what the log entry might look like in Datadog. If you don't use Datadog, you can use a different analytics tool, or you can email yourself — anything you want to do, it's up to you.

With all that said, there are a couple of things you might run into. You might have an incompatible configuration between the Postgres recovery server and the master database server — I'll show an example of that in a moment. You might have an instance that's not large enough to hold your recovered database, which would be pretty unfortunate: you've waited all that time, and then it gets to a point where it just can't restore any more and you can't start Postgres. You could also have incorrect keys in your WAL-E configuration; to mitigate that, I'd recommend first doing a backup-list to make sure you can get a list of all of your backups, to check that configuration. The most important thing I can say about what could go wrong is to just check the logs when troubleshooting, because they'll almost always tell you exactly what's wrong. So here's an example of a database whose configuration doesn't match the master: we see "hot standby is not possible because max_connections = 500 is a lower setting
than on the master server (its value was 650); aborting startup due to startup process failure." Here we can see the error tells us exactly what's wrong: our max_connections setting of 500 is not as high as the master server's 650. What we can do at this point is just update the postgresql.conf file to match the master's configuration, and then watch the logs continue rolling through. To test all of this, I would recommend running through the script periodically, SSHing to the server to check on the logs, and double-checking the final recorded transaction log or the frequently updated tables' timestamps — and definitely not waiting for something to go wrong to test it. Untested backups shouldn't even be counted as backups: unless you know that you can restore your entire database from backup, you might as well not be covered at all. So if anyone has any questions, feel free to tweet at me or email sre@procore.com or nina@procore.com. The company I work for is amazing — if you look at the first couple of rows, it's all people from Procore. Thank you so much, and I'm happy to answer your questions now as well.

If I understand the question correctly, there's a user on the database that deletes all the data on the database? Yeah — so the question is: if a user that has all your keys and has access to delete your data deletes all your data, what do you do in that event? No, sorry — what's the solution that would allow you to do that? Yeah, that's what I would think. Okay, I'm just going to repeat the question because it's being recorded, and I'm going to rephrase it a little bit: if you have a user who requests that you delete all of their data from the currently running database, but their data still exists in your backups, what do you do in that case? I'm not sure of a solution that would enable that, but one idea we had from the crowd was to restore all the backups, delete the data from there, and then re-archive everything — which is a solution, for sure. Or have one database per customer — that's true. To do that, there might also be a way to change the archive command: instead of using WAL-E, use your own solution, since any command that runs on the server works as the archive command. So the question is whether WAL-E enables encryption for the backups and the restores — and the answer is yes, it definitely does. Thank you so much; I'm also free to answer any other questions you might have. Thank you.

Thank you all for sticking around — it's late in the day, the end of one of these long conference days, and it's tempting to skip out early and go to the beach... especially when, well, we're not exactly close to the beach. Thank you for sticking around this long. My name is Brad Hirani, I work at Procore in Santa Barbara, California, and this is a talk about migrations: doing safe migrations at scale. By migration I mean, basically, a schema change. We've got a web app — this is mainly a talk having to do with web apps — and every day we're adding new features, and with every new feature we have to add and remove things from the database: we add tables, we remove tables, we add columns, we remove columns, we change things, we add indexes. This is about how to do that at scale — meaning without taking the site down, doing it under load while real customers, real people, are using our software, trying to do their jobs, doing amazing things.

So what exactly is scale? I don't know what your definition is, but I think ours is pretty high. There's actually a typo in here, I apologize: we're serving — or at least scanning — about 20 million rows a second. And actually that number is wrong too, I don't know what I was thinking: we're doing about 30,000 transactions a second, which in my opinion is pretty big scale. Our database
is this big beefy server: it's got something like 128 cores and several terabytes of RAM, and it costs as much as a new Ferrari — pretty much the biggest big-iron server you can buy, a serious box — and we test it to its absolute limits, because we have millions of users using our software. There are really two types of scale I'm talking about today. One is the actual physical scale of the database workload, but there's also a human component to this too. This talk, and the migration strategy I'm going to show you, cannot be divorced from its human context — which means understanding not only the technical side but also why it fits our team, how our team works, and why this is the optimal strategy for us (and may or may not be the optimal strategy for you). Understanding that requires understanding a little about what we do.

I work at Procore. Procore makes software for the construction industry: people who build opera houses and stadiums and skyscrapers use our software for their scheduling and their budgeting. We've got cool iPad apps that hook up to drones so people can fly around construction sites. Where this affects our actual physical scale is that every morning, around eight o'clock Eastern time, we have thousands, tens of thousands, sometimes even hundreds of thousands of people all logging in at the same time, hammering our database, starting up their day — literally sitting in bulldozers with hard hats, iPads in hand, using Procore to build. We've got some slogan — "building the software that builds the world," that's the one — and that's why we tax our database so heavily, and why we have some of these challenges when we want to add tables and columns while people are using the software every day.

So, our stack: we're a Ruby on Rails shop. There are three components here: Ruby on Rails, Postgres, and then the way our teams function and the
way we actually implement the strategy. The Rails part is the least important part for this talk — my code examples are in Ruby on Rails, but don't worry if you're not a Rails developer and have no intention of doing Ruby; stay here, this talk is still for you. I do have some code samples in Ruby that are kind of essential to understanding the human component of this, but I will explain what they mean. Postgres — like I said, we're a Postgres shop. And we have the greatest team in the world, a team that works very, very, very fast; our software changes really quickly. We use the Spotify squad model, and honestly we hire people so fast I don't even know how many people work for us anymore — I think we have about 80 engineers, where a year ago we probably had 30 — split into 14 squads of five to seven engineers apiece, each of which functions almost like a little startup. But they're all working on a single application: a giant Rails monolith, one of the oldest and most mature Rails apps in the world at over 2 million lines of code. We started on Rails 0.9, I think, 12 years ago — you can imagine that presents some challenges itself. And when you've got 80 engineers all working on the same Rails code base — and not only that, also running schema changes and data migrations, adding columns or removing columns, all day every day — we deploy five to seven times a day, and any one of those deploys could have a schema change. We encourage every squad to manage their own. We do not have one person in charge of the schema. I mean, we have a DevOps team, who are also our DBAs — my coworker Nina, who you saw in the last talk, is part of that, and my coworker Jeff, who's got the next slide; they're the greatest DevOps team in the world and they're awesome coworkers — but we don't want to rely on them to allow us to make these changes all day every day. We want all of our 14 squads to independently, without
asking permission, be able to change the schema at any time during the day, to release the features they need to ship as fast as possible. So part of the reason we chose the strategy I'm going to show you is that it allows that level of autonomy: we don't have to have a DBA looking over everyone's shoulder all the time. They're available for help any time we have questions, but if we don't have questions, they trust us web developers — and I am a web developer — to do all this stuff on our own, which is one of the reasons we chose this migration strategy. That's a very important component of this. So what are the goals exactly? (Squads are usually closer to five people, by the way, but we've had as many as ten.) Web developers are responsible for the schema, with minimum involvement by the site reliability squad — we want squads to operate autonomously. And what are these operations? This is the stuff we're doing all day: adding tables, adding columns, constraints, indexes, changing defaults, sometimes actually migrating data from one table to another, and various other things — at scale, under load.

Rails has a pretty cool little system for allowing you to do that: its built-in schema migrations. You type a command on the command line and it creates a migration file where you tell it what to do — okay, I want to create a table — and when this runs, it's going to generate a CREATE TABLE SQL statement and actually create that table. I'm showing you this not to teach you Ruby on Rails, but because it's kind of important to understanding the organization and why the strategy we chose works for us — just so you can get a little glimpse of how it runs things. Our web developers are creating files like this all day, and at deploy time these get run and change the database under load. So we end up with a list of them — here's our migration directory. Rails generates all these files, and whichever ones have not been run at the
time of deploy get run when our code deploys. We run this command from the command line and it fires them off, and you get some output like this — like a Slack message with a link to a web page where you click and can see the output from your deploy in a window. It's connected through a WebSocket, so we get live updates as our deploys are going off. It's a really slick system. So here's an example: adding a users table with a name and an email field, and then adding a unique index on email. Here's another one, adding a column: it's adding to the users table a column called is_admin — it's a boolean, it is non-nullable, and it has a default value of false. Okay, this is about as simple as it gets, right? A migration for a new column. Users, as you might guess, is our most contentious table: every single web request from all our thousands of users hits the users table, sometimes more than once per request. Guess what happens if I run this under load in production? I'll give you a hint... okay, we've just taken down the site. (Do you have a question, sir? Was I about to get to that? Okay.) We've just taken down the site. Why is that? It turns out that adding a column with NOT NULL and a default value takes out an ACCESS EXCLUSIVE lock on the users table, and as it adds that new column it has to fill every single value in the table with false — which means a full table scan on the users table. We have millions of rows in our users table; that's too long for us to block that table, and none of those requests can get in. What happens is basically the web requests keep coming into our web servers, but the table is locked, so all the web requests stall. We have 20-odd web servers running a Ruby web server called Passenger; Passenger has its own queue, which will only queue up 100 requests: 20 web servers, 100
requests 2,000 requests while this table is under lock and kaboom we've taken down the site we have unhappy customers for those of you born in the 90s that icon on the left side is a telephone it's what they used to look like before they fit in your pocket okay we just broke the site well if we want to know why right there's actually this beautiful page I have this thing bookmarked all the information you need to know about locking is right here this is one of those things it's basically just like these four paragraphs this paragraph this list this table it kind of explains all this if you've never seen this page before read it and read it and read it and read it again it's not that long but chew it and digest it because it explains all of what I'm about to explain to you and why did that fail right actually it could fail in two ways and that's an important thing adding that column with the default could fail in two ways one is we want that lock we want that as this exclusive lock on the user table it could not be able to acquire the lock in other words it could fail waiting for the lock or once we have the lock if the lock is open for too long the request start backing up and we take the site down so that's two modes of failure for that one operation that's an important point one it turns out we can mitigate actually pretty nicely which is here's the same migration you just saw that little command set local lock time out five seconds that means it will only wait for five seconds it will wait to acquire that lock for five seconds then it will time out the statement will fail but at least we didn't take the site out now that is not a panacea here we still have a problem we still have a failed migration we're still on fire our pager duties are all going off what the hell went and happened but we didn't take the site down which is better than taking the site down so we really traded one set of a problem but it's something right and actually we actually created tooling 
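As a rough illustration of the lock-timeout idea, here is a minimal, framework-free Ruby sketch. It assumes a connection object with an `execute` method (as the `pg` gem provides); the helper name and the exact SQL are my own reconstruction of what the talk describes, not Procore's actual tooling.

```ruby
# Add the dangerous NOT NULL + DEFAULT column, but fail fast if the
# ACCESS EXCLUSIVE lock can't be acquired within the timeout, instead
# of letting web requests pile up behind us.
# `conn` is any object responding to #execute(sql), e.g. a PG::Connection.
def add_is_admin_with_lock_timeout(conn, timeout: "5s")
  conn.execute("BEGIN")
  # SET LOCAL scopes the timeout to this transaction only.
  conn.execute("SET LOCAL lock_timeout = '#{timeout}'")
  conn.execute(
    "ALTER TABLE users ADD COLUMN is_admin boolean NOT NULL DEFAULT false"
  )
  conn.execute("COMMIT")
end
```

If the lock isn't granted within five seconds, the ALTER TABLE errors out and the migration fails, which is the "we're on fire but the site is up" trade described above.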
I wrote a Ruby gem that includes that lock timeout automatically in every migration, so in all the later examples you won't see that line; it's included automatically through our tooling, which is kind of slick.

Okay, so the site is down. Here's the actual strategy we use to avoid the lock contention I just showed you taking the site down. Part of it involves a little bit of tooling: a Ruby gem called handcuffs, which allows us to run migrations in phases. You set it up in your Rails app (if you're not on Rails, something like this probably exists for whatever platform you're on) and define what your phases are: pre-restart, downtime, post-restart. Downtime doesn't really come into play here, so it's basically pre-restart or post-restart: I can run my migrations before the servers restart or after, and I'll explain why that matters. That gives us command-line tasks that fire off these migrations, and our kick-ass Slack bot runs these commands for us automatically when stuff gets deployed. There's a picture of our awesome Slack integration for all this stuff.

So how do we use that to avoid the problem I just showed you? It turns out we need to take that one migration, adding a column that is NOT NULL with a default, and break it up into multiple parts. Actually, I'm not going to give you the answer right away; we're going to walk through the solution a few different ways and see what works, okay?

Okay, so first of all: the same thing I had before, except I'm just adding the column, the boolean column, but I took off the NOT NULL part and the default. This turns out to be safe. I simply add the column: one statement, one transaction, it's run. Adding a column is okay as long as you don't have that NOT NULL-plus-default combination in there; it is a safe operation.

Next: now I've got a problem. That new column is full of NULLs, and those values should be false, shouldn't they? So what am I going to do here? Let's try this: User.update_all, setting is_admin to false. This just generates a SQL UPDATE statement that updates every row to is_admin = false. Does anyone think there might be a problem with this one? Any guesses? Guess what: we've blown it up again. Why is that? It turns out that when you do an update_all like this, it has to update every single row, millions of rows: another full table scan. Full table scans are bad when you're doing updates, because the way Postgres works, an update takes a row-level lock on each row you've updated. If you're updating every row, by the time it scans to the bottom, every row in the table is locked; you've essentially locked out the whole table. And we have the same problem as before: no one can access that table, the web requests queue up, and boom, we're down and customers are unhappy. We've got people in bulldozers with hard hats on trying to build an opera house, and they can't do their jobs, and we're in trouble.

Okay, let's try something else. We still need to update every value in this column to false. How about this new little thing here: disable_ddl_transaction!. By default, when you run these Rails migrations, Rails wraps everything in BEGIN...COMMIT; disable_ddl_transaction! turns that off, so we no longer have BEGIN...COMMIT wrapping the whole thing. Then we use find_in_batches, which issues SELECT statements with LIMIT 1000, pulling rows back in batches, and lets us run the updates in batches. It turns out this works. Why does this work? Remember, we're under load; the load is the issue. In between each update query, which only updates 1,000 rows per chunk, you've released the locks on the table, and you've given the other processes that are trying to do their reads and writes and deletes and what have you a chance to sneak in and run their operations before you do the next batch. That's why it cannot be wrapped in a transaction: if it's wrapped in a transaction, there's no point in doing batches, because you've effectively locked the table for the whole run. So you disable the transaction and you run in batches.

Now we've solved one problem: every row in the database is set to false. Or do we? Any guesses? By the way, we still don't have our NOT NULL constraint, and we still don't have our default on there. And guess what: we have not actually updated every row in the database, because of race conditions. Here's what I mean. As this is going through updating in batches, you're allowing other processes to sneak in and make changes to the table, including inserting new rows, and because we don't have the default yet, those new rows won't have is_admin set to false. So we've probably gotten most of our rows set to false, most of those million, but there are a few stragglers because of the race conditions.

So how do we fix that? Well, what if, after we did all that, we then lock the table with a big scary ACCESS EXCLUSIVE mode lock? Just for reference, here's the lock matrix. Notice the bottom row, how it conflicts with everything: that's the ACCESS EXCLUSIVE lock. Nothing else can touch that table while it's under an ACCESS EXCLUSIVE lock. So: we take the ACCESS EXCLUSIVE lock, then we update all the rows. No point in doing batches here, because we've locked the table anyway and we don't want to slow it down. Then we set is_admin's column default to false, so the default is on there, and we add the NOT NULL constraint. We're done, right? Maybe not quite. There's still a potential problem here that we haven't solved: we've still got this giant UPDATE, which still does a full table scan inside an ACCESS EXCLUSIVE lock, which is basically the single most dangerous thing you can do. So we're missing a piece; we've almost got it, but we're still missing a piece.

Just to recap a little: the ACCESS EXCLUSIVE lock prevents the race condition, and we did get our NOT NULL in the same transaction; there's a transaction wrapping that whole thing. So by backfilling the table and setting the NOT NULL constraint in the same transaction, we've eliminated the race condition. But we still have a problem with speed, because of the full table scan.

So it turns out what we need is five parts, count them, a five-piece migration, to get around this issue. Here it goes. Step one is the one I just showed you: disable the transaction and update all the column values to the default, in batches. Step two: we add an index on the users table, on that is_admin column I'm adding, but it's an index with a condition, only indexing rows where the column IS NULL. We've just updated, say, a million rows, but we had a few stragglers because of race conditions; now we have an index keeping track of those five rows that didn't get updated. What makes this possible is the awesome algorithm: :concurrently option, CREATE INDEX CONCURRENTLY, a cool feature released in Postgres 9-point-something, relatively recently. It does this two-stage index build that allows an index to be added without locking out writes, which is the really cool feature that makes this all possible. So now this index is keeping track of basically the five straggling rows. Then we do the stuff I showed you before: the big scary ACCESS EXCLUSIVE lock and one giant SQL statement that updates all million rows. But notice there's a WHERE is_admin IS NULL on it, adding a WHERE clause to that update.
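The plan being described here can be sketched end-to-end as the sequence of SQL statements the migration files would run, in order. This is a hedged reconstruction: the table and column names (users, is_admin) come from the talk, but the helper and the exact SQL phrasing are my own illustration, not the speaker's actual generator tool.

```ruby
# Returns the five-part plan for backfilling a new NOT NULL + DEFAULT
# column without a long ACCESS EXCLUSIVE lock, as an ordered list of SQL.
def five_part_backfill_plan(table: "users", column: "is_admin", default: "false")
  index = "index_#{table}_on_#{column}_null"
  [
    # 1. Backfill in batches, with the migration's DDL transaction
    #    disabled (the find_in_batches loop is elided here).
    "UPDATE #{table} SET #{column} = #{default} WHERE id IN (/* batch of 1000 */)",
    # 2. Track the straggler rows the race condition leaves behind.
    "CREATE INDEX CONCURRENTLY #{index} ON #{table} (#{column}) WHERE #{column} IS NULL",
    # 3. One short ACCESS EXCLUSIVE transaction: fix the few NULL rows,
    #    fast because the partial index pinpoints them...
    "BEGIN; LOCK #{table} IN ACCESS EXCLUSIVE MODE; " \
      "UPDATE #{table} SET #{column} = #{default} WHERE #{column} IS NULL;",
    # 4. ...then add the default and NOT NULL inside that same transaction.
    "ALTER TABLE #{table} ALTER #{column} SET DEFAULT #{default}; " \
      "ALTER TABLE #{table} ALTER #{column} SET NOT NULL; COMMIT;",
    # 5. The partial index was only scaffolding; drop it.
    "DROP INDEX CONCURRENTLY #{index}",
  ]
end
```

The key design point is that the only ACCESS EXCLUSIVE section (steps 3 and 4) touches a handful of rows via the partial index, so the lock is held for milliseconds rather than the length of a full table scan.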
that's the that's the column with the index on it now that is lightning fast because it only has to update basically the five straggling rows and with the index it knows exactly where they are then still in the same transaction add the column default changes add the not null constraint now all of a sudden we've done it we still lock the table in access exclusive mode which is scary but because these operations are fast because we only have a few rows to update we've got this index it turns out that it's safe and we're okay and we don't take the site down it's not 100% risk-free because there's still the possibility of a timeout acquiring that lock right I remember I told you there are two modes of failure waiting to acquire the lock and timing out once we're in the lock we still have a little risk here but there's a big advantage to this just to fast forward by the way the last step is to remove the temporary index so that's five steps this strategy contains a tiny bit of risk but it has a big advantage which is one feature one deploy now I told you earlier that like choosing a strategy which has a lot to do with your team how your team functions, how you deploy your code and the speed at which you move Procore moves as fast as a dev shop can move 80 developers all knocking the stuff out at once what's cool is that with this five-part migration I can now release a feature that has a new column in a single deploy five migrations but this is still a single deploy I don't have to deploy in the morning deploy another branch later adding a column in the morning dropping it later it's an important part of us being able to serve our customers so even though there's a little bit of lock intention that you didn't avoid with this strategy we think it's worth it because it puts the migration squarely in the hands of the developers and allows the people to iterate fastly quickly fast and that's what's important to us it may not be important for you you may have different 
priorities that is what's important to us so we think this is a really great balance okay multiple deploys by the way I'm going to go through some of the operations we do which is simpler than this and talk about them there are a few that I mentioned we really try to avoid things where it takes multiple deploy to do a single feature there's some problems we haven't solved though we haven't gotten every single one for instance renaming a column there's still kind of a pain in the beat so for instance renaming we still are stuck on this kind of workflow that is not ideal, create a new column then we change the code so we want to rename a column we got the old column we add a new column and we change our app code to actually write to both of those then in a migration we backfill the new column then we have to do a second deploy later in the day that only writes to the new column and then we have to run a migration that drops it so we have not solved every single problem in terms of getting one feature, one deploy most we have renaming one of the exceptions so let's go through some of these now there's another consideration here at hand that I haven't told you about and this is where I mentioned what handcuffs is handcuffs is a bit of tool, a bit of Ruby tooling that I wrote that allows us to run migrations in phases here's why this is important when you deploy the code new code is now on these servers but it has not actually started running yet until you restart the server so we have this sort of and so by the way we have multiple web servers I think we're 20 plus now, I don't even know 35, thank you we have 35 web servers and each one of those has to go through a rolling restart every time we deploy code now the way Ruby on Rails works is kind of interesting because nowhere in the code base do we tell Rails what the columns in the database are we have a model file which may be like user but it does not have any code that defines the columns what happens is when the 
app starts up it runs queries against the Postgres metadata table to figure out what the columns are and then it caches the table definitions in memory so that it does it and reads and writes it already knows what's there in the table that turns out to complicate deploys a little consider a situation where we've got our database in place and we drop a column well until we restart the server that column is there and if you're not careful you can using the object relational mapper that's built in there accidentally do things that write to that column that doesn't exist which makes people unhappy conversely if you add a column well I deploy the code the code's up there it hasn't taken effect I run the migration that adds the column guess what that schema metadata cache does not know about that column you can't actually write it to it from Rails code which means depending on whether we're adding something we have to decide whether or not we want to run this before or after the server restarts so that makes our life a little simpler and that's where some of the tooling we have comes into play so that means for every operation we've got a few questions does it take it more than 5 seconds ooh that's scary sound like it could take us down can it be broken up into multi pieces okay good we've got tooling for that no we better do this on Saturday or late at night right less than is this operation say we're back filling some data or migrating some data does it take less than 5 seconds no yes are you sure yes are you really really really sure yes okay make sure we have that lock timeout right the statement timeout so that we don't lock the database waiting for so that we don't fail waiting for that lock okay create table let's start with this one here what do we have here well let's see we know that Rails has a cache in the schema metadata if we add a table we know that it won't know about it till after the server restarts so what we have to do is run this before the server 
restarts that way when restart it the cache gets populated again and it knows about a table and our app is happy and can write to it that actually turns out to be a pretty easy migration to one we can write one migration that adds a table and not have fear of taking the site down there's no associated lock contention with this so it's a safe operation dropping table same thing except we want after the server the app to know that it's not there anymore so we run that oh sorry we run that post restart because it doesn't matter the app's not writing to it anyway we've deployed the code that no longer we've deployed the version of the code that does not write to that table anymore the new code is out, the server is restarted we're not writing to that table we can safely drop it after the server has restarted and this is also a safe operation we don't have to split this up into multiple pieces you can't split it up into multiple pieces really drop this one, it's not one that causes lock contention add a column ok so if we're adding a new column that column is nullable and does not have a default turns out this is pretty easy we can run it before the server restarts, the server restarts the cache picks it up our app is happy, this one's pretty simple adding a column not nullable with default this is the scary one, this is the one I just showed you right so either you can do it in downtime you can choose to do this after hours or you can do what I showed you which is that whole five part series, that's what I just showed you that one's pretty tricky by the way just because I said, I mentioned we like to make things as easy as possible for our developers to maintain that velocity, to keep our team autonomous, we actually have tooling so I showed you five piece migration we have one command line tool where you give it the name of the table the name of the column, what the default value should be and it generates those five files for you which is pretty slick, it's nothing 
super fancy or super special, but the team's really appreciated because it allows them to get through their day faster and get things out removing columns, turns out this one pretty safe, but you do want to run that after the server restarts and as a Rails 4 that's safe at least changing a column to null so if we're going from not nullable to nullable it has to be pre-restart if we're going from nullable to not nullable this one can be really, really tricky because when you go to not nullable it's got to lock the whole table and it's got to check it's got to check the constraint so when you say I want this table to be not nullable it has to say okay let's make sure there are no nulls which means the entire sequential table scan it has to look at every value and make sure that this constraint is satisfied and you must backfill it first right change column default well if we're going from default to no default turns out you can do that after the restart that one's really safe actually you can run that pretty much whenever but if you're going from no default to a default you need to run that before the server restarts and that is so I forget why I trust me on that one okay changing column changing column okay this one is dangerous this one is dangerous changing column types it kind of depends it's different if you're going from integer to long it depends how many records are in the table it depends how long the table can afford to be locked so I'm going to say take this one case by case the amount of rows has a big you do get an access exclusive lock the question is how long does it take to sort of acquire it and release it and that depends on the table the amount of things joining against it or trying to insert or read from it and a few various other things there are very fast rules for this one just test it out first try to have a test environment where you at least not know at least if you can't test it under scale at least understand look at that lock matrix to 
understand the locks you're taking out adding foreign keys so this one is also a thorny one foreign key constraints turns out so if I've got users and I've got profiles and I had a foreign key constraint it would be the user ID and the profile table I had that foreign key constraint turns out there's an access exclusive lock on both sides the reason Postgres has to take out those locks is to prevent the raised conditions it has to lock so that because it actually has to enforce the constraint it has to say you've got this foreign key I have to make sure every record with every one of these foreign keys actually exists as primary keys in the other table I have to scan and check that I can't let any of the rows sneak in and do inserts in the meantime so I have to take a full blown lock on this thing and turns out that's on both sides this one is really difficult to avoid you should probably be doing this in the middle of the night or at downtime we could not find a way, at least not yet not without really elaborate systems of triggers and shadow tables mirror tables where we're changing table names the last second of doing this in a safe way and of course you need to be really careful too that your app code is correct because if you deploy a feature that relies on a foreign key constraint you expect not to be inserting things that violate that constraint if you are wrong and your code has a bug and you introduce a row that violates the constraint and then you go and add the constraint this is going to blow up so be really careful with this one probably don't do this in the middle of the day whenever it's trying to get to work removing foreign keys piece of cake not unique indexes this one is pretty easy but just remember to disable the transaction and to use that algorithm currently if you do a normal old ad index it will lock the table if you have that algorithm currently in there it will not you cannot do that inside a transaction take advantage of this feature I 
don't see any other reason actually I do see another reason be careful about this, there was a bug currently they just fixed it I think in the last release watch out for that one if you are on, I forget exactly the version be careful about that yeah that's it for a new index right but I think they fixed that bug but you might be still susceptible to that if you are on a slightly older version I forget the versions I want to tell you but I don't want to tell you and get it wrong so adding a unique index is the kind of the same thing fortunately Postgres has this awesome sort of double scan it doesn't like this two step scan which adds this is a great feature we love it a few things to notice that your developers will be interested in realize that if you're a web developer and you're creating these migrations that when you start to use this tooling to run some of these things pre-restart and some of them post-restart that you actually change the order things run in for instance if you do these in this order the pre-restart ones are going to run first and the post-restart ones after I was really nervous about this when we deployed that thinking what kind of weird subtle bugs could that introduce and thank goodness we haven't found any rollback by the way you always have to make sure that migrations you've got a strategy for rolling back things I would not really attempt to try too hard to automate rollbacks almost every time if something goes wrong and hopefully they don't go wrong using a strategy like this too often you don't have to set up your things to like automatically run the migrations backwards again you probably want someone in there hand massaging this right and getting this right because every case is sort of kind of different and thorny trust humans over tooling on this one you know dealing with the failure case if a migration blows up so we've got some trade-offs don't we that's supposed to be on one line that's supposed to be on one line anyway we've 
got some trade-offs here don't we strategies are not one size fits all we chose this strategy because despite having some locking risk it preserves our team autonomy it allows our web developers to function independently when you do that though you have to have good training right there's the trade-off we put this in the hands of the devs not the DBAs you have to train the devs everything I taught you today this actually this whole presentation started as an internal presentation of me teaching my co-workers how to do this right I thought the content was so good that I'm sharing it with you now but that means that everyone in our case all 80 web developers or at least the team leads or the people sort of watching over this stuff have to understand everything that I just taught you right so there's a significant investment here in training and in tooling as well you know by allowing our developers to write this in Ruby which is way more familiar to them than writing out raw SQL and dealing with a lot of you know direct interaction with Postgres through the command lines through SSH by wrapping it up in Ruby we did have to invest some time in our Slack infrastructure in our deployment infrastructure in the gem I wrote that allows us to run these in different phases in the gem that automatically has the time out so autonomy requires training and tooling and you know if you're going for this sort of strategy make sure you budget for that another trade-off we made we thought it really really really important as much as possible to allow one feature, one deploy which means adding a column with those five migrations only requires one deploy as far as I know maybe someone out there can prove me wrong as far as I know with that constraint it's not quite possible to eliminate all possible lock contention you still sometimes need occasional access exclusive locks the goal is not to eliminate them but to provide time that the transactions spend inside that lock that's the 
trade-off we made we think it's worth it and I guess in general speed and we do speed does require some risk as I said the human element is this is the most important our organization values deploying five times a day with migrations is necessary which means this is not the world's most advanced strategies there are some that are more involved that have triggers adding new tables using triggers to copy stuff from the inserts from the old tables into the new ones things that create a shadow table right where they backfill in the background and then at the last second it swaps the table names there are more advanced strategies than this we haven't implemented them because we're happy with this and because it allows it puts it in the hands of the web developers and allows us to maintain our speed these are not one size fits all it tells you it's just what works for us so a few resources here if you're interested if you happen to be a Ruby shop and you want to check out the things we made check out the handcuffs gem the strong migrations gem I didn't talk about that that's another one there's a couple articles about how this works in zero down time this one happens to be really good I will tweet these slides follow me on twitter my name is down there at bradurani I'm Brad I sometimes blog but not very often I haven't updated in a while so if you're junkie follow me I'll follow you back as I said I work at Santa Barbara I think it's the greatest place to work in the world this is the view from our office you can watch whales migrate through Santa Barbara channel Santa Barbara if you don't know it's about an hour and a half north of here it's basically paradise on earth if you'd like to come work with us solve these interesting sort of problems get out of this polluted rat race and come up with the air is clean I would love to give you a business card after this and tell you about us thank you very much any questions we're actually in carpentria most people don't know 
where that is carpentria is this beautiful little beach town about eight miles south of Santa Barbara it's got this like little main street that runs right into the ocean and it's just like it's just happy there are seagull babies running around and ice cream and all the fun things you can think of that's what it's like every day it's like pleasant land by the beach yes so we're not doing evil we have the greatest squad of QAs in the world and I really mean that so every squad has a QA on it some of them more than one and these are people who are experts in not only software but construction as well even though they typically don't come from that background they learn it on the job and we practice TDD for our features but we also have heavily component of manual testing from our QAs as far as like if what you're talking about is it more about us merging all these branches at the same time in provincial conflicts so we use TDD is developer level so every developer does life tests for everything they do or they're supposed to at least and then every time we commit yes it kicks off a build we use CircleCI you know which is awesome it's hosted it's expensive but we don't care it saves us time and it's all cloud based it's awesome basically what happens is each branch runs independently in its own CI that creates a new merge commit and then the master branch will build and then it's up to the devs manually actually okay it's not manual what we do is we have it rolling on a timer through most of the day it's not exactly continuous deployment because that would mean every merge would deploy the problem there is that with 35 web servers the rolling restart can take like half an hour so we can't quite do it per commit it's basically on a one hour timer through most of the day and then we take a pause in the middle of the day for us to make sure that the code is good that there aren't any bugs and to allow teams if we do accidentally release a bug we've got sort of a pause 
from late morning to early afternoon where people are allowed to only roll out patches and then sort of late afternoon we allow people to merge and deploy again so it's kind of like that it's almost TD just one step off and then a lot of times what happens is because occasionally we do have to like batch like two or three merges if everyone's trying to kind of get stuff out at the same time and we don't want to wait for the whole because our circle build takes like 25 minutes and the deploy takes like 25 minutes sometimes we will merge like three at a time wait for that branch to press branch to press the I and then either manually deploy it or wait for the next one hour mark which we find is really really good we like that system does that answer your question are you using rescue by chance are you using rescue rescue is a background worker thing right and it runs it does not properly cache the metadata queries and ends up running way too many of those I actually don't know the answer specifically to your question I don't remember ever seeing any memory issues related to the schema caching but certain gems watch out for certain gems like do like a do like a log capture of your SQL database someday right and make sure it's not running like crazy amounts of obsessive metadata queries we actually see one per roughly per table not exactly but one roughly per table per web server but you watch and check and make sure you're not running more than that and there might be some bug either in like rescue or some other gem that interacts with active record it's possible could be yeah Rails is a bit squirrely and the memory management the class loading and some of that stuff do you mean during the migrations or in the app code during the migrations yeah we use the model in the migrations I think it's best practice like so if you've got user and you've got a user model right I think it's best practice to redefine the user class at the top of the migration right and add any new 
scope to the relationships you need not it's not a total widespread practice some people just use the one that's in our app which can cause like minor issues like if you people get behind on their migrations like on their dev machines and things change in the meantime but it's usually not it's not something we worry about too much yes we do we do use the app code in the migrations not every time but some people do yes yeah so that is a solution and it's arguably a slightly maybe more advanced solution than this but then you've got multiple deploys because you've got one deploy in which you add the triggers right and then later you've got another deploy which removes it and you know so that's two it could be even depending on what you're doing it could be even three deploys yeah there are other solutions out here this is like between just like naively running anything and potentially breaking things and the world's most advanced with temporary tables and triggers this is somewhere in the middle right it's just a tradeoff that we chose that we like the strategy based on the tradeoff that gives us and really we've had very few problems with this it has happened but we almost never have problems with migration I won't say never but pretty close to never anyone else thank you very much testing testing one two three one two three four five six seven eight nine ten thank you yeah I'm going to run to the bathroom as soon as he gets my audio checked out one two three four five six seven eight nine ten one two three four five six seven time to get started here if I can find my pocket we're going to talk about laboratory retrievers I'm just kidding so my name is Jeff Frost I work for a company called Procore up in Santa Barbara and we do construction project management software as a service so what does that mean we help people build bridges skyscrapers and all sorts of fun things stadiums etc and I'm here to talk to you today about minimizing your major version upgrade 
downtime using Slony. Now, who out here has done major version upgrades before? And just to be clear, when I say major version upgrades I mean going from, like, 9.2 to 9.4, and not that last decimal point. So how did it go? Smooth? Okay.

There are a few different methods of doing major version upgrades. There's the old tried and true dump and restore; there's also pg_upgrade; and then there are various logical replication tools, which is what we're going to talk about today. Dump and restore is everybody's old tried and true, guaranteed to work; you can even do it across architectures. It's pretty simple: you can just pg_dump your database, pipe that to psql, point psql at the new DB server, and exercise patience. You can also dump to what we call the custom format dump file, rsync that file over, and then use pg_restore with the -j flag, so you can leverage the tons and tons of computing power in your new DB server, hopefully. I would say it's probably fine for DBs up to around maybe 300 gigs, depending on what kind of maintenance window you can find. Has anybody done pg_dump / pg_restore for an upgrade? Did it turn out like this? Me too.

pg_upgrade is the next option. It's a good option if you need to do the upgrade in place; not everybody has the luxury of an extra server they can replicate over to, and maybe you don't have primary keys on all your tables. I'm sure everybody here does, right? The only downside is that it's a one-way trip if you're using that --link option, so hopefully you've tested your workflow with the new version of Postgres.

That brings us to logical replication. This is not an exhaustive list of the tools, but these are the ones I'm familiar with: there's Bucardo, there's Londiste by the Skype folks, and then there's Slony, which is what we're going to talk about today. Somebody just asked me why I like Slony so much for doing major version upgrades; in fact, Quinn asked me that just last night, and I said it has graceful switchover. So what do I mean by graceful switchover? Slony defines switchover and also failover. Failover is when your database just died, you need to bring up the replica, and you realize there's going to be some loss of data. A switchover is graceful in that it takes the replication set, moves the origin, and then starts replicating back in the other direction. That gives you this awesome second thing, which is my favorite part: you can gracefully switch back when things go wrong on Monday morning.

So let's talk about it at a high level really quick. It's trigger-based logical replication; you need to have primary keys on all your replicated tables. You kick off an initial sync, which is going to be very similar to a pg_dump; Slony handles all that for you, but it's literally going to copy each table, one at a time, over to the destination. The triggers it put in place earlier are used to store all of the DML in log tables, so after the initial sync is complete, it just replays all of that on the replica until it catches back up. And a little bit of Slony trivia: slony is Russian for elephants, a whole group of them.

Let's review the docs real quick. Slony has these basic terms: clusters, nodes, sets, origins, providers, and subscribers. You're going to go, oh my goodness, what are all those? A cluster, according to the docs, is just a named set of PostgreSQL database instances. That's pretty straightforward. We have to give our cluster a name, and because I'm lazy and love to reuse my scripts, I like to call it "migration", because that's pretty much all we do with Slony at Procore. That has the side effect of creating this _migration schema, which is right here in the DBs that are part of the replication; that's where it puts those log tables, and it's also where it stores the other metadata about your replication. Alright, so what's a node?
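As a quick sketch of what that cluster name looks like in practice (the name "migration" is the talk's choice; the comment style follows the slonik examples in the Slony docs), every slonik script starts with a line like this:

```
# The cluster name; Slony prefixes it with an underscore to derive the
# schema name (_migration) that holds the log tables and metadata.
cluster name = migration;
```

Everything else in a slonik script (nodes, paths, sets) hangs off this one declaration.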
It's just a database that's part of the cluster, and it's ultimately defined by a connection info string. Here are a couple of examples; pretty straightforward. You've got a DB name, a host name, and a user. I always recommend using an individual user for the replication; it makes it easier to debug if things go sideways on you. Then here would be, maybe, node 2: notice the only difference is that it points at the new server. And maybe you're actually just replicating to the same server but running Postgres on a different port; that's also an option.

Now, here we're going to use the word in its own definition: a set is a set of tables and sequences that are to be replicated. This is the meat and potatoes of what we're actually going to be doing. You can have multiple sets in a cluster, which is handy if you want to ship some stuff over to a data warehouse, but we're totally not going to do that today; we're going to put all our tables and sequences into a single set and ship that over to the other side. And then finally there are origins, providers, and subscribers, which is somewhat confusing to people who are new to Slony. The origin is the read-write master, so that's going to be your old DB server. Then you've got the fact that the origin is the first provider, which is where the confusion usually comes in, and the reason for that is that subscriber nodes receive their data from providers, not origins. For the purposes of this tutorial, the origin is also going to be the only provider.

Alright, so how do you get it installed? If you're on a Debian derivative, like we are at Procore, you can go to apt.postgresql.org. They have two packages; you'll notice that one is versioned by your Postgres version, because it's got the C extensions in it, and the other just has the slonik command processor and the slon daemon command line tools. If you're on Red Hat, yum.postgresql.org has it; it's packaged a little bit differently, all in one package that's versioned by your Postgres version. And if you're old school, maybe you just want to grab the tarball, untar it, cruise into the directory, ./configure, make, make install, and have at it that way.

I have one recommendation for you if you're doing Slony in production: I recommend you just put a freeze on all your DDL changes, because you probably don't want to learn how to use Slony with your migration tool. It can be a little bit of a pain; it's totally doable, but you figure we're only going to be running this for maybe a couple of days over a weekend or something like that. There's no reason to hurt yourself and hurt the other devs; just declare a freeze and then lift the freeze after you're done migrating.

So let's get started. What are we going to do? We're going to make a schema-only copy of the DB, then we're going to write our first slonik script, which is going to have all of these lovely pieces in it: a preamble, initializing the cluster, defining some node path info, creating a set, adding the tables to the set, adding sequences to the set, subscribing, and then kicking it off. So we need to get a schema-only copy of the DB. What does that mean? I grabbed the Mouse Genome Database because it was about 40 gigs, which allowed me to capture all the log output that I wanted. I'm going to pipe that guy to psql -h db2.jeftest; you will have had to create the empty DBs already. Who wants to see a live demo? Me too!
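As a rough sketch of that schema-only copy step (host names, the database name mgd, and the slony user are the talk's example values, not something to copy verbatim):

```shell
# Copy only the schema (tables, indexes, functions) to the new server.
# stderr goes to a log we can inspect for restore errors afterwards.
pg_dump --schema-only -h db1.jeftest -U slony mgd \
    | psql -h db2.jeftest -U slony mgd 2> schema_restore.log

# The log should come back empty; any errors here must be fixed
# before replication will work.
cat schema_restore.log
```

If the log isn't empty, the usual culprits are extensions or roles that exist on the old server but not on the new one.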
But then I came to my senses and made some screencasts. So here we go; I wonder how readable that is in the back of the room. Terribly? That's unfortunate, especially because I have several of these. So that's just running the command we saw, and one other item of note here is that we're going to redirect standard error to a log file, because, as you're about to see in a second, this generates a ton of output. Now we're going to take a look at that log file, and it's empty, which is great: no errors, the restore worked, that's awesome. If you find errors in this log file, you're going to have to go fix them; otherwise replication is pretty much guaranteed not to work for you. The most likely cause of errors is probably some extension you forgot to install on the new DB server, maybe some roles that are missing, etc.

So let's write our first slonik script. Slonik is Slony's command processor, kind of like bash for Slony. You call it just like any other scripting language, with a shebang that looks like that if you happen to be using the Debian package. And more good old Slony trivia: slonik means little elephant in Russian.

Alright, so part one: the preamble. We've got our shebang up here at the top; that's actually not technically part of the preamble, just this piece is, and the preamble goes at the top of all of your slonik scripts. So we're going to define the cluster name here; again, I call it migration, because I like to reuse my scripts as much as I can. Then we're going to define the admin conninfo. The admin conninfo is what slonik uses to connect to the databases, not to be confused with what the slon daemons use to connect to the databases; we'll see that in a minute. So we just define it here: I've cleverly named my database server db1.jeftest, here's our mouse genome database, Slony user, and port. And then node 2's admin conninfo is almost identical, except for db2.jeftest.

Next up we're going to initialize the cluster. And this lovely thing here, I started putting these arrows in because I didn't think I could borrow Bruce's clicker that has the laser pointer, but I was able to. So this is an arbitrary ID for your origin node; most people probably use 1, and you can use whatever you like, but do realize you're going to be using it several times later. Also, it's good to give it a nice comment. I always highly recommend you don't say something like "node1" or "bob" or whatever; I like using the server's name. In the case where we were replicating to that same server, to a Postgres cluster on a different port, I would probably call them something like db1.jeftest-old and db1.jeftest-new, or maybe db1.jeftest-94 and -96; making the script self-explanatory like that is pretty helpful.

Alright, then we have to initialize node 2. We do this with the store node command. You notice we've defined that node ID here; again, it's an arbitrary number, so pick whatever makes sense to you, 2000 is fine. We give it a comment again, db2.jeftest in our case, and then this one is the origin's node ID.

So now we're going to set up the paths. Why are we doing this again? Didn't we just do that up in the preamble? Well, remember I told you that the admin conninfo is actually what slonik uses to connect to the databases; this is what the slon daemons use to connect to the databases. They're usually going to be the same, but it's possible that they aren't: maybe you've got a VPN in between the servers, or you've got some command and control system where you're running your slonik scripts, etc. They're usually the same, and in our case they are. It's a little weird to read this at first, but what it means is: client node 2 talks to server node 1 with this connection string. So 2 talks to 1 with host db1.jeftest and then the rest of it, and down here, node 1 talks to node 2 with this connection info string.

Now it's time to create our set. Slony loves to let you pick arbitrary ID numbers; I like to call this set ID 1. Here we are using our origin node
again, and then we've got this clever command for all tables and sequences; I even put an arrow there because I didn't think I'd have a laser pointer.

Alright, time to add tables to your sets. Who here has primary keys on all your tables? Not you, if you work at Procore; sorry, we're still working on it. So if you are one of the people in that fortunate group, you can use this tables constructor. The command is set add table; this is the ID of the set that we created, that's our origin node again, and we can say tables = 'public.*', which is going to add all the tables in the public schema. Well, unfortunately the mouse genome database has zero tables in the public schema, so in our demo here we're also going to have to use 'mgd.*' to get all of its tables.

Alright, now learn from my mistake. When this new tables construct came out, the next time I did a migration I thought, this is so awesome, and since we have like three schemas, I just used the star. Guess what happened: it happily added everything in that _migration Slony schema as well. The initial sync completed, it synced those tables over, replication completely blew up, and it started copying our four terabyte database again. So definitely don't do that.

Alright, now for the rest of us who don't have primary keys on all our tables, I regret to inform you that we have to do it the old school way, which is still set add table, set id = 1, origin = 1, and then fully qualified name = ..., and you're going to list out every single table, one at a time. Sounds awesome, right? Except, of course, for the ones that don't have primary keys; you're going to leave those off. So here's some SQL to the rescue to actually output those commands, because you probably don't want to type them all. Mr. Fetter, it's funny that you ask, because I was getting to that: I tried really hard to pretty-print this, and it actually ends up being about this big and then completely unreadable from a few feet away. I promise the slides will be up, probably by the end of the weekend, so you can copy and paste this and pretty-print it to your heart's content. I probably should, but again, I'm lazy and like to reuse my old stuff.

Alright, well, what about our tables that don't have primary keys? So let's add primary keys if you can. Some frameworks aren't going to let you get away with it, so we're going to find ourselves in this boat: we'll just dump and restore those tables during the maintenance window. Now, I highly recommend you write a script to do that, and I also highly recommend you write a script to do them back the other way, in case things go sideways on you. And you're going to want to time it anyway, so you can get an idea how big a maintenance window you're going to need. Ultimately, hopefully these aren't your big tables; if they are your big tables, go back to step one there and find a way to add a primary key, or go back to pg_upgrade.

Alright, and don't forget the sequences, otherwise super fun stuff happens: your initial sync completes, you get your migration window, you throw up the maintenance page, you switch over, you bring the maintenance page down, and you get all kinds of duplicate primary key errors, and sadness ensues. So the constructor for this is almost exactly like set add table; it's just called set add sequence, and you can do sequences = 'public.*' or sequences = 'mgd.*'. No primary keys are required on sequences, so there's probably no reason to do it the old school way, but just in case you're interested, you can still list them out, if you're trying to get carpal tunnel or something.

Finally, we're going to subscribe our set. So here's our arbitrary set ID; the provider, not the origin, but in our case it is the origin node; and then the receiver is going to be
node 2, db2.jeftest. This part is completely unnecessary for our particular use case, because if we fail this over, it's going to be the only node in the cluster anyway.

Alright, so here's the entire, maybe unreadable from the back of the room, script. Let's go through it one more time. We've got our shebang at the top; here's our preamble, so the cluster name is migration. First we initialize the cluster; node ID 1 is going to be called db1.jeftest. Then we store node 2 and call it db2.jeftest, put the path info in here, create our set, add our tables, and subscribe our set. Alright, let's kick off our script. Does that seem slightly more readable to you? Not really? It looks so much better on our projector at work.

So we just kicked off our subscribe here. This is a warning that says 9.6 is a version it doesn't know about. Fetter, do you know if 9.6 is listed as a supported version already? Yeah, I'm not sure; that happens every time a new version comes out, until it gets specifically added in there. So what it's saying right here is that it's reverting to 8.4 support. Alright, and it says it's waiting for an event, some event number that's unreadable, on node 1; the reason for that is that we haven't started the slon daemons yet.

Now, if you were in the room earlier for Brad's talk, he mentioned that we do a few thousand transactions a second, somewhere around 30 to 35,000. So if you happen to do this during the day, something like this might happen. We used to be able to get away with it, and then, two migrations ago, when we went from 9.2 to 9.4, we did not get away with it: it took the site down, and everybody was sad. So now, since 9.3, and Brad also mentioned this in his talk, this awesome lock_timeout setting is available. What that does is abort any statement that waits longer than this arbitrary amount of time for a lock. The subscribe takes an ACCESS EXCLUSIVE lock on several hundred or thousand tables, and since we only need it for the trigger addition, we just use the PGOPTIONS environment variable and throw in -c lock_timeout=5000, 5,000 milliseconds, before we kick off our subscribe.

Now, if you fell afoul of that and your statements got canceled, it's going to look like this lovely thing, what looks like a big backtrace, but this is what you're actually looking for: it's going to say ERROR: canceling statement due to lock timeout. So now you're actually in a bit of a pickle, because you don't actually know... actually, no, I take that back. In our case this was always happening in one transaction, so everything would have gotten rolled back. It's probably still got the _migration schema in it, so if you want to just try again, you can just drop that schema and then rerun your subscribe. If you've already hit it, you probably can't get away with doing this during the day; if you can't get away with it in the evening either, you might have to take a slight maintenance window. There's one other option, though, which is a little bit outside the scope of this talk, but it's what we have to do at Procore: we write individual set add table commands as individual slonik scripts, and that way you're only locking one table at a time, and if one fails you can just retry that one; we use lock_timeout there as well. Alright, currently on db1 and db2.jeftest we didn't run into any of these problems, because we're the only ones using them.

So let's talk about slon. Slon is the Slony daemon which manages replication; you're going to need to run one of those on each node. And the Slony trivia for that one is that slon is Russian for elephant. Alright, so we're going to start them up, since we already got our subscribe going. Now, if you're going to run this in production, Slony that is, I would probably set up a slon.conf file, have an init script, all kinds of awesome stuff, but since we're going to run it for probably about three or four days, I'm just going to throw it in
the background. So I'm just going to use nohup to call the slon daemon, giving it the cluster name here that we defined earlier, not to be confused with the schema name, which has the underscore in front of it, and then the connection info for the node this slon is responsible for, and then we redirect the output to a log so we can take a look at that later. Now, if you just leave the host off, it'll just use a Unix socket, which is totally fine; I just put this in because it makes it easier to see that this slon is responsible for db1 and this guy is responsible for db2.

Alright, so I catted that; I wrote a script so you wouldn't have to watch me type, but unfortunately you can't read it. What if I change the resolution on that screen? Would that help? So we started it up on that side, we started it up on db2, and then we're going to tail the log. You can almost read it: you can see here on db2 it actually started, it says begin copy of table mgd.acc_accession, and so that means it's off and running. You're also going to see it's got some config sort of debug output here; you can barely see that this says it's a 9.4 version, I think, and this one's a 9.6 version. Since that's pretty unreadable, let's scroll here.

Alright, so now you're going to watch the logs and exercise a lot of patience. The reason I say you're going to exercise a lot of patience is that I imagine the reason you came to this talk is that your problem database is not under 100 gigs. If you're like us, it's closer to 5 terabytes; for us I think it takes maybe about 24 hours, maybe 30 hours, nowadays. You're going to see a bunch of output that looks like this: begin copy. Occasionally you'll see these notices saying it couldn't truncate, that failed, so it's going to do a delete; that's just because the table has foreign keys, and to truncate a table with foreign keys you have to do all of them at the same time, and it's not trying to do that. Then you're going to see something like this: it says lots and lots of bytes copied for table such and such, and then you'll see there's about 5 minutes in between here. So what's going on here is that the copy finished, and in between these two log entries, where it actually says, hey, it took me 300-something seconds to copy that table, it's re-indexing it. That's a little bit faster than copying it over with the indexes already in place. This also has the interesting side effect of being nicer to the master, the origin slash provider node, because it copies single-threaded, then goes over to db2 and happily re-indexes there; the triggers are still in place on the master, but otherwise it's not generating a bunch of IO on the master, so load-wise it tends to be maybe a little bit nicer than, for example, a pg_dump.

Alright, so you can take a look at this sl_status view. What you're going to see here is the lag number of events and the lag time; during the initial sync these are just going to ever increase until the initial sync completes. Then what's going to happen is this is going to slowly come back down as it replays everything that was in those log tables. Then eventually, hopefully the next day, or at least the same day our maintenance window comes up, we're ready to switch over, and we're going to see something awesome that looks like this: it says copy set 1 done, and in our case, for that 40 gig database, it was 1,837 seconds, which I think is about 30 minutes-ish, to give you a sense of how long it takes. So, awesome. Then we're going to take a look at sl_status; Slony tries to keep this lag time under about 20 seconds by default, so we're looking pretty good here. We're thinking, alright, we've got our maintenance window, we're going to be good, time to switch over.

Right, time to write a new slonik script; I cleverly call this one switchover.slonik. It's going to look pretty familiar: we've got our shebang, and here's the preamble that we're used to. What we haven't seen yet: the first one is lock set, with the set ID and our origin node; what this is going to do is lock it
so that nobody can insert or run any modification commands, and then we're just going to move the set. So here's ID 1, that's the set ID; this is the old origin, db1.jeftest, and the new origin is db2. Alright, you think we can switch this thing over? So the first thing we're going to do is actually look at sl_status. You can kind of see here that it's about 20 seconds, one event behind that needs to be replayed, so that's pretty good. What you don't want to see here is an hour or something, because if you run your switchover, what's going to happen is it will complete at some time in the future, which might not even be during your maintenance window. Right, there's our command here; we're going to actually time it and see how it goes.

End of day on Monday, or whatever a busy period for you is, is going to be the test, because that's the best part about Slony: if we have a regression, we can just switch back. So what does that look like? It looks almost the same as our switchover script. Our preamble is the same; looks like I forgot to type the shebang, that's good. And these guys are almost the same, except the origin nodes are switched: the origin is 2 here in the lock set, the old origin is 2 here in the move set, and the new origin is 1. Give it a shot real quick; anybody want to guess how fast it is? Actually, it was a little bit faster: 0.296 seconds, is what it looks like.

Alright, so what if we didn't find a regression? Let's rip it out; we don't need Slony anymore. So that can be as simple as just killing the slons and then dropping the schema with cascade. But watch out for locking: if you drop the schema cascade like this, it's going to take out an exclusive lock on every single one of the tables to pull the triggers off, and you'll probably see that DBA with the fire and brimstone around him again. So safety first, as I always say. Killing the slons, killing them on db2 as well; you can see them running right here, and you can also see their Postgres connections right there. Got them; gone. So you can probably just barely make it out, but I'm setting lock_timeout in my session here, and then I'm going to drop the schema cascade. We're going to do the same thing on db2. Now, db2 would be the hard one, by the way, because this is where we're actually running the app against now, so this is the one where you might run afoul of the lock timeout. We managed to drop it even with the lock timeout. What you can do, sort of like when I was saying write individual set add table commands in individual slonik scripts: there's actually a set remove table command, which we also have to use to clean up after ourselves, so you can just make individual set remove table scripts, it just takes the table ID, and after you've got all those triggers out, you can drop the schema, no problem.

That's pretty much it. As you may have guessed, we are hiring; we've actually got a booth over in the exhibit hall, if you want to come by. If you like to surf, Santa Barbara is awesome for that; our office is actually right on the bluffs in Carpinteria, and you can see Rincon Point from there, one of the best surfing spots in California. Any questions?

That's an excellent question. So the question was: what happens if one of the servers goes down? So if the replica goes down and you bring it back up, are we talking about during the... oh, so there are four parts here: is it during the initial sync, while it's catching up, or while you're waiting to switch over? It's probably most interesting in the sync state. So if it happens to fail during the sync state, first of all, we just backgrounded our slon, so you're going to have to start it up by hand again; don't forget about that. In the sync state, it's actually going to have to kick off the initial sync again, which is a sad and much-maligned thing about Slony: it doesn't save state about the initial sync, so if that fails for any reason, it'll just start it over again. It used to actually start it over again and
then always fail, but fortunately that's been fixed. So now, if it fails after the initial sync, while it's either catching up or it's just been replaying for a while as you wait for your maintenance window, you start the server back up, start the slon back up, and it'll connect and carry on about its business. So what if the primary goes down during the initial sync? It'll fail your initial sync. You'll also probably get a PagerDuty alert, so there's that, but you'll have to start the slon back up on that side. What that slon is doing during the initial sync, by the way, is just waiting for the other slon to tell him: hey, I'm done, I'm in sync. So you could actually think that your replication is still happily continuing; it would finish, you would look at sl_status, and it would still say you're, like, one day lagged or something. And then if you actually start that slon back up, it'll notice the messages in the events table and carry on. Anybody else?

Mr. Fetter, I have not, and I'm actually interested. I didn't list pglogical, mostly because when I was looking at the docs, and I haven't really played with it, I didn't see any failover method. I assumed that you could just sort of tear it down to do it, so I would put it in the same category as Bucardo, where it doesn't have that graceful switchover, graceful switchback, but at least you could sort of use it. By the way, there's one big caveat with this, which is that you can't use Slony with RDS, because you need C extensions, which is a big downside.

Oh yes, DMS, yes. So AJ would love to tell you all about his experience with DMS; I can tell you that it was not positive. This is kind of unrelated to Slony, but we were actually migrating one of our services from our private OpenStack cluster up to Amazon, and we initially went to move it into RDS, and we were initially using Bucardo to do it. And then for some reason it failed, and we weren't sure why, and as part of the, not the sales call, but the tech call with the RDS guys, they said, oh, we have this new tool called Database Migration Service. We thought, awesome, let's use that, and then we started it up and proceeded to have all sorts of problems with it, so we just gave up on that, went back to Bucardo, and we're happy. But AJ will be happy to fill you in on the details. Alright, anybody else? Thank you very much.
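For reference, the setup pieces walked through in the talk assemble into roughly the following slonik script. This is a sketch using the talk's example names (db1.jeftest / db2.jeftest, the mgd database and schema, a slony user); the IDs, conninfo strings, and shebang path are illustrative values, not something to copy verbatim:

```
#!/usr/bin/slonik

# Preamble: cluster name plus the conninfo slonik itself uses.
cluster name = migration;
node 1 admin conninfo = 'dbname=mgd host=db1.jeftest user=slony port=5432';
node 2 admin conninfo = 'dbname=mgd host=db2.jeftest user=slony port=5432';

# Initialize the cluster on the origin and register the subscriber node.
init cluster (id = 1, comment = 'db1.jeftest');
store node (id = 2, comment = 'db2.jeftest', event node = 1);

# Paths: what the slon daemons use to reach each other's databases
# (usually, but not always, the same as the admin conninfo above).
store path (server = 1, client = 2, conninfo = 'dbname=mgd host=db1.jeftest user=slony');
store path (server = 2, client = 1, conninfo = 'dbname=mgd host=db2.jeftest user=slony');

# One set holding everything we want to replicate.
# Note: 'mgd.*' deliberately avoids matching the _migration schema.
create set (id = 1, origin = 1, comment = 'all tables and sequences');
set add table (set id = 1, origin = 1, tables = 'mgd.*');
set add sequence (set id = 1, origin = 1, sequences = 'mgd.*');

# Ship it to node 2.
subscribe set (id = 1, provider = 1, receiver = 2, forward = no);
```

The switchover script described in the talk is just the lock set / move set pair against the same preamble, and switching back swaps the origin IDs.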