The title of this talk is Lessons Learned Operating Postgres at Scale. I've been using Postgres for nine years. I think my first installation was 8.2, which we upgraded to 8.3 in 2000-and-something. That was a migration of an esoteric Perl application, xml.com (if anybody knows about that, that's crazy, but you should come say hi). It was backed by MySQL 4, about a 20 gig database for a content site, a very read-heavy application. And I've been using Postgres ever since. I've also been at Braintree for five years. We were acquired by PayPal two and a half years ago, and the growth trajectory has been fairly insane, to put it mildly, so lots of really fun challenges. Primarily today I work on application infrastructure scaling at Braintree, but over the years I've had my hands in lots of different products.

So who is Braintree? We work with all these awesome companies and a lot more; many of them you've probably heard of. If you're staying in an Airbnb right now, we probably processed the payment. If you've taken an Uber, same thing. If you have code on GitHub in a private account, and so on. At a high level, we make it really easy to accept payments online and in mobile apps. Our tools range from very drop-in, where you write minimal code to configure your payments experience, down to much lower-level APIs where you can handcraft the whole thing.

Obligatory disclaimer: these are my views, not those of PayPal or any PayPal subsidiaries. They're also based on my experiences, so your mileage may vary. I promise I'm being as accurate as I possibly can be, but many of you here are much smarter than I am and may have slightly different takes on some of this. If you do, I hope you'll come talk to me about it, because I would love to hear.

Okay, so: scale. As our applications grow, our customers grow, we grow, we build new products. And eventually, I've found, everything tends toward being an online operation. I don't remember the last time in my professional career that I've been able to take a maintenance window. So as database people we spend a lot of time figuring out the right trade-offs between availability for purpose A and availability for purpose B. I don't want to equate scale with size, even though size does tend to make life more complicated for your database. You can have all kinds of unusual situations: I remember talking to somebody at a conference, it might have been here last year, who ran very small databases out in the wilderness that were eventually synchronized back. That's a different sort of problem than trying to store tens of terabytes in a single Postgres database. So really, it's about your trade-offs: what resources do you have available, and how do you want to manage them to get your work done? I'm going to talk about a few things I've found I have to consider over the years, and hopefully they'll be useful for you as well.

First: most people who operate Postgres know that B-tree indexes bloat over time, especially on update-heavy tables. For example, say you have a users table.
You have a column called something like last_page_view_at, say because you want to see the most recently active users. Questions of whether that's the right way to store this data aside, you could do that, and you'd have this one users table that's very heavily updated. If you have an index on that column, it will just grow and grow and grow until you reindex it.

As a quick example: create a simple table and insert 100,000 records into it. The only column is a bigint primary key, and I only did it that way because I was lazy and didn't want to create the index in a separate command. You get about a two megabyte index. Delete 50,000 records and it's still 2.2 megabytes, even if you vacuum, because the remaining data is scattered throughout that space. Add another 50,000 and the index is three megs; REINDEX, and we're back down to two.

But note that REINDEX is not a concurrent operation. Again, if we can't take maintenance windows, and we can't say it's all right to block writes to this table for some long period of time, REINDEX is probably not going to work: you can't create a 100 gig index instantly, that's going to take a lot of time. Of course, if your business permits the downtime (maybe you're a school and nobody minds work happening at 4 AM on a Saturday, or you're in financial services and get a little more flexibility when the markets are closed), maybe you can do it. But I have not really known anybody who can just use REINDEX.

So what do we do instead? First, we monitor our bloat, because that's what we do. You can use check_postgres, the Bucardo project's plugin, which outputs in about any format imaginable (Nagios, Ganglia, and all kinds of other good stuff), and Josh Berkus has a script up on GitHub that you can also use to monitor bloat.

Once we see bloat, we want to manage it. If your indexes have grown from 1 gig to 10 gigs, or 10 gigs to 100 gigs, for really not that much data, we can either use pg_repack or write our own reindexing scripts. It might look something like the sketch below, though there may be more you want to do. The trade-off to consider is that ideally we would reindex fairly frequently, because it's nice to keep index sizes manageable, but reindexing consumes a lot of the disk I/O you might want for your users to get things done. So when you're deciding how frequently to reindex, you have to ask how often you're willing to eat the performance overhead of that maintenance operation.

And one important gotcha: if you're not running 9.2 and don't have DROP INDEX CONCURRENTLY, you'd better get the index's actual data files out of the kernel page cache before you drop it. DROP INDEX takes out a lock on the table whose index you're dropping, but those files aren't removed instantly; you have to wait for the page cache to flush everything out. So if you're expecting DROP INDEX to run in 100 milliseconds and you have 100 gigs of index hanging out in memory, because you have a lot of memory or something, you might have a worse time than you think you're going to.
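As a minimal sketch of such a script (the table and index names here are hypothetical, and a real script would add error handling), a concurrent rebuild might look like this:

    -- Rebuild a bloated index without blocking writes.
    -- CREATE INDEX CONCURRENTLY cannot run inside a transaction block,
    -- and DROP INDEX CONCURRENTLY requires 9.2 or later.
    CREATE INDEX CONCURRENTLY users_last_page_view_at_idx_new
        ON users (last_page_view_at);
    DROP INDEX CONCURRENTLY users_last_page_view_at_idx;
    ALTER INDEX users_last_page_view_at_idx_new
        RENAME TO users_last_page_view_at_idx;

One thing a real script has to handle: if the CREATE INDEX CONCURRENTLY fails partway through, it leaves an invalid index behind that you have to notice and drop yourself.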
For stuff like this I would always say: just try it, and upgrade if possible, because then you get DROP INDEX CONCURRENTLY. I'll just note that this also impacts TRUNCATE.

Another fun thing, one that I know at least I look at a lot more these days, is partial indexes. A good way to keep updates from bloating an index is to not index those rows in the first place. So if you have a large table with some sparsely populated column, add IS NOT NULL as a condition on the index; you probably don't need to track millions and millions of NULLs unless you're specifically querying for them. You might also have a table with some very low cardinality column, maybe 10% true and 90% false. You can index just the trues, and then if you need to grab all those trues you can get them very quickly, and all is well. Sometimes you can design for this up front to make your experience a little better, but as we well know, as systems grow and evolve you don't always have the luxury of making fairly dramatic schema changes. So you might want to add something like this to make your life easier in the meanwhile.

So that's a quick run through indexes. I want to talk about autovacuum, something I've spent more time thinking about over the past year and a half or two years. Everybody here may be familiar with autovacuum, but in case you're not: it's responsible for periodically noticing that a lot of a table has changed and running a vacuum operation to garbage collect the old dead tuples, reclaiming the space for Postgres. It also has the important job of preventing transaction ID wraparound, a condition that will force the database to shut down. To support MVCC, Postgres uses a transaction ID space in which, at any given time, roughly two billion transactions are in the future and two billion are in the past. The important side effect to know is that you need to freeze your tuples at least every 2^31, or around two billion, database transactions. I think this has started to be talked about a bit more in the past year, at least broadly outside the Postgres community, because two different companies had fairly major autovacuum-related outages last year.

Another important attribute of autovacuum is that it cancels itself if a conflicting lock is taken, unless it's running to prevent wraparound. If you look at an autovacuum and the query text says VACUUM on some table and then, in parentheses, "to prevent wraparound", it will not cancel. So if I then come along and run ALTER TABLE on the table being vacuumed, anything that takes an exclusive lock, my statement is blocked by that wraparound vacuum, and every subsequent query stacks up behind it. This is what Joyent talked about in their postmortem around this issue. There are ways to protect yourself, and this is really most problematic if you're autovacuuming almost all the time, which in large databases you might be. You can set statement timeouts; lock_timeout was introduced in 9.3. And you can add an optimistic check ahead of time: if you're running DDL, try LOCK ... NOWAIT first, and if that fails, just retry later.
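As a minimal sketch of that defensive pattern (the ALTER here is a hypothetical piece of DDL; the retry logic would live in your migration tooling):

    -- Option 1 (9.3+): give up quickly instead of queuing behind an
    -- anti-wraparound autovacuum and blocking everyone behind you.
    SET lock_timeout = '2s';
    ALTER TABLE users ADD COLUMN notes text;

    -- Option 2: optimistically take the lock first; NOWAIT errors out
    -- immediately if the lock is unavailable, and you retry later.
    BEGIN;
    LOCK TABLE users IN ACCESS EXCLUSIVE MODE NOWAIT;
    ALTER TABLE users ADD COLUMN notes text;
    COMMIT;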
But if you're running DDL, or you take any exclusive locks as part of normal operation, this can really get you. You really do need to be careful, or you'll block on some critical table and take an outage. So we each have to decide for our own applications: is it better to block, or is it better to fail? I think one of the cases where this caused problems for people was a migration that did something like CREATE TRIGGER IF NOT EXISTS, and that will do it. Would you rather have that statement fail? Would you rather wrap it in a lock_timeout, so it fails for a different reason but fails fast? Or would you rather it actually block? In this particular case it sounds pretty obvious that you want the statement to fail so traffic can continue, but that might not be the case if you're doing more critical DDL that isn't part of normal operation.

With autovacuum, especially when you're running it almost all the time, you really need to consider your cost limits and cost delays. These are the two things you primarily tune to control how fast vacuums are allowed to run. Just as with creating indexes, you have limited disk I/O available for ongoing operations, and autovacuum draws from the same pool. Wraparound incentivizes us to vacuum quickly, but ongoing operations disincentivize us from vacuuming too quickly. If you set a cost delay of zero, vacuum will run basically as fast as it possibly can, and you probably don't want three autovacuums running flat out on a production system where people are trying to get work done. That might not end well for you. The mechanics: every time vacuum accrues vacuum_cost_limit worth of page operations (the per-page costs are values you also configure), it sleeps for vacuum_cost_delay milliseconds, so a lower cost limit, or a higher cost delay, is more conservative.

To figure out the right values, you need to understand how quickly you're consuming transaction IDs. That's the age of the datfrozenxid value in the pg_database catalog. You need to think: I'm generating X million a day, so how fast do I have to autovacuum to keep that age from getting too far along? But you also need to know how much throughput you have available. There are tools (fio, Bonnie++, sysbench, et cetera) you can run so that you know, in a naive benchmark-type case, I can get X out of my system.
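As a minimal sketch, here's the standard pg_database query for watching that age, plus the throttling knobs; the values shown are hypothetical, not recommendations:

    -- How close is each database to wraparound? Alert well before ~2 billion.
    SELECT datname, age(datfrozenxid) AS xid_age
    FROM pg_database
    ORDER BY xid_age DESC;

    -- The autovacuum throttle: every vacuum_cost_limit worth of page costs,
    -- the worker sleeps for vacuum_cost_delay. (ALTER SYSTEM is 9.4+;
    -- on older versions, edit postgresql.conf instead.)
    ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;
    ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '10ms';
    SELECT pg_reload_conf();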
So again, part of why we want vacuum to be able to go faster: say you have a very large database, many terabytes, 10 terabytes plus, 20 terabytes plus, whatever. 2^31 transactions (sorry, the slide says 231 billion; it's 2^31, so that's a few orders of magnitude off) isn't actually that many. At 400 transactions a second, about 35 million transactions a day, you have to freeze at least every 60 days; at 800 a second, about 70 million a day, at least every 30 days. Two billion is a big number, but when you break it down like this, 400 transactions a second is really not that many, and neither is 800. I'm sure many people here have systems that peak well above that, and some sit at these rates all day, every day.

So if you find yourself in a position where you have to scan many terabytes of data every 30 days, that's tricky, right? Come back to those cost delay and cost limit settings and make sure you at least have a plan. Have a graph, have monitoring, have alerting that tells you: my oldest transaction is 900 million transactions old, I might need to do something about that. Or: my oldest transaction is 1.5 billion transactions old, and you probably want to do something about that too. As you monitor this stuff, make sure you give yourself enough time to react when you start to page people or alerts otherwise go critical. Also, 9.6 will help here: I think a month or so ago, changes were committed that let autovacuum skip significant portions of the table instead of having to scan the whole thing. So when 9.6 comes, thank Robert Haas, who I know is one of the people who worked on that, along with others. I'm sorry? Yes, there he is. So high-five him or something.

Okay. We can also manage TXID growth by running fewer transactions. We already do this in our code every day: BEGIN, do some stuff, COMMIT. Typically we talk about this as a way to logically group together things that should all succeed or all fail. But another way to look at it: these things are generally related, and maybe I don't care as much whether they all succeed or all fail, but I do know that if the first thing fails, I don't particularly care about the later operations. Wrap that stuff in a transaction and, instead of generating multiple database transactions (three: one for each insert and one for the update), you only generate one.

There's another case that warrants a little more thought, and many people probably have this. You do some database work, then you have some slow application code (intense CPU, an outbound network call, anything like that), and then you run another database transaction at the end. So you've consumed two TXIDs instead of one. I would typically say that wrapping all of that in one transaction is not a very good idea. But say it's not one to two seconds of work; say it's 50 or 100 or 200 milliseconds of CPU-intensive work. Maybe then you do want to wrap the whole thing in a single database transaction simply to use fewer TXIDs. Again, this is something we each have to figure out based on our applications. Another reason this can be dangerous is connection pooling. For example, with PgBouncer running in transaction pooling mode, as soon as you say BEGIN you check a connection out of the pool, and when you say COMMIT or ROLLBACK you put it back. So if you hold a transaction open across slow application code in transaction pooling mode, you can get to a point where all of your database connections are in use even though you're not actually waiting on the database.
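As a minimal sketch of the grouping idea, with hypothetical events and jobs tables; note that anything slow between BEGIN and COMMIT runs into exactly the pooling hazard above:

    -- Run separately, these three statements consume three TXIDs.
    -- Grouped into one transaction, they consume one.
    BEGIN;
    INSERT INTO events (kind) VALUES ('started');
    UPDATE jobs SET state = 'running' WHERE id = 42;
    INSERT INTO events (kind) VALUES ('dispatched');
    COMMIT;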
So like I said, I don't typically recommend this, but when you do find yourself generating a ton of TXIDs, maybe it is something you should consider. Say that in the slow application code you make five outbound network calls and log a small piece of data for each of them. You can consume one TXID for each piece of information you're logging, or you can consume one total, if you're willing to eat the overhead of BEGIN and COMMIT around the whole thing.

Next, I want to talk a little bit about replication. I have a little blurb from the documentation here; sorry to put a wall of text on the slide. Typically we operate our databases with synchronous_commit on. That means that when you commit a transaction, it is confirmed flushed to stable storage before the commit returns. If you combine this with synchronous_standby_names, though, which is a feature of streaming replication, you're also waiting on a remote fsync: an fsync on your synchronous standby servers. That turns out to be fairly non-performant sometimes. In particular, anything that writes a lot of WAL very quickly (vacuums, CREATE INDEX CONCURRENTLY) can bottleneck waiting on that remote fsync.

There are ways to deal with this, but again, these are things we have to ask about our own apps as they get bigger. Would I rather turn off synchronous commit, or just turn off synchronous standby names? You can set synchronous_commit to local, and in 9.2 and later you can set it to remote_write, which waits for a write acknowledgement from the standby but doesn't wait for a flush on the other side. I don't know; it's the old question of whether you'd rather be dead and 100% right, or alive and 99.9, potentially 99.999, percent right. These are hard questions to answer.

Oops, I didn't see that, sorry. Yeah, so the most meaningful trade-off with something like this is what happens if your primary crashes. You have high-availability infrastructure set up to promote your standby, which might be streaming and would typically be synchronous streaming, but you disabled that for some reason, and now you find you had, say, 10 milliseconds of data loss. Maybe that's okay for you; maybe it's not. You need to ask yourself whether it's all right, and you might monitor your receive lag to get a sense of what your normal exposure is.

A good feature to help with some of this would be a maintenance cost delay and a maintenance cost limit; that's not something Postgres has today. The idea being that CREATE INDEX CONCURRENTLY, instead of flooding the WAL and waiting on flushes on the other side, could be artificially slowed down so that it doesn't crowd out the real-time queries that are trying to do work. A broader QoS concept might be interesting as well, where you could turn down the priority of, say, OLAP-type queries so the more real-time stuff goes through faster, if you're forced to run both on the same system. I don't know what that would look like, though. I'm not sure.
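Coming back to the synchronous_commit options for a second, here's a minimal sketch of relaxing durability selectively rather than globally, plus a lag query to quantify your exposure (these are the 9.x function and column names; they were renamed in Postgres 10):

    -- Per-session: wait only for the primary's local WAL flush.
    SET synchronous_commit = local;

    -- Or scope it to a single WAL-heavy transaction with SET LOCAL.
    BEGIN;
    SET LOCAL synchronous_commit = local;
    -- ... bulk work here ...
    COMMIT;

    -- 9.2+: wait for the standby to receive the WAL, but not fsync it.
    SET synchronous_commit = remote_write;

    -- On the primary: how far behind is each standby, in bytes of WAL?
    SELECT application_name,
           pg_xlog_location_diff(pg_current_xlog_location(),
                                 replay_location) AS replay_lag_bytes
    FROM pg_stat_replication;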
Yeah, so there was a hypothesis that other types of queries besides vacuum would obey the vacuum cost delay, and it's thought not to be true, but hopefully it will be soon. Actually, if they did obey vacuum_cost_delay, you could set that per session, so maybe we wouldn't need a separate concept. But if so, then vacuum_cost_delay should probably be renamed to maintenance_cost_delay or something.

I'm trying to move a little quickly to get back on time, so just a couple of other quick things to touch on at the very end. Obviously we should have backups; that would be a good idea, right? But the last item here is really important: use caution when rolling your own incremental backups. You probably want incremental backups if your databases are large, because if only, say, 50% of your database changes every week, your next weekly backup can save a lot of space. But David Steele does a really good lightning talk, I don't know if he'll do it at this conference, where he demonstrates that rsync's timestamp comparison is only accurate to the second. So if you're not using a tool that's aware of this problem (one that takes a manifest before and after, et cetera), and you don't use rsync checksums, you can actually get data corruption just by using rsync to take your backup. Use a tool: Barman, pgBackRest, whatever you want, ZFS or SAN snapshots if you have fancy storage. Otherwise you can find yourself kind of screwed here.

And lastly, there are great tools out there to do a lot of this monitoring. I mentioned check_postgres before: if you have Nagios, or whatever you use to handle your monitoring, check_postgres will do index bloat, datfrozenxid age, and things like the number of archive-ready files, so if your WAL archiving fails unexpectedly, you get an alert that's a little more meaningful than "my pg_xlog volume is filling up." And I like collectd and Graphite, because I've been using them for a really long time. You want to look at things like disk space (particularly your growth rate), replication lag (Postgres gives you a wealth of information there), and the pg_stat views generally. The more of this stuff you can monitor for change, the better, because if you go look and ask, what's my lag right now, you really need to know what the lag was before. Do I normally sit at one millisecond and now I'm at 10? Do I normally sit at five and now I'm at 100? The change is really what's important to look at. These are some of the tools I've found really, really helpful, and they're generally very easy to use.

So, quick conclusions. Postgres is a really great foundation; I've certainly used it for a long time, like I said. Rommel said in his keynote at the last PGConf, with this really amazing pasta analogy, that pasta is a sauce delivery platform; well, Postgres is an application delivery platform. I think that's what it is. And so, to deliver our applications as best we can, we use Postgres and we make trade-offs. There's no silver bullet. Sometimes we say: I'd rather have more complexity and more performance alongside it.
Sometimes you say: I don't care about this thing taking five extra milliseconds, because developers can understand it better. So there's no perfect solution; it's always going to be challenging. And these are just some of the things I've found I've had to think about over the last several years. So that is all I have. Any questions, going once, going twice?

Yeah, probably balancing autovacuum speed. You can't be too aggressive, because then all of a sudden your real-time stuff starts to fail; but if you're not aggressive enough, then the age of your datfrozenxid slowly creeps up. And it is slow; you're not going to suddenly find yourself at 1.5 billion. You'll say, hey, I'm going up by a couple million every few days, and it just sort of slowly climbs. So you have to look at it on really long timescales to see some of that. But then, when you do, that's a thing you can tune.

So in a previous life, I worked on a database where we had max_connections set very unreasonably high. It was 8.4, and I think somebody there thought it was a good idea to set it to several hundred, like eight or nine hundred. That didn't work out super well. But yeah, I think I mentioned transaction pooling: in many applications, you're doing a lot of work that isn't talking to the database, so if you use a connection pooler local to your node, you can overcommit your application threads or workers on top of something like PgBouncer or a JDBC connection pool. So you're definitely incentivized to keep the number of active connections far down. And generally speaking, I think it's better to have somebody block waiting for a connection than to overload the database with too many connections.

That's sort of situational. You can run the pooler on your application servers: if you have a hundred application servers, maybe you only permit a small number of database connections per server, and however many worker threads you run per server are gated right there. You can obviously pool a lot more efficiently if you run it on the database host, but then you have this other problem, which is that PgBouncer is evented and it's not free; it does have overhead. So if you're running thousands and thousands of queries per second, PgBouncer will give you meaningful overhead, and you might want to spread that out. One more, or are we good? Okay, great. All right, thank you very much.