I think I'm going to go ahead and introduce the next speaker. If you've been to the PostgreSQL devroom in Brussels before, you have seen one of his talks. It doesn't matter, because he's always talking about different stuff. I've been to five or so talks from Christophe before and I always learn something new, so I'm really happy to be hosting this session. He is one of the vi users. As you know, VI is the Roman numeral for six, and this is actually his sixth presentation in Brussels. So please welcome Christophe Pettus.

Thank you very much. I don't want to get into a vi versus Emacs war here. So let's see. Yes, I'm Christophe Pettus. I'm the CEO of PostgreSQL Experts. We're a small PostgreSQL consultancy in, from a European perspective, San Francisco; we're actually in Alameda, across the Bay. There's my email address, and my Twitter handle.

So, just a quick overview of what we're going to talk about here. Postgres can handle a database of any size. The largest community-edition Postgres database that I personally worked on was multiple petabytes. So when people say, well, I don't know if Postgres can handle that: yeah, it can handle it, it's fine. But how you use Postgres operationally changes as the database grows, and that's what we're going to talk about, because what works for a one gigabyte database does not work for a 10 terabyte database or a multi-petabyte database. So let's talk about it. The number one question I always get is where the slides are going to be: they'll be at my personal website, which is thebuild.com.

The structure of this presentation is around database sizes, and how Postgres changes as the database size grows. So we're going to start with a 10 gigabyte database, which is your first database. Up to 10 gigabytes is a nice, small, really manageable database. It's hard to do anything wrong at 10 gigabytes. You can pretty much get away with anything. Pretty much anything will run fast, even if you do sequential scans for all your queries all the time; it'll zoom right along. Even these pathological joins where you have multiple join arms and this and that and all this stuff. Unless you're doing a cross join across the entire database, so you're doing n squared on a few million rows, everything's going to run fast. The stock postgresql.conf will work fine. You don't actually have to change anything in it. You may want to, but you don't have to.

So how much memory do you need? One of the first questions I get asked as a consultant is, so how much memory does a Postgres database need? I mean, we have this 10 gigabyte database and we're a little concerned about performance. And I try to make sympathetic noises on the phone. If you can't fit your 10 gigabyte database in memory, reconsider the life choices that got you to this point. I mean, 10 gigabytes: I have a flash drive here that will handle that. Even little micro instances on cloud hosting services will handle this just fine. The entire database can probably fit in memory, and probably should. And in fact, one of the biggest problems people run into is that everything works just great at this size. And then as the database gets bigger and bigger and bigger, things start falling apart, and they wonder what horrible thing they did wrong that the universe is punishing them for by having their database fall apart. And the answer is nothing. It just got bigger.
But even sequential scans will just zip right along on a 10 gigabyte database. So how do you back up a 10 gigabyte database? Backups are important; we all want to have backups. And the answer is: just use pg_dump. This is not the point where you need to start thinking really, really hard about your detailed backup strategy. Just run pg_dump from the command line, or from a cron job every six hours, whatever. It takes 90 seconds on this laptop to back up a 5 gigabyte database. pg_dump is fine at this size. So just do it. The important thing is to do the backup, because you probably don't want to type a 5 gigabyte database back into your system. You don't need anything more sophisticated at this point. Just let it happen. Stick the backup files in cloud storage; use S3 or Backblaze B2. I like B2 a lot. And you're done. Problem solved.

So, high availability on a database of this size: you have a primary and a secondary. You can use either direct streaming, just standard streaming replication, or WAL archiving, depending on how the two systems are connected. Maybe at this point, just do manual failover. Don't worry about super sophisticated failover. One thing to note is that throughout this talk, I'm using database size as a proxy for a lot of things, like how many reads you're doing, what your write activity is like, and how many customers you have. Now, you can reasonably object: well, I have a small database, but it's super mission critical and really important, and I don't want to carry a pager. I don't like to carry a pager. So I need something more sophisticated. And the answer is: read ahead in the talk to that point, and start adopting those things earlier, because everybody's operational situation is just a little bit unique. That's OK.

So, tuning. You say, well, you said I could just use the stock postgresql.conf, but I want to do something a little more interesting than that. OK, if you must. The first thing is the usual memory-related parameters. And I will say, if your database fits entirely in memory, I do like to tune some query parameters specifically for all-in-memory databases. But at this stage, just keep it simple. Don't go crazy with it. Pop open postgresql.conf and there are 300-and-something, I forget how many, parameters, and you think, oh my god, we're all going to die if I don't set every one of those to a value that took me hours of agony to arrive at. And the answer is no; there are like four that you need to tune at this point. I like these settings for databases that can fit entirely in memory, and that's also true even if it's a giant database, as long as you have a giant server to run it on. We can talk endlessly about why this is exactly so, but this works out for me: shared_buffers to 25% of system memory, work_mem to 16 megabytes. Why 16 megabytes? Why not? And maintenance_work_mem to 128 megabytes. These are basically you're-probably-not-going-to-run-yourself-out-of-memory-unexpectedly settings.

Let's see, log destination and so on. One of the few things that's not great, in my opinion, about the stock Postgres configuration is the logging. The stock Postgres configuration is fairly terse in its logging, and there's a lot of really useful information at this stage that you should just get. It's a small database; you're not going to choke yourself with log volume. So turn on CSV logging, unless of course you're on RDS, where it won't let you. And I just like to keep daily logs.
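To make that concrete, here is a minimal sketch of those few memory settings plus the logging settings, written as ALTER SYSTEM statements; the values assume roughly 8 GB of RAM and are purely illustrative, not a recommendation for your hardware.

    -- Rough starting point for a small (~10 GB) database on a machine with ~8 GB of RAM.
    -- The same lines can go directly into postgresql.conf instead.
    ALTER SYSTEM SET shared_buffers = '2GB';               -- about 25% of system memory
    ALTER SYSTEM SET work_mem = '16MB';
    ALTER SYSTEM SET maintenance_work_mem = '128MB';

    -- Logging: CSV logs, kept daily, with the events worth capturing at this size.
    ALTER SYSTEM SET logging_collector = on;
    ALTER SYSTEM SET log_destination = 'csvlog';
    ALTER SYSTEM SET log_filename = 'postgresql-%a.log';   -- one file per day of the week
    ALTER SYSTEM SET log_min_duration_statement = '250ms';
    ALTER SYSTEM SET log_checkpoints = on;
    ALTER SYSTEM SET log_connections = on;
    ALTER SYSTEM SET log_disconnections = on;
    ALTER SYSTEM SET log_lock_waits = on;
    ALTER SYSTEM SET log_temp_files = 0;                    -- log every temp file created
    ALTER SYSTEM SET log_autovacuum_min_duration = 0;       -- log every autovacuum run

    -- shared_buffers and logging_collector need a restart; the rest only need a reload.
    SELECT pg_reload_conf();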
log_min_duration_statement of 250 milliseconds is kind of a standard good number for OLTP web front end systems. Set it appropriately. If you can set it to zero and log every single query without choking yourself, great, do that, because you want that information. But you can play with this number until you get the log volume you like. Do log all checkpoints, all connections, and all disconnections, unless you have a really, really noisy connect-in-and-out kind of workload and that creates a huge log volume. But you should probably fix that problem rather than turn the logging off; otherwise you're just sticking your fingers in your ears and saying, la, la, la, I don't have a connection problem. Lock waits: again, if your logs are being choked by a huge volume of lock waits, fix the lock waits, don't turn off the logging. And the same for temp files and autovacuum. So just cut and paste this.

You want to upgrade. Minor version upgrades on Postgres are: you shut it down, you install the new binaries, you bring it back up, you do any maintenance activities the release notes call for, and you go home. That's fine. But for a major version upgrade, you have to do something a little more sophisticated. On a database this size, just do a dump and restore. You'll spend more time fooling around with almost anything else. And you're done. But do it. It is easy to push off major version upgrades because they seem like a pain in the neck on Postgres. The problem is, the longer you wait, the bigger the pain will get. So just do it. Accept the fact that you have to do major version upgrades on Postgres and proceed in this way. And just get into the habit of planning your upgrade strategy. Don't think, well, we'll wait some time. Just don't do that.

OK. So you've achieved great success, and now you have a 100 gigabyte database. Oh, that's nice. That's not huge; that's not what I think of as huge. But it's starting to get bigger than will fit into memory on most standard instances. You can certainly get instances where it'll fit into memory completely, but this order of magnitude is where you start exceeding that. And your queries might start getting a little bit wonky at this point. And pg_dump backups take too long to take or restore. So using pg_dump as a backup strategy is starting to become less attractive, especially on the restore side, because if the system fails, you're going to spend maybe a couple of hours restoring, depending on the overall throughput of the system. And that's not fun.

So how much memory? How much memory does a Postgres database need? This is the question people ask me, and the answer is: I don't know. It depends, and it depends on a lot of different factors. So I'm going to give you a rule that's completely unjustifiable, but it seems to work out. If you can fit the whole thing in memory, great. Obviously, you never think, wow, I just gave that Postgres database too much memory and performance is horrible now. A good rule of thumb is: can you fit the largest one to three indexes into memory? There's a Postgres tuning parameter, effective_cache_size. It's a hint to the planner; it's not a memory allocation. Postgres never goes out and tries to allocate effective_cache_size worth of memory. It is an estimate of the total amount of memory available to Postgres as a cache. It includes the shared_buffers setting, and it includes the file system cache on the machine.
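A quick sketch of how you might check that rule of thumb; the effective_cache_size value shown is only an example for a hypothetical 64 GB machine.

    -- The three largest indexes: would they fit in memory?
    SELECT c.relname AS index_name,
           pg_size_pretty(pg_relation_size(c.oid)) AS index_size
    FROM pg_class c
    WHERE c.relkind = 'i'
    ORDER BY pg_relation_size(c.oid) DESC
    LIMIT 3;

    -- effective_cache_size is a planner hint, not an allocation: an honest estimate of
    -- shared_buffers plus the filesystem cache.  For example, on a 64 GB machine:
    ALTER SYSTEM SET effective_cache_size = '48GB';
    SELECT pg_reload_conf();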
Generally, you want to get an instance where you can set that number, being honest, to larger than the largest index. The planner certainly likes to see that. If you can't, more memory is always better. But just remember: if your problem is write performance, all the memory in the world will not help. Write performance is not assisted by throwing more memory at the box.

OK, well, we wave farewell to our good friend pg_dump, because it's just not very fast anymore. So it's time for PITR backups, point-in-time recovery backups. As a quick overview for people who are not familiar with them: in a PITR backup, you take a file-system-level copy of the whole database once in a while, which can be anywhere from daily to weekly to however long, and then you save all of the write-ahead log segments that are generated from when that file system copy started. Those two things together are your backup. You still have to copy the whole database somewhere, so you are in fact copying everything, but you don't have to do it as frequently. And you can recover right to the end of the last write-ahead log segment that got saved. So the advantages are, first of all, you don't have to do one of these giant copies nearly as often, and you can recover to the current point. For doing this, I like pgBackRest. It's mostly written by Crunchy Data. It has a lot of nice features. WAL-E kind of revolutionized this whole thing by making it easy to do this into cloud storage. You can roll your own; the number of steps is manageable, and people did it for years. But at this point, don't bother rolling your own unless you have really specialized needs, because these tools have a lot of packaged features that will help you.

So again, a PITR backup takes an entire copy of the file system, plus archives the WAL segments that are generated. The more often you do the file system copies, the faster your restore is, because it has to process fewer of these write-ahead log segments on restore. But it means you have to do that large copy more often. If it's 100 gigabytes or 500 gigabytes, that's probably acceptable, but it does take a while. The other benefits are, as I said, you can restore to a particular point in time, and you can also use this to prime secondary instances for streaming replication. So you get an advantage out of that.

Generally, at this point, you want to let go of those old tuning parameters. Generally, seq_page_cost goes back to 0.5 to 1, and random_page_cost is more in the 1.1 to 2 range, depending on your underlying storage. shared_buffers you don't have to change, but you can throw more maintenance memory at the problem. maintenance_work_mem is the amount of memory that's available for things like index creation and vacuuming. At this point, you probably want to get a little more analytic about how you set work_mem, rather than just saying, ah, 16 megabytes, we're done. Because you were clever and turned on temp file logging, you can see if there are temporary files being created by queries. If there are, you know that you need more work_mem and can bump it up. Setting it to two or three times the largest temp file generally gets rid of them. Now, of course, what if it's saying, well, I needed an 18 gigabyte temporary file? Then find and fix that query. Do an EXPLAIN ANALYZE on it. Find the sort node that's creating that giant file.
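A rough sketch of that workflow, assuming temp file logging is already on; the 256 MB value is purely illustrative.

    -- How much temp-file spill has each database generated since stats were last reset?
    SELECT datname,
           temp_files,
           pg_size_pretty(temp_bytes) AS temp_spilled
    FROM pg_stat_database
    ORDER BY temp_bytes DESC;

    -- Sizing work_mem to roughly two or three times the largest routine temp file usually
    -- makes the spills go away; it can be raised per session (or per role) rather than globally.
    SET work_mem = '256MB';

    -- For the stubborn ones, EXPLAIN (ANALYZE, BUFFERS) on the offending query will show a
    -- sort or hash node reporting something like "Sort Method: external merge  Disk: ...".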
Now, hang on a moment, I am depriving the system of its notification privileges while I'm talking. There we go. That's enough of that. Now, sometimes you just accept this. If it's a big analytic query that doesn't run very often, maybe it's OK to let it burn up a lot of temp space, or start thinking about more memory. If this is a data analytic system, you probably want to have more than 16 gigabytes of main memory to do your giant queries.

Also at this point, maybe your read capacity is getting to the point that you don't want all the reads directed at the primary node that's taking all the writes, too. So consider moving read traffic to streaming secondaries. Be aware that replication lag is non-zero. This is actually something that is worth thinking about very early in the design of your application, because if your application relies heavily on read-after-write being instantly available, your options are to accept that the read may not pick up the data that was just written, or to move to synchronous replication, which you probably don't want to do, because that is a huge performance sink. You usually only want to use synchronous replication for data safety, not so that your reads after writes work correctly. If you can, it's nice to have the app stack know whether a given operation is a read or a write and direct it properly. Some frameworks, like Django, make this fairly straightforward to do. If not, you can deploy Pgpool-II, which does this in the open source world; it handles this redirection. It's kind of quirky, so be aware that you're going to invest some time into setting it up properly.

One other thing: this is also the point where you may want to start thinking about connection pooling. A lot of app stacks, especially those that are container deployed, fire up 5,000 containers, each one of which opens 10 connections to the database, and so you're starting to set max_connections to 7,000, and of these, eight are active at any one time. This is very, very common. That is an actual real-world thing I've seen: 5,000 open connections, of which the peak was 16 active. You probably want to look at PgBouncer for those situations.

This is also about the time you want to get real about monitoring this database. You don't want your monitoring to be somebody mentioning you on Twitter, saying your site's down. You know, that works pretty well these days; I have, in fact, learned sites were down from Twitter before my monitoring fired. So, I'm cheating here: processing logs through pgBadger is not really monitoring. I think monitoring is a real-time thing. But I didn't want this talk to go on forever. So at a minimum, this is the point where you should think about processing logs through pgBadger. pgBadger is a Perl tool that ingests logs and produces nice HTML, very management-friendly reports. Install pg_stat_statements, which gives you more real-time statistics about query performance. And if you don't mind picking up an external tool, pganalyze is a kind of cool hosted tool that samples pg_stat_activity and other things and produces a web-based front end. And there are the usual suspects: New Relic, Datadog, VividCortex, and so on. These all have Postgres plugins; use your favorites. You want basic health: is the database up? Are system resources being consumed?
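For example, once pg_stat_statements is loaded (it has to be added to shared_preload_libraries first), a query along these lines shows where the time is actually going; note that newer Postgres versions rename total_time to total_exec_time, so treat this as a sketch.

    CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

    -- The ten statements consuming the most total execution time.
    SELECT round(total_time::numeric, 1)           AS total_ms,
           calls,
           round((total_time / calls)::numeric, 2) AS mean_ms,
           left(query, 60)                         AS query
    FROM pg_stat_statements
    ORDER BY total_time DESC
    LIMIT 10;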
Are you pegging IO or CPU? That kind of stuff. Check these for slow queries, assuming you aren't seeing that directly in your app metrics, like slow response times.

This is the point where you're going to start really missing indexes. As a philosophy, I don't like to pre-create a lot of indexes unless they're necessary to enforce constraints. Because especially if you're using ORMs or other kinds of front end tools to generate schemas, it's really easy to just say db_index=True on everything, like, well, I might query on that at some point. Don't do that. Indexes are not free. They take up disk space, they add a significant amount of insert time, and they add to planning time, although usually that's not the big bottleneck. So add indexes in response to actual query patterns, not just because you think they might be useful eventually. There are some indexes that you'll always need to enforce constraints, like primary key indexes and things like that. But don't create other indexes just because. Don't just go slapping indexes on everything. For example, adding seven indexes to a table will slow down insert time by about a factor of 15. So you may not want to do that unless you really need them.

You probably also want to get a little more serious about high availability at this point, because just getting a page by text isn't going to be fun anymore. This is the point where you start looking at some kind of tooling for failover. Pgpool-II does have tooling for failover. You will spend a fair amount of time getting it right, but it does work. Patroni is pretty good, especially if you're in a cloud container environment for doing this kind of thing. It uses HAProxy as its front end. One downside of Patroni versus Pgpool-II is that you have to separate the read and write traffic yourself with Patroni; Pgpool-II does that automatically. And of course, there are hosted solutions like RDS and those guys that handle failover for you. Honestly, the failover and high availability situation is probably the big unique selling point of RDS and these kinds of things, because getting this right on the community edition is not perfect these days; it requires some work. On RDS, you just drop it in and it works. But be prepared to pay a really big RDS tax on your monthly bill.

OK, upgrades. Well, we couldn't do pg_dump and pg_restore for backups, so maybe we don't want to do that for upgrades anymore either. So now we use pg_upgrade. I never remember if there's an underscore in it or not. Which there is, thank you; there's an underscore, I'm assured. So, it's a nice tool. It ships as part of the distribution. It can make a copy and upgrade the copy. No one ever does that; they always use it in link mode, which does what amounts to an in-place upgrade by playing with hard links. Very nice, very reliable. The only real caveat is, especially if you're doing large jumps, extensions can be a problem. If it's the extensions that come in the contrib directory, you're fine, but with some third-party extensions you need to be a little bit careful. PostGIS is the biggest example of this, because if you're doing a big range upgrade, jumping from like 9.2 to 10 or 11, you have to be careful to make sure the PostGIS versions are compatible. There's no reason you can't do it, although for the amount of time I've spent fiddling with the command line to get it exactly right, I could have done a pg_dump and pg_restore and gone to lunch. But it works just fine.
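Going back to the point about adding indexes in response to actual query patterns: a rough way to spot candidates is to look for large tables that are being sequentially scanned a lot, and then check which queries are hitting them in pg_stat_statements or the logs. A sketch:

    -- Tables with lots of sequential scans relative to index scans; big tables near the
    -- top of this list are candidates for a new index, driven by the actual queries.
    SELECT relname,
           seq_scan,
           idx_scan,
           n_live_tup,
           pg_size_pretty(pg_relation_size(relid)) AS table_size
    FROM pg_stat_user_tables
    WHERE seq_scan > 0
    ORDER BY seq_scan DESC
    LIMIT 20;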
So your database has grown to 1 terabyte. You've added an order of magnitude. This is where a lot of discontinuities happen. Just observationally, the database starts feeling different at 1 terabyte. Now you have a real database. You just can't get enough memory anymore; you're not going to just throw memory at it. If you want to buy one of these 16 terabyte machines, and I've seen them, with all the NVMe cards lined up, that's great. Most people don't have a quarter of a million dollars to drop on a single machine. So you have to start making some choices. The queries are starting to fall apart more regularly. Queries that previously performed well suddenly go crazy and run very slowly. This is the point where you may start running out of read capacity on a single machine, where you're starting to peg IO. And doing a full PITR backup is taking a long time, because you're having to copy an entire terabyte worth of data into cloud storage.

So get as much memory as you can afford. More memory will be better. If you're running a data warehouse kind of situation, you will need more memory to get acceptable performance, because of the kind of queries you do, than if you're running a transactional OLTP-type database. So if management says, well, it's OK, we want to consolidate all of our 23 transactional systems into this one giant data warehouse, just prepare them for how big the check is that they're going to write at that point. I wonder for how many more generations the phrase about writing a check will even make sense. And IO throughput becomes much more important here, because the working set may very well not fit in memory anymore, so it's going to have to go out to secondary storage a lot more. This is a place where you might consider moving to fast local storage from slower SAN-based storage, especially if you're on Amazon. How many people are running Postgres on Amazon right now? Well, that has gone way up in Europe. I remember when I first started asking that question in Europe, one hand went up. First of all, EBS is not very fast. Second of all, even with provisioned IOPS, it's not as predictable as one could hope. And you pay for every IOP on EBS, and that can get very expensive. So you might start considering instances that use primarily local storage, because that is really super fast compared to EBS.

So, backups. At this point, incremental backups are getting more important. pgBackRest does incremental backups out of the box, on a file-level basis. So if you are routinely touching a few rows on every page across every table, you're still going to be copying fairly large 1 gigabyte segments, but it's still better than it was. Again, you can roll your own with rsync, but this is extra for experts. Be careful, because it's very easy to get this wrong when backing up a Postgres database using rsync.

At this point, you probably want to bump up the checkpoint settings and the checkpoint timeout, because with the amount of data that's being changed on a regular basis, these checkpoints are going to start getting pretty big. And when Postgres does a checkpoint, which is the periodic point where it flushes all of the dirty buffers out to disk, that is the high-IO point for Postgres. So it's very helpful to tune it a bit. Don't bump up shared_buffers. Now, this is a very controversial point. Everybody has a very strong opinion about what shared_buffers should be set to.
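As a sketch of what bumping the checkpoint settings might look like in practice; the exact values depend entirely on your write rate and recovery-time goals, so treat these as placeholders.

    -- Illustrative checkpoint settings for a write-heavy database around this size.
    ALTER SYSTEM SET max_wal_size = '16GB';                -- allow more WAL between checkpoints
    ALTER SYSTEM SET checkpoint_timeout = '30min';
    ALTER SYSTEM SET checkpoint_completion_target = 0.9;   -- spread the checkpoint I/O out
    SELECT pg_reload_conf();

    -- With log_checkpoints on, the logs show how each checkpoint went; this view shows
    -- whether checkpoints are being forced by WAL volume rather than by the timeout.
    SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter;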
I have never measured a significant improvement on any system, on any workload, above 32 gigabytes. So that's my experience. Every time I say this, people who are much more experienced with Postgres than I am push back. But as soon as I see numbers that show a performance improvement past that point, I will change this slide in a heartbeat. My experience has been that increasing it slows down checkpoint performance but doesn't increase query performance significantly.

maintenance_work_mem is an interesting one. This has crossed my field of vision a lot lately. There's a tendency to say, well, we have this giant system, in this case it was 512 gigabytes of main memory, and we have some big tables, so we'll set maintenance_work_mem to a really big number; I think it was set to 20 gigabytes. And autovacuum was, like, never finishing. Interestingly enough, we then turned it back down to 512 megabytes, and autovacuum finished much faster. The reason, and this is counterintuitive, is that really high values of maintenance_work_mem do not improve performance significantly compared to more modest ones, because it's filling that memory and spilling it on a regular basis. So again, purely experientially, if most of your indexes are larger than a couple of gigabytes, it actually improves performance to decrease it to something like 256 to 512 megabytes, so the cycles of autovacuum are faster.

Read replicas at this size become very, very important. And at this point, operationally, it's very helpful to distinguish between the failover candidate, which is a designated instance that stays very close to the primary, doesn't take queries, and whose only job is to fail over and take over from the primary, and the read replicas. Because with streaming replication, there is a trade-off between the query load a secondary accepts and how close it stays to the primary. On smaller databases, having one instance that does both is probably OK. But at this point, your read load is probably getting to the point that the read secondaries will experience enough delay that you don't want to lose that data in the case of a primary failure. And this is also the point where you probably want to switch to some kind of configuration-as-code, whatever your favorite hotness is, for spinning up and tearing down these secondaries, so you can spin up secondaries without having to go through a lot of manual steps. Ideally, you just push a button and, bang, you get a new secondary after a while; it won't be instant. You don't have to do this; you can configure these manually. If you do this earlier, you'll have saved yourself trouble, but this is the point where it becomes important.

This is also the point where, for a lot of sites, you might want to start offloading services. You move analytic queries off of the primary database so that you're not crunching the primary database. This is often the point where you might consider doing logical replicas for data warehousing and for analytic queries, so you have a system that's specifically intended for that. This is also the point where maybe you want to start moving things like job queues and other low-retention items out of the primary database and onto dedicated systems. Use Redis or something like that for these, rather than keeping them in the database, to reduce the amount of load the database is taking.

And this is usually the point where people start having problems with vacuum. Vacuum can start taking a long time here on big tables.
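A sketch of how you might keep an eye on that split between the failover candidate and the read replicas, assuming PostgreSQL 10 or later naming (on 9.6 the functions and columns are xlog-named instead):

    -- On the primary: how far behind each standby is, in bytes of WAL.
    SELECT application_name,
           client_addr,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag
    FROM pg_stat_replication;

    -- On a replica: how stale the data being served to read queries is.
    SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;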
The first thing is: don't increase the number of autovacuum workers unless your schema is big, because each worker can only work on one table at a time. So even if the tables are big, throwing more workers at them isn't going to solve the problem. Now, if for example you have a system that does client-based sharding and you're creating a new schema for every tenant, and so you end up with 25,000 tables, then yeah, for sure, increase the autovacuum workers.

Let vacuum jobs complete. I'm not sure my company would be in business if people would just let their autovacuum jobs complete. Because we'll come in and they'll say, oh my god, we're getting wraparound warnings. And we say, well, OK, what's happening? Well, these autovacuum jobs keep coming up, and they keep killing them. Well, you just paid us several thousand dollars to tell you not to do that. Happy to help. So the number one thing is that these jobs will take a long time, especially the very first autovacuum to prevent XID wraparound; everyone's seen this in pg_stat_activity. The very first one of those jobs on a large table will take a long time, because it's having to pick up and shake every single page, and probably write it back down. The subsequent ones, assuming you're on 9.6 or higher, will take much less time. So just plan for this. Understand that these will have to run.

Be careful with long-running transactions. If you're using two-phase commit, stop right now. Don't do that. It's bad. Well, all right: if you're using two-phase commit and you don't have an external transaction manager that was written specifically to handle the situation, don't do that. Because these can persist in the database for a long time and really screw up your vacuuming. And there are other things like manual table locking. Another reason I like to encourage people to move queues and things like that out is that very frequently the code written for them locks tables manually and really screws up vacuum, especially on job queues, which really need vacuuming a lot because there's a lot of insert and delete activity.

Don't turn off autovacuum, ever. Just, full stop, always leave autovacuum on. But it is possible to create workloads where autovacuum has trouble keeping up. Very, very high update rate tables, for example, or again, job queues with lots of inserts and lots of deletes. For those, it sometimes makes sense, especially at this scale of database, to start manually vacuuming those tables. But don't do that in lieu of autovacuum; do that in addition to autovacuum. If you go to the pgexperts GitHub, we have a script that does opportunistic freezing of big tables. You might take a look at that. My favorite first dial for adjusting autovacuum is the autovacuum cost delay. If it's taking too long, you can consider making it more aggressive. So you make it really super aggressive, and then discover it's destroying your system, and you say, oops, and then you make it less aggressive by turning the dial the other way. But that's the first dial to turn. There's a lot more to be said about tuning vacuum, but this is kind of a breezy overview talk. Again, let it run. You can get yourself into real trouble if you don't.
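A hedged sketch of that approach, using a hypothetical job_queue table; the thresholds are illustrative, and the manual VACUUM is in addition to autovacuum, never a replacement for it.

    -- A hypothetical high-churn queue table, for illustration.
    CREATE TABLE IF NOT EXISTS job_queue (
        id      bigserial PRIMARY KEY,
        payload text,
        done    boolean NOT NULL DEFAULT false
    );

    -- Make autovacuum visit this table far more eagerly than the default
    -- (the default scale factor waits for roughly 20% of the rows to change).
    ALTER TABLE job_queue SET (autovacuum_vacuum_scale_factor = 0.01,
                               autovacuum_analyze_scale_factor = 0.01);

    -- And/or vacuum it from a scheduled job of your own.
    VACUUM (VERBOSE, ANALYZE) job_queue;

    -- The first global dial to turn: the autovacuum cost delay (lower = more aggressive).
    ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '2ms';
    SELECT pg_reload_conf();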
Indexes. The indexes are getting really huge now. This is a good point to consider partial indexes for specific queries. A very common case: you have an e-commerce system. There are orders. You have a bazillion orders, of which 1% are actually active at any one time. Throw a Boolean active flag on the table and create partial indexes on that, so when you're doing queries, you only have to consider the rows with active equal to true. This is also a good place to go back and analyze which indexes are really being used, and drop the ones that aren't. pg_stat_user_indexes is a very good friend here, because it'll show you how many times indexes are actually being used for queries. And it's often very illustrative to go back and see that there are lots and lots of indexes, none of which are actually being used.

Queries can start becoming really problematic at this size. Even the best queries can take a long time to run against a much larger data set. And one big discontinuity is a query that used to do an index scan suddenly doing a bitmap index scan and bitmap heap scan and taking much, much longer, because suddenly the number of rows that will be coming back crosses a threshold, a mysterious internal planner threshold, and it thinks, well, this will be faster. And maybe it's right, but the overall query gets much slower.

This is about the time you might want to consider partitioning tables. Look for tables that can benefit from partitioning. For example, time series data, where things are being accumulated, like events, sensor events, things like that, if you're not using a dedicated time-series type database, so that these giant multi-billion row tables can be divided into much, much smaller, more manageable chunks. If you're on version 10 or greater, by all means use the new partitioning mechanism. It is so much nicer than the old-style partitioning. The big rule on partitioning is making sure the data has a strong partitioning key, which is a relatively invariant key (it doesn't have to be the primary key, and usually isn't) that is set when the row is created, almost never changes, and is used as either an equality or a small range on essentially every query. That's what makes a good partitioning key.

If you're not using parallel query execution now, this is a good time to turn it on. Bump up the number of query workers and the per-query parallelism. If you're handling large result sets with large sequential scans and similar things, this can be really, really powerful; you can get a lot of benefit out of it. Just make sure the IO capacity can keep up, because you're going to have multiple workers, all of them doing a lot of IO.

This is also a place to start thinking about adjusting the statistics target on particular tables. One of the reasons these queries can start falling apart is that the visibility the planner has into the data can start becoming obscured, because as these tables get much, much bigger, you're still only dividing the statistics on them into 100 buckets, and you're starting to lose resolution. This is very common if you have foreign keys on long values like UUIDs or strings. So look for queries where a highly specific query is planned to return a large number of rows. For example, you're querying against a field that's a UUID. It's not a unique field, but you're querying a specific UUID, and you know that only seven or eight rows will come back. But the planner thinks, I don't know, I think 25,000 rows will come back. That's usually because the statistics on that field need to be bumped up. Don't go crazy, though.
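To make the partial-index and partitioning points above concrete, a minimal sketch with hypothetical orders and sensor_events tables, using PostgreSQL 10+ declarative partitioning:

    -- Hypothetical orders table where only ~1% of rows are active at any one time.
    CREATE TABLE orders (
        id          bigserial PRIMARY KEY,
        customer_id bigint      NOT NULL,
        active      boolean     NOT NULL DEFAULT true,
        created_at  timestamptz NOT NULL DEFAULT now()
    );

    -- Partial index: only active orders are indexed, so it stays small and hot.
    CREATE INDEX orders_active_customer_idx ON orders (customer_id) WHERE active;

    -- Declarative partitioning for time-series style data: the partition key is set at
    -- insert, essentially never changes, and appears as a range condition on every query.
    CREATE TABLE sensor_events (
        sensor_id   bigint      NOT NULL,
        recorded_at timestamptz NOT NULL,
        reading     double precision
    ) PARTITION BY RANGE (recorded_at);

    CREATE TABLE sensor_events_2019_02 PARTITION OF sensor_events
        FOR VALUES FROM ('2019-02-01') TO ('2019-03-01');

And bumping the statistics on one specific column, rather than globally, might look like this sketch:

    -- More statistics buckets for a column the planner chronically misestimates.
    ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;
    ANALYZE orders;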
And on the statistics point: don't set your default statistics target to 10,000 in postgresql.conf, because we will all be dead before your ANALYZE completes.

Also consider alternative indexes. Some fields are really not good matches for B-tree indexes, like long strings and range types and things like that. Use indexes that are appropriate for the type. For example, if you're storing a URL, which has lots of entropy at the end and not a whole lot at the beginning, hash indexes can be really good for that. Now, you don't get range queries with a hash index, but why are you doing a range query on a URL? What kind of query is that?

Upgrading a really big database: pg_upgrade, as long as you put in the underscore, still works fine, because the time pg_upgrade takes is proportional to the number of database objects, not the database size, if you're running in link mode. Just be prepared that if you have one of these 25,000-table schemas, it will still take a while. If that's not acceptable, consider a logical-replication-based upgrade, where you fire up a new instance, do logical replication to it, and then do a quick cutover at the end. Just be sure to plan for major upgrades. Don't be like one of our clients, which has a one petabyte database still on 8.3, because they're trying to figure out how to do the upgrade.

Ooh, 10 terabytes. Big. You have bragging rights now. This is a good database. You'll have to make some hard decisions, though. Backups: at this point, anything that involves copying the entire database into a cloud storage system is getting a little impractical. You might consider moving to file system snapshots, like ZFS or SAN-based snapshots. You still have to copy the data somewhere, because obviously, if you leave it on the same spinning disks, that's going to be problematic. But at least you don't have to do it right away; you can space it out a little bit.

Just a note: tablespaces are a pain. Please don't use them. All right, use them if you have to. I can always tell when a Postgres database was set up by an Oracle DBA, because there are like 23 tablespaces all pointing at the same spinning disk, because on Oracle, people just told you: do that, don't ask why, just do it. And there's almost no reason to set up tablespaces in Postgres now. In the old days you would have SSDs for the indexes and spinning disks for the main storage and stuff like that, but generally everything is running on some SAN where you have no control over the performance, so what the hell. There are some specific reasons: fast local storage for some of the data, reaching the limits of a single volume, things like that. But just understand that they'll complicate backups and replication.

Index bloat can be a significant problem at this size, because it's harder to reclaim space from an index than from the heap. So sometimes you want to write scripts that will recreate an index and then drop the old one to compact them. But do this on the basis of real analytics; see how bloated your indexes are actually getting. A badly bloated index can also change query plans.
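A sketch of that rebuild dance, with hypothetical index names; pgstattuple (a contrib extension) gives the real bloat numbers to justify it, and on PostgreSQL 12 and later REINDEX CONCURRENTLY does the same thing in one step.

    -- Check how bloated an index actually is before rebuilding it.
    CREATE EXTENSION IF NOT EXISTS pgstattuple;
    SELECT avg_leaf_density, leaf_fragmentation
    FROM pgstatindex('orders_active_customer_idx');

    -- Rebuild without blocking writes: build a new copy, drop the old one, rename.
    CREATE INDEX CONCURRENTLY orders_active_customer_idx_new
        ON orders (customer_id) WHERE active;
    DROP INDEX CONCURRENTLY orders_active_customer_idx;
    ALTER INDEX orders_active_customer_idx_new RENAME TO orders_active_customer_idx;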
You got to this database by writing to it, so you frequently have really high update rates against a unique index, and those can create locking issues if you have a lot of parallelism, like 50 workers all writing to it. IoT-style applications recording sensor data can have this problem, especially with clustered keys like serial, which tend to all be hitting the same index pages. One thing that we have gotten great performance out of: if your keys are guaranteed to be unique anyway, because they come from a sequence or Snowflake or one of those guys, or they're UUIDs, you might consider just dropping the unique constraint and getting some performance benefit out of that. Or switch to UUIDs if you don't mind; of course, if you're generating a huge number of them, the contention may just move to the random number generator.

Your write capacity at this point might start being constrained. This is the point where sharding starts becoming important. There are all sorts of options. I like Citus a lot. There's Postgres-XL if you want to become part of a development community. You can do your own custom sharding, regional or application-based sharding. The nice part about sharding, done right, is that it'll also significantly accelerate your reads, because large queries can be farmed out over multiple shards. Just understand that your admin complexity is going to go up significantly here, so take a deep breath and be ready for it.

OK, huge databases. PostgreSQL can handle really, really big databases straight out of the box, community edition, no extensions, just drop it in. You just have to make some choices. The thing is, at this stage everything's a unique animal, so your questions are going to be things like: what's the working set? If most of the data is archival, performance will be more manageable. But if it's archival, why didn't you archive it? Get it out of the database. Separate the data into a transactional system and a data warehouse. You probably want to have done this like two orders of magnitude before, but really do it now. Logical replication's great for that. And now is the time you start doing really fun stuff, like large-scale sharding. Instead of having just one giant database or closely connected nodes, you can do geographic or enterprise sharding, splitting your databases up based on geography, on enterprise functionality, things like that. If you want to get really advanced, you can start doing multi-master: you can use Bucardo or 2ndQuadrant's BDR for doing multi-master on Postgres. Or you start thinking about data federation, like moving archival data to other data stores: move it to Redshift, Greenplum, one of those guys, or even into cold storage, big old S3 buckets. Or, if you want to get really fancy, you can start using foreign data wrappers to federate databases behind a single API. That's fun. Or just run lots of smaller databases inside the same Postgres instance.
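Going back to the foreign data wrapper idea, a minimal sketch of federating an archival database behind the same SQL interface with postgres_fdw (hypothetical host, database, and table names):

    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    CREATE SERVER archive_server
        FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'archive.internal', dbname 'archive');

    CREATE USER MAPPING FOR CURRENT_USER
        SERVER archive_server
        OPTIONS (user 'app', password 'secret');

    -- Pull the remote definition in; old_events now transparently queries the archive.
    IMPORT FOREIGN SCHEMA public LIMIT TO (old_events)
        FROM SERVER archive_server INTO public;

    SELECT count(*) FROM old_events;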
So, in conclusion: Postgres is amazing. Thank you, thank you for coming to my TED talk. It can handle everything. It is amazing that one product, using the same binaries, can handle this range of data; that is pretty astonishing. The nice part is it'll grow with you. And it's important not to over-tool your installation and go crazy at a very early stage, in part because you don't want to implement more than you understand. But always keep an eye out for: when we hit that next order of magnitude, what are we going to do? What's our roadmap for getting there? You don't have to look much farther ahead than one order of magnitude, but keep that next order of magnitude in mind. And thank you very much.

Thanks, Christophe. We do have time for a couple of questions, so please raise your hand and wait for the microphone so that we get it on tape, or digital, or whatever.

Hello, thanks for your talk. What's your strategy for estimating the size of the working set?

Damn you, I've been busted. The answer is I don't have a consistent one. It's usually looking at particularly slow queries, what they're doing, or how much page replacement is happening on them. I wish I had a tool I could run for a while that would just say: the working set is 83 gigabytes, have a nice day. But it's one of these things that I kind of have to reinvent each time. I wish I had a better answer.

OK, do we have other questions? Yes, there's one over there.

When you talked about partitioning tables, I didn't acoustically understand what you said about the second criterion for the column that you choose.

Right, for the partition key.

You said it should be a column that is set when the row is created, and there was something you said about the range.

Right. A good set of rules for a partition key are: it's relatively invariant. It doesn't have to be 100% immutable, but it should not change all the time, because then you might have to move rows between partitions; that used to be hard, and now it's just expensive. It should divide the data into relatively even partitions; if it divides it into two partitions, one of which is almost everything and one of which is almost nothing, that's a bad partition key. And usually people can get that part fine. The hardest part is that it needs to be used in essentially every query, either as an equality or as a narrow range, so that very few partitions are selected by the query. Because if you don't do that, it has to scan every partition and unify the results, and that's worse than just one big table.

That makes sense.

And one last question here, somewhere over there, or not, ah yes, that one. And again, when I have a chance, the slides will be uploaded to that site.

Thank you for the talk. Don't you think that for really huge data, the complexity of using Postgres directly is so big that you should build some system with metadata on top of your storage to handle this data, on top of standard Postgres, let's say?

I don't think I quite got the question, I'm sorry. Was it a question or a comment?

It was a question: if you need to handle really big data, should you build a system with metadata on top of Postgres?

Well, I'll turn it into the question: if you are handling huge data, should you build a system using metadata on top of Postgres? And the answer is, sure, there's no inherent problem with that. I think it depends a little bit on your data. If this is a consolidated database drawing from a very large number of data sources and source schemas, then building a metadata system will be much more valuable than if it's, say, sensor data from one kind of sensor. For example, if all you're doing is sensor data from one brand of wind turbine, to use a real-life example, there's not a lot of metadata there, because it's one table, one schema; that one table just happens to have 83 billion rows in it. So at that point, it would just be more work than it's worth to build a metadata system.
But if your system is consolidating, say, voting records from across the whole EU, where everybody has their own systems and complications and things like that, then a metadata system would probably not just be useful but probably be necessary.

So thank you very much, Christophe, once again, please.