I just wanted to start out with a quick show of hands. How many people here run Postgres in production on Kubernetes? Awesome. How many people run very large Postgres clusters in production? Very cool. And another show of hands. How many people think it's possible to run very large Postgres clusters on Kubernetes? Great. Cool. Well, hopefully, by the end of this talk, we can get more of those hands raised. And a quick spoiler alert: we're going to show some ways to make it possible to restore very large Postgres clusters on Kubernetes 300 times faster than you can today. All right. So welcome, everybody. My name is Michelle. I am a software engineer at Google. I have been a Kubernetes maintainer since 2017, and I am a SIG Storage tech lead. And joined here. Yeah. Thanks, Michelle. I'm Gabriele. Gabriele Bartolini. I'm VP of Cloud Native at EDB. EDB is a company that contributes to the Postgres open source project. I've been using Postgres for more than 20 years, and now I'm also a Data on Kubernetes (DoK) ambassador and an open source contributor. In the past, I created, I don't know if you're familiar with it, Barman. Does anyone know Barman for Postgres here? Okay. So I created Barman in 2011. It's a backup and restore manager for Postgres. And now I'm also a maintainer of CloudNativePG, which is an operator to run Postgres in Kubernetes. Thank you. Cool. Yeah. So we'll first start off with a background on how Postgres does backup and recovery. We're going to talk about the new volume snapshot backup and recovery feature in the CloudNativePG operator. We're going to dive a little bit into the details of the actual APIs and how you use them. We're going to show a demo, and then we will wrap up afterwards. Yeah. So in this first section, I will go through some important concepts behind disaster recovery with Postgres databases. So disaster recovery, together with high availability, is one of the core components in IT to achieve business continuity.
Planning a business continuity solution always starts with defining the goals to achieve as an organization. Once these goals are defined, we can shape our infrastructure and our systems accordingly. But how do we define these business continuity goals? Over the past years, two primary metrics have emerged. The first one is RPO, recovery point objective, which is the amount of data that we can afford to lose after a failure. RPO is primarily a disaster recovery metric. The second one is RTO, recovery time objective, which is the time needed to restore a service after a failure. RTO is primarily a high availability metric. So it's only through an exercise of risk management and cost efficiency that organizations find the right balance between these two opposing metrics. One of the coolest aspects of Postgres is its innate flexibility and impeccable robustness that comes straight out of the box. So it's no wonder that Postgres has earned its reputation as the ultimate rock-solid database, as this T-shirt shows. So let's see why. Going back to 2001, Postgres introduced crash recovery using write-ahead logging, which marked a significant step in data durability. You probably remember the LAMP stack. At the time, Postgres and MySQL were emerging, and Postgres was worrying about this stuff, about not losing data, more than about performance. And that's what made MySQL more popular at the time. But, I don't know if you know, a recent Stack Overflow survey revealed that Postgres is now the most popular database in the world. So in 2005, the introduction of continuous backup and point-in-time recovery fortified Postgres through online physical base backups and WAL archiving, enabling effective disaster recovery. These pioneering features are the focal point of today's presentation.
Over the next decade, Postgres expanded its continuous backup infrastructure to include the advanced replication system that we witness today, primarily serving the high availability needs of an organization. It's important to note that this presentation doesn't cover pg_dump. pg_dump only generates SQL-level snapshots of the database, and these are not suitable for business continuity. Instead, we focus on continuous backup. So before looking into Postgres' backup and recovery infrastructure, let's grasp some fundamental concepts. Postgres writes data in 8-kilobyte pages inside a directory called PGDATA. Transactions are stored in write-ahead log files, also known as WAL files, inside pg_wal. Shared buffers serve as a cache for better performance, and each connection with a client is managed by a dedicated process known as a Postgres backend. When a Postgres backend requests a page from disk, the page is first loaded into the shared buffers and then returned to the backend. And when a backend changes the content of a page in memory, that change is first saved in the write-ahead log, not the data files. So as you can see, the shared buffers have the content, and the information is written in the WAL segment. This is the reason why this is called the write-ahead log, or simply WAL. For better data durability, Postgres allows you to archive each WAL file in another location, and this is normally referred to as the WAL archive. Postgres works on the assumption that shared buffers and data files might differ at any time. It's the checkpoint process that ensures that dirty pages are regularly flushed to disk. So in brief, to ensure smooth disaster recovery, we need to focus on safeguarding base backups of the PGDATA and the WAL archive. These resources are essential for point-in-time recovery, and at the same time they serve as the bedrock for Postgres replication. So let's examine now the mechanics of continuous backup.
With an active Postgres server and its PGDATA, the current WAL file is consistently archived in a separate storage location, such as, for example, an object store. Files inside the PGDATA need to be physically copied. These copies are called base backups, and Postgres provides an API for taking them online without stopping the database. These are called hot physical backups. The procedure is quite simple. You invoke the Postgres API to start a backup, it's called pg_start_backup, or pg_backup_start now, and begin copying all the files inside the PGDATA. This process could take a few minutes, a few hours, a few days. It depends on the size of your database. And in the meantime, those files might change, but this is not a problem because, as I said before, Postgres expects that all changes are stored in the write-ahead log first, and they are saved in the WAL archive. So by saving the WAL archive and the base backups, we're fine. Concluding the backup involves signaling the end of the copy process through the Postgres API and awaiting the final WAL file to be archived safely. For a backup to be consistently restored, you need all the WAL files from the start of the backup to the end of the backup. Picture this now. You've got your database in production. As time marches on, your database turns out WAL files, recycles them, and beautifully archives them. So your task is as simple as scheduling backups, whether daily or weekly. It's really up to you. And if you do this, you basically have continuous backup in Postgres. Notably, Postgres does not require a specific implementation for copying these files. And this is very important for us, because with our operator, we're now using volume snapshots. It's a generic implementation. So please follow me now, as this is very important.
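In SQL, the low-level API just described looks roughly like this. This is only a sketch: the function names are from Postgres 15 and later (older versions use pg_start_backup and pg_stop_backup), the label is an example, and both calls must happen in the same session, which stays open for the whole copy.

```sql
-- Begin an online (hot) physical backup; 'nightly' is an example label.
SELECT pg_backup_start('nightly', fast => true);

-- ...now copy everything under PGDATA with any file-level tool
-- (rsync, tar, a storage snapshot, and so on)...

-- Signal the end of the copy and wait for the final WAL to be archived.
SELECT * FROM pg_backup_stop(wait_for_archive => true);
```

In practice most people use a tool such as pg_basebackup, Barman, or an operator that drives this API for them.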
After you have a catalogue of base backups and a continuous sequence of WAL files in the archive, Postgres allows you to recover to any point in time, from the end of the first base backup that you have to the latest committed transaction contained in the last archived WAL file. So this is simple and clear. Suppose you have a disaster now and you need to recover to the point before the disaster. You basically copy the latest available base backup to the server where you want to restore Postgres. And then, for example, you configure Postgres to recover up to the latest available transaction in the WAL archive. Postgres starts fetching all the WAL files from the first WAL required by the base backup and applies the committed transactions until it reaches the recovery target, which is, in our case, the end of the WAL, okay? Then it promotes itself, becoming ready to serve your applications. And this is important to note. I just provided an example of restoring until the end. This is the case of a full disaster. But suppose you delete a table or you put a wrong WHERE clause in your SQL statement. You can go back to any point in time, okay? So as demonstrated earlier, Postgres is fully equipped to meet your business continuity needs. What you require is to set up three things: regular base backups, configuration of continuous WAL archiving, and distribution of backups and WAL files across multiple locations for enhanced global RPO and RTO goals. If you adopt these practices, you gain the ability to recover your system at any given moment. And this is a proven strategy that many organizations have already embraced over the last decade and more, outside Kubernetes. But this is KubeCon, this is 2023. So let's delve into how the CloudNativePG operator effortlessly integrates all of this and conceals the underlying complexity for you. So CloudNativePG is actually more than an operator.
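On a plain Postgres server (outside any operator), the point-in-time recovery configuration just described is a handful of settings. This is a sketch for Postgres 12 and later; the restore_command shown assumes a Barman Cloud archive and a bucket name that is purely a placeholder, and the target time is an example:

```
# postgresql.conf fragment for PITR. After setting these, create an empty
# recovery.signal file in PGDATA and start the server.
restore_command = 'barman-cloud-wal-restore s3://my-backup-bucket pg %f %p'
recovery_target_time = '2023-11-07 10:00:00+00'   # omit to recover to end of WAL
recovery_target_action = 'promote'                # promote when the target is reached
```

With recovery_target_time unset, Postgres replays every archived WAL file to the end, which is the full-disaster case described above.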
So it aligns perfectly with the principles outlined by Jeff Carpenter and Patrick McFadin in this book. It harnesses the Kubernetes API, operates declaratively, prioritizes observability, and comes with built-in security. This robust solution includes a production-ready operator for all supported Kubernetes versions and a suite of operand images for Postgres. CloudNativePG sets itself apart from other Postgres operators, as it directly extends Kubernetes to manage the entire life cycle of a Postgres database, encompassing essential day-to-day operations such as automated failover or backup and recovery. In contrast to other approaches, it forgoes the use of stateful sets, opting to manage persistent volume claims directly. CloudNativePG was initiated by my company, EDB, and is now an open-source project that is managed by an openly governed and vendor-neutral community. As maintainers, we're committed to seeking inclusion in the CNCF sandbox. And if you want to know more, you can scan this QR code to get to the project, download it, test it, and read the documentation. But let's now examine what CloudNativePG provides in terms of disaster recovery. CloudNativePG stores the WAL archive in an object store, and out of the box, WAL files are archived at most every five minutes. This is your worst-case scenario for RPO. Base backups can instead be taken in two ways: using object stores, or on volume snapshots using the new support for the standard Kubernetes API. And when dealing with large databases, volume snapshots emerge as the preferred choice for streamlined backup and recovery. Here's a quick comparison table outlining the backup and recovery methods between the object store and volume snapshot approaches. A crucial point to consider is the copy-on-write optimization offered by the storage, enabling you to leverage incremental and differential backup and recovery at the block level.
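The WAL archiving setup described here is declared directly in the Cluster resource. A minimal sketch, assuming an S3-compatible object store: the bucket, secret name, and key names are placeholders, and the exact fields should be checked against the CloudNativePG documentation for your version.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: cluster-example
spec:
  instances: 3
  storage:
    size: 100Gi
  backup:
    # Continuous WAL archiving to an object store via Barman Cloud.
    barmanObjectStore:
      destinationPath: s3://my-backup-bucket/
      s3Credentials:
        accessKeyId:
          name: aws-creds          # Kubernetes Secret holding the credentials
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: aws-creds
          key: ACCESS_SECRET_KEY
```

Once this is applied, the operator archives each WAL segment as it is produced, which is what gives the five-minute worst-case RPO mentioned above.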
These functionalities prove to be essential, again, especially when dealing with large-scale databases. And this is probably one of the most important slides. It summarizes the benchmark results I ran on an EKS cluster. To ensure consistency, the tests focused on base backup recovery without WAL recovery. I conducted tests across various database sizes, ranging from four gigabytes to over four terabytes, with consistent outcomes. Notably, volume snapshots systematically outperformed object stores in both backup and recovery operations. Consider the 4.4 terabyte scenario: backup speed showcased a 25-fold improvement compared to object stores. More importantly, recovery time demonstrated a 300-fold improvement over object stores, again underscoring the remarkable advantages that the standard Kubernetes API for volume snapshots brings to the game. So now, back to Michelle. All right, thank you. So let's explore what's happening under the hood with CloudNativePG backups. CloudNativePG is leveraging the Kubernetes volume snapshots feature. This feature went GA in Kubernetes 1.20. And it provides a standard and portable API across storage providers through CSI drivers. Today we have over 100 different CSI drivers available, supported by all the major cloud providers and on-prem storage vendors. The Kubernetes volume snapshots API lets you do three basic operations: create a snapshot of a persistent volume claim, delete that snapshot, and create a new persistent volume claim from the snapshot. And taking a look at the Kubernetes API in a little more detail, we can see that the API here follows a similar pattern to persistent volume claims in Kubernetes. So as a user, you will create a VolumeSnapshot object, and you specify the persistent volume claim you want to take that snapshot of.
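Such a VolumeSnapshot manifest looks roughly like this. A sketch only: the snapshot class and PVC names are placeholders for whatever your CSI driver and cluster actually provide.

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: pg-data-snapshot
spec:
  volumeSnapshotClassName: csi-snapclass          # class provided by your CSI driver
  source:
    persistentVolumeClaimName: cluster-example-1  # existing PVC to snapshot
```

Applying this object is what triggers the CSI driver to take the snapshot in the underlying storage system.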
You can also specify custom parameters and configuration for those snapshots with a volume snapshot class. And then from there, Kubernetes will invoke the CSI drivers to actually go take the snapshots in the underlying storage system. And then when you actually want to restore your workload with a new disk, what you would do is create a new persistent volume claim object where you specify that volume snapshot as the data source. Kubernetes will then create a new volume and rehydrate the data from that snapshot. So that's what's happening at the Kubernetes level of things. But we can see here how the CloudNativePG API can simplify that experience. So here's the CloudNativePG cluster specification. We can see here that in order to configure volume snapshot backups, all you have to do is go to the backup section and specify the volume snapshot class that you want to use for taking them. And then if you also want to do the WAL archive backups, you also configure an object store for the Barman object store backup policy. So once you configure that, the next step is you can take backups in one of two ways. One way is you can create a ScheduledBackup object where you specify a schedule. This example here is showing taking a backup once a day. You specify the CloudNativePG cluster you want to take that backup of, and the method you choose will be volume snapshots. You can also take a backup on demand by using the CloudNativePG kubectl plugin, where you just give the cluster that you want to take a backup of and the method, which is also volume snapshots. And now in a disaster recovery scenario, if you need to restore that Postgres cluster, what you'll do is create a new CloudNativePG Cluster object, and under the bootstrap and recovery section, that is where you specify the volume snapshots that you want to restore this cluster from.
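The scheduled backup and the recovery bootstrap described here can be sketched as two manifests. The snapshot names are placeholders for snapshots produced by an earlier backup, and the field names follow the CloudNativePG 1.21 documentation, so verify them against the version you run.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: daily-backup
spec:
  schedule: "0 0 0 * * *"      # six-field cron (seconds first): once a day at midnight
  method: volumeSnapshot
  cluster:
    name: cluster-example
---
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: cluster-restored
spec:
  instances: 3
  storage:
    size: 100Gi
  bootstrap:
    recovery:
      volumeSnapshots:
        storage:
          name: pg-data-snapshot     # snapshot of the data volume
          kind: VolumeSnapshot
          apiGroup: snapshot.storage.k8s.io
        walStorage:
          name: pg-wal-snapshot      # snapshot of the WAL volume
          kind: VolumeSnapshot
          apiGroup: snapshot.storage.k8s.io
```

The on-demand equivalent of the ScheduledBackup is the plugin command mentioned in the talk, along the lines of `kubectl cnpg backup cluster-example --method volumeSnapshot`.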
And here you directly pass in the names of the Kubernetes VolumeSnapshot objects you want to use. All right, so let's go ahead and see this in action with a demo. For our demo, we're going to demonstrate this on a GKE cluster. And I'll probably have to look here to see what's going on. Okay, we're going to start with a three-node Postgres cluster. And we can see here, in the cluster spec, we'll specify the storage classes we want to use for the volumes. And then under the backup section, we're specifying the volume snapshot class that we want to use for our backup method. And we can also just check which volume snapshot classes we have in our cluster. Here, we're just using a default volume snapshot class using persistent disks. And then you can see on the top, we have the ScheduledBackup object, where we're saying take a backup once a day. All right, and so now let's also look at the Postgres cluster that's currently running. We'll see the name here, and we can see the number of instances. And then when we go down, we can see a couple of conditions about the cluster. We can see that the cluster is healthy, continuous archiving is working, and we can see when the last backup was successful or failed. Going down a little bit more, we can see some more details, like which pod is currently the primary. We can see all the persistent volume claims that this cluster is using, and we can also see all the other pods that are part of this cluster. And then we'll also see what backups we have currently taken. And so we can see our last backup was taken about six minutes ago. And we can also find the corresponding Kubernetes volume snapshots that were taken as part of that. And you can see here there's two snapshots, one for the main data volume and another for the WAL. And then now we're going to log in to the database to just inspect some of the tables that are available. This database was populated with data using pgbench.
We'll see here that this is a 22 gigabyte database. And we can see this has about 150 million rows of data. OK, so now we're going to simulate a disaster. We're just going to go and delete the cluster. Whoops, my bad. And so now we'll see there's no more pods left besides our client pod. And then we can see all the disks are gone. But we can see we still have our snapshots. And so now we want to restore this cluster. So let's take note of the last two snapshots that we want to use. We're going to start a watch on the CloudNativePG cluster object. And then we're going to kick off this script, which will basically start creating the new restored cluster. We can see that now we're starting to bring up the primary. And let's go ahead and inspect the new cluster spec here. Here we're going to first restore the primary, and so you can see the instances count is one. But after this is done, we can scale it out to the remaining replicas. And the most important part here is in this bootstrap section. Here we are specifying the two latest volume snapshots in the cluster, one for the data volume and one for the WAL storage. And so it takes about a minute, and after a minute, the pod is able to come up. And now we're just waiting for it to become leader. And there we go. Now the cluster is healthy. And then we're going to just inspect a couple of the objects. We see now some new persistent volume claims have been created. And if we take a look at those in a little more detail, we can see when we inspect this persistent volume claim, the data source is specified as the volume snapshot. And so this volume was created with data populated from that snapshot. All right. And then now we're going to go back, log back into the cluster, and inspect the database contents. We'll check the size again. It's still 22 gigs. And then we'll check the number of rows in the table. It should be 150 million. All right.
Sorry, we need to restore our slides. Yeah, I'm good with databases, but not with this stuff. OK. So now that we've seen an example of how it works today, I'd like to just talk about some of the future enhancements that we're continuing to look into in both the Kubernetes space and the CloudNativePG space. First, in Kubernetes, we are developing this new volume group snapshots feature, which will definitely help CloudNativePG take volume snapshot backups more efficiently, because you can do things like partitioning your data and optimizing for different storage efficiencies, and it will also allow taking those volume snapshots in parallel. Another enhancement we're working on in SIG Storage is the Container Object Storage Interface, COSI. This is very similar to the concept of CSI, but it's tailored for object storage, in that it's trying to standardize the control plane operations for managing object storage buckets. This would be very useful for CloudNativePG as well, since it manages object storage to do the WAL archiving. And then, I don't know if you want to talk about it. For CloudNativePG, we are actually working on version 1.22, which will be released by the end of the year, and which will support tablespaces. Tablespaces are a vertical scalability feature that will further improve the management of very large databases. And we just started, basically. We're just scratching the surface at the moment of what CSI drivers are actually able to offer us in terms of volumes and options. The next step will be PVC cloning. So basically, imagine this: you can scale up just by cloning volumes instead of running pg_basebackup. And this will also be used for in-place upgrades. OK, so the key takeaway is we've got a full open source stack now to run Postgres in Kubernetes. You've got CloudNativePG, PostgreSQL, and Kubernetes.
So you can really mitigate the risk of vendor lock-in. And the main benefit of using volume snapshots is to have, in general, better RPO and RTO goals, which, again, are the business continuity goals that we need to achieve. And they are suitable for all major cloud service providers, but they're also available on-premises. So our advice is to check what your storage classes provide, do your benchmarks, do your tests, and if you can, use volume snapshots. The other good thing is that you can actually also mix backup strategies. You can have hybrid strategies with object stores and volume snapshots in your system. So volume snapshots basically open a new era for Postgres in Kubernetes, because thanks to incremental and differential backup and recovery, we can manage very large databases, as you saw in the previous slide. And this is just a recommended reading for you. I wrote this blog article a month ago that gives you an idea about our view in terms of architecture for Postgres in Kubernetes. And there's also this other blog article that pretty much recaps the benchmarks that I ran as part of this presentation. You'll find these in the slides for later usage. So, any questions? My question is, can you make this multi-region hot-hot? Or is this limited to a single region? Yeah, so basically, the good thing about this system, and if you read the blog article about the architecture, you'll find a lot of information on the multi-region stuff. But essentially, by using volume snapshots, we delegate data mobility to the storage class. So as long as the volume snapshot is available in the other region, that's fine. We can recover. Thank you. I guess it's lunch. That's why everyone's leaving. We currently use the operator from Zalando, and I just wanted to ask you what differentiates yours, what's better, because I think that you started quite recently.
So the question is, primarily, what's the main difference between the Zalando operator and our operator? The Zalando operator is actually the second Postgres operator ever written. We are from Italy; we know the Zalando people well, and that's actually the first time I saw Postgres in Kubernetes. We took a completely different approach. With our operator, we actually tried to bring Postgres into Kubernetes. For example, we didn't use Patroni, we didn't use Barman or pgBackRest. So we didn't use the tools available in the Postgres ecosystem; we actually decided to leverage what's in Kubernetes. That's why I say it's more than an operator, because it's actually also a failover management system. It provides capabilities for observability and for backup and recovery, as you see now. And as I said, this is just the start. So it's fundamentally a completely different approach that we took. We waited. We started four years ago, actually. We released it open source a year and a half ago. And we waited for local persistent volumes to be basically ready. If you read the story, you'll see that before starting this project, we benchmarked running Postgres on Kubernetes against bare metal. We didn't find any difference, because that's the way we used to do it. We've been managing some of the largest Postgres databases in the world for years. And so we wanted to make sure that it could work on bare metal as well, and it does. And volume snapshots are also available on bare metal installations. So it's essentially a completely different approach. We don't use stateful sets; we use persistent volume claims. That's why I think we're able to do this stuff as well. It helps us with rolling upgrades, pod disruption budget control, all this kind of stuff. So have a look and join the community. And if you've got more questions, thank you very much. Thank you. I would like to thank Leonardo. Leonardo, please stand up. Today is his birthday. Leonardo is a maintainer of CloudNativePG. Any other questions?
Yeah. So thanks. This is amazing work. I just had two questions. Can volume snapshots be taken off cluster and restored to a different cluster as well? And the second one was regarding the CloudNativePG operator. It's sometimes useful to see a list of backups. If you've taken hundreds of backups in a long-running cluster, is there a way to do that with the operator? I'll reply to the second, maybe if you want to reply to the first. Yeah, sorry, what was the first question? If you can move it, sorry. So the first question is, is it possible to take the volume snapshots on one cluster and move them to a different cluster and restore there? Yes. You'll need to double-check the actual storage capabilities, but for the most part, the objects in Kubernetes are just references to the snapshots in the underlying storage system, so you should just be able to take those snapshot objects and import them into another cluster. Yeah. And for the backups, you can actually get the list of backups. We put a lot of labels on the backup objects by default, so you are able to, for example, see which backups belong to a cluster and the date they were taken. We've got several labels and annotations in both the volume snapshots and the backups. For example, it's important that for the volume snapshots we also store the spec of the cluster as an annotation. So the spec of the cluster is stored in the volume snapshots, and you only need that to restore everything. That's why this also integrates very well with Kubernetes-level backup tools. Thanks. OK. Thanks. Another question. Yeah, what are CloudNativePG's advantages over... OK. So essentially, CloudNativePG, it's Postgres. Then all the benchmarks. Basically, our exercise has been to bring Postgres into Kubernetes. So now we can say, in my opinion, I will never go back to VMs or bare metal to run Postgres. To me, this is the best way to run Postgres that we have.
And essentially, the same things that we were applying on bare metal and VMs, now you apply them on Kubernetes. You need Kubernetes skills. And to reach the TPS, you know, transactions per second, it's really an exercise of customizing Postgres to your organization, which is different from any other organization in the world, and running benchmarks. But this is Postgres. I mean, in the test that I showed before, I was able to achieve 1,100 transactions per second on the 4.5 terabyte database using synchronous commit and remote apply. So if you know Postgres, you know what that means. It means that when I write a transaction, not only do I wait for the standby to store it locally, I also wait for the standby to apply it and see it. So essentially, you have read consistency between the primary and the standby. These are amazing results. I mean, the vertical scalability in Postgres is something that we should not underplay, OK? And when we add tablespaces and partitioning, I think this is another interesting talk, maybe for next year or for Paris, you know. I look forward to it. There are no stateful sets here, so how do you scale up and down? You cannot remove a random pod, right? I mean, it's all done by the operator. The operator treats every instance like any other. So there's no primary or one that is better than the others, OK? So it's, again, treating them like cattle instead of pets, OK? We bypass the stateful set concept, and we control the persistent volumes directly, OK? Because, I mean, we've written this stuff for Postgres, as you saw, for 15 years, OK? So we have put all the logic inside Kubernetes, where we have more control than ever, actually, because applications that reside where the database is benefit from having a single authority that controls the network and primary routing as well, OK? So that's why, I mean, we've never experienced split brain with CloudNativePG.
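The synchronous commit with remote apply setup mentioned here is, on the Postgres side, a small configuration fragment. A sketch only; the standby names are placeholders for whatever your replicas are called:

```
# postgresql.conf fragment on the primary: each commit waits until at least
# one listed standby has both flushed AND applied the WAL, giving read
# consistency between primary and standby (at a latency cost).
synchronous_commit = remote_apply
synchronous_standby_names = 'ANY 1 (standby1, standby2)'
```

This is the trade-off described in the talk: a committed transaction is guaranteed to be visible on the standby, which is why achieving 1,100 TPS on a 4.5 terabyte database with this setting is a strong result.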
And when you talk about failover management, that's quite impressive. But one thing I suggest that you look at is the amount of end-to-end tests that we continuously run in our pipelines and, yeah, and how we build the operator in general, you know? But, yeah, so basically, we just control everything. We hibernate the cluster. We can fence the instances. So even the rolling upgrade procedure, we first upgrade the standby, and then we give you the possibility to do the switchover or simple restart of the instance and do it automatically or manually, OK? So this is thanks to directly working with persistent volumes. All right, any other questions? Cool, thank you, everyone. Thank you. Thank you.