I'm going to talk about Postgres backups and disaster recovery today. A little bit about me: I've worked as a database administrator at OmniTI Computer Consulting for the past one and a half years, and before that I interned at the same company in the summer of 2012. I recently completed my master's in computer science; that's my blog and that's my Twitter handle. I don't blog that often, but I should probably start.

Here's a little bit about what this talk covers. First is the recovery point objective, or RPO. Say your managers tell you that any backup taken within the last 24 hours is useful; that means backups older than 24 hours no longer satisfy the objective, and you should always ensure that your backups fall within the RPO. Second is the recovery time objective, which is much simpler to understand: it's basically the time your restore process takes in case of a disaster. You should test your backups and your restores, so that when your manager or your client asks, in a disastrous situation, how long it's going to take, you can tell them exactly that.

These are the types of backups. There are two: logical backups and physical backups. On the logical side there are pg_dump and pg_dumpall. pg_dump is great; pg_dumpall is also great, but it has disadvantages that I'll cover on the next slide. Physical backups have two subcategories: online physical backups, which are very common and don't require shutting the database down, and offline backups, which are rarely used because you actually have to bring the database down before you can take them. There are uses for offline backups that I'll mention later, but if your database is a production machine that needs to stay online 24/7, an online physical backup is what you want.

Logical backups: I'm mainly talking about pg_dump here. It has multiple levels of granularity: you can back up a single database, a single table, a single schema, or pick and choose multiple tables. You get a lot of flexibility in deciding exactly what you take a backup of. pg_dump is a great tool, with plenty of options you can look up in the documentation. What matters more here is the format option, which controls compression: the plain format gives you an uncompressed text dump, while the other formats, custom, directory, and tar, are the ones pg_restore can work with, and the custom format compresses by default. The thing with plain dumps is that unless your table, or whatever object you've chosen, is really small, restoring them can take a lot of time. Postgres includes a restore tool called pg_restore, and pg_restore cannot read plain-text dumps. So unless your object is really small, go ahead and use a non-plain format in your pg_dump command, just so you can take advantage of pg_restore.

There's also the option to select specific objects for a restore: pg_restore lets you list the objects you want restored from a backup file. So even if you've taken a backup of an entire schema, you can say which table you want to restore, and that's only possible when you tell pg_dump to produce one of those non-plain formats. Multiple built-in compression options I've already touched on. And it's easy to use, with no extra setup required; it comes with the core installation of Postgres, and it's great.
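In practice that workflow is just a couple of commands; here's a minimal sketch, with made-up database and table names:

```bash
# Hypothetical names. -Fc produces the custom (compressed) format, which is
# what pg_restore can read and restore from selectively.
pg_dump -Fc -f /backups/mydb.dump mydb
pg_restore -l /backups/mydb.dump                            # list what the dump contains
pg_restore -d mydb -t important_table /backups/mydb.dump    # restore just one table
```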
I'm sure most of you already use it, but if you don't, you should look into it. The disadvantages: the first, compared to physical backups, is that it tends to be slower, and a dump is frozen in time. Point-in-time recovery is the ability to tell your database exactly what point you want to restore it to, and you don't get that with logical backups.

Then there's pg_dumpall. It doesn't back up large objects, and it cannot create binary-formatted backups, which means no compression and no way to take advantage of the faster parallel restore with pg_restore. As I just explained, since there's no compression, pg_restore can't be used, so if your database is really large it will take a long time to restore, and the recovery time objective I talked about earlier goes up. Having said all that, pg_dumpall is very useful when you use pg_dump for the actual data backups and pg_dumpall to create dumps of globals like roles and permissions. That's where pg_dumpall really comes into play.

This is an example of how you'd set up a script that does logical backups. The script looks a bit complicated because it loops through a file that lists every database in the cluster you want to dump, along with the options for each; the dump statement basically issues a pg_dump command for each line in that file. What I wanted to show here is that a fairly complicated system of nightly pg_dump runs can be condensed into such a small script. If you just want to dump a single table or a single schema, it's going to be much simpler than that. So it's really handy, and it's really simple to set up.
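As a rough idea, here's a minimal sketch of such a nightly job; the list file, paths, and options are all made up, and the real script also carries per-database options:

```bash
#!/bin/bash
# Minimal sketch of a nightly logical backup job.
BACKUP_DIR=/backups/logical/$(date +%Y%m%d)
mkdir -p "$BACKUP_DIR"

# Globals (roles, permissions) come from pg_dumpall; data comes from pg_dump.
pg_dumpall --globals-only > "$BACKUP_DIR/globals.sql"

# One database name per line in a hypothetical list file.
while read -r db; do
    pg_dump -Fc -f "$BACKUP_DIR/${db}.dump" "$db"
done < /etc/pg_backup/databases.list
```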
We come to physical backups now. Advantages: it's faster if you want to restore an entire cluster. It allows for point-in-time recovery, that is, you can name the exact time up to which you want to restore your database. Incremental backups are another great feature; they help with storage management of backups, which I'll cover later, and they're a major reason why and how you can speed up your recovery process. And you get good default compression on certain file systems. The file system I'm mainly going to talk about is ZFS, but it's not the only one with the features and support you need to set up physical backups for your databases. Disadvantages: you cannot restore specific database objects. The luxury you had with pg_dump of picking and choosing exactly what to back up isn't there with physical backups; they back up your entire data directory or file system.

Here's an example of how you'd go about setting up a physical backup script. This part of the script is basically the functions and the sequence you'd run them in. It reads the command-line arguments first. The sanity check asks the system what kind of access it has: is everything working, is SSH working, is the connection to the machine you want to send the backups to all right, and so on. It places a lock file; there's a reason for that which I'll come back to. One thing with file-system-level backups is that you need WAL archiving turned on: archive_mode in your postgresql.conf needs to be on so that you can provide point-in-time recovery. Once that's in place, the script checks whether the database is offline or online; these are the two subcategories we saw earlier.

If the database is online, the script needs to put it in backup mode before it can take a consistent snapshot of the whole database. Once that's done, it takes the ZFS snapshot and then stops backup mode. Now, about the lock file and the role it plays: Postgres has two functions for this, pg_start_backup and pg_stop_backup, and you need to issue pg_start_backup in the database before you can take the snapshot. The thing is, you cannot have multiple start backups in progress at the same time, so if the script is already running and another copy tries to run against the same database, it will fail. Having a lock file, or some other mechanism that prevents that from happening, really helps. The else branch is simply the offline case: you don't need to worry about backup mode and can take the snapshot straight away. Then the script transfers the backups, clears temporary files and such, and removes the lock file. Any questions on the script? I'm going to move ahead.

There are two ZFS capabilities that I really like: restoring from snapshots and zfs rollback. I'm pretty sure equivalents exist for some other file systems too, though not all. zfs rollback is basically telling your file system to roll back to a particular snapshot, on the same machine you're working on. I had the opportunity to use it not long ago when I was testing upgrades to Postgres 9.4 and something went wrong in my test. Since I'd taken a snapshot just five minutes before, all I had to do was issue zfs rollback with the snapshot name; it took a fraction of a second and my database was back to normal. It's that simple. That's certainly not your traditional backup, but it's something you can use when you've done something wrong, and it's a pretty brilliant feature.
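The online branch boils down to a handful of commands; a minimal sketch, assuming the data directory sits on a made-up ZFS dataset called tank/pgdata:

```bash
#!/bin/bash
# Sketch of the online snapshot sequence with hypothetical names.
SNAP="tank/pgdata@nightly-$(date +%Y%m%d)"

psql -U postgres -c "SELECT pg_start_backup('nightly', true);"   # enter backup mode (fast checkpoint)
zfs snapshot "$SNAP"                                             # consistent file-system snapshot
psql -U postgres -c "SELECT pg_stop_backup();"                   # leave backup mode

# And the rollback I mentioned; stop Postgres first if the dataset is a live data directory:
# zfs rollback tank/pgdata@nightly-20150101
```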
So why do you need multiple types of backups, both logical and physical? I think it's pretty evident. If you have database-level corruption, for example, you'd want to restore the entire thing from scratch, so you definitely want physical backups: they're faster, the compression is better, transferring the backups is faster, and so on. But what if somebody just dropped a table, and you've only just found out there's a problem with one table or one schema? You definitely don't want to restore the whole database or the whole cluster for that; you want logical backups at hand. Both have their own advantages and disadvantages, and together they ensure you're ready to handle any kind of disaster recovery situation. It's always a good idea to have multiple kinds of backups, not just one.

I'm guessing all of you manage or have at least one kind of backup, but you should also look into backup validation. Unless you know your backups work, they're no use. The day before yesterday I was talking to someone who said he keeps multiple types of backups for his database, which sounded great; but then he said the reason is that he doesn't know which one will work, so he's just hoping one of them does. That shouldn't happen.

So, backup validation. Why is validation important? To communicate effectively with your clients and managers about their expectations and your estimates; if their expectations are something you just cannot meet at the moment you're needed most, everybody ends up disappointed as well as stressed. There's also procedure and external factors: it's never as easy as just issuing a restore command, there are other things at play, like file permissions and access. Dev database restores aren't just the testing part; your devs also get fresh data to work with, so it's advantageous to them as well. And reporting databases: if dev databases aren't always easy to restore, reporting databases are the perfect candidates to test your restores on. Often a reporting database is only used during certain hours of the day, so at night, when nobody is at work, you can basically shut it off and refresh it. For many companies it's actually business policy that reporting databases are refreshed every day, so that's a very good opportunity to test your daily prod backups.

This is a snippet of a log file from a reporting database, showing how the log should look when everything goes according to plan. It starts by testing the backup, restores the file system (this is a file-system restore you're seeing), starts Postgres, does a database-wide validation by running a vacuum on the database, and then reports OK. When things go wrong, you should have set things up so that an error emails you or pages you. We'll talk about alerts in a moment.
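Stripped down, that nightly restore test is a short sequence; a sketch where restore_latest_backup is just a stand-in name for whatever actually pulls the backup onto the reporting box:

```bash
#!/bin/bash
# Sketch of a nightly restore-and-validate run; paths and the restore step are made up.
set -e
restore_latest_backup /var/pgdata            # hypothetical: fetch and unpack last night's backup
pg_ctl -D /var/pgdata -w start               # start Postgres on the restored data directory
vacuumdb --all --analyze                     # database-wide validation pass, like the log above
echo "restore test OK $(date)" >> /var/log/restore_test.log
# with set -e any failing step aborts the script; hook that failure into email or paging
```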
Okay, first we're going to talk about backups management. What does management involve? There's a retention period to take care of, there's security, and there's location. We'll talk about each of these.

What is a retention period? It's basically the rule in your system for how old your backups are allowed to get. You can have a one-month retention, you can have a one-week retention; it all depends on the storage you have on your backup server and on communication with your clients. There are three main kinds of retention rules you can set up; of course there can be others. The first and most common is to remove all objects older than x days and y minutes: remove all backups older than two weeks, say, and that's your retention rule right there. The second is to remove all but the latest x backups. It's fairly similar to the first, but the implementation is different: for the first you can just use find with -mtime and the number of days or minutes, while for the second you specifically tell the system to retain only the latest three or five backups. The reason the second type exists is that sometimes you're very short on space: say you already have five backups on the backup server, and as the sixth one starts transferring you run out of space every time. You could schedule the retention job to run just before the backup starts and remove the oldest backup with an -mtime rule, but clients can be picky, and even you might feel that, for a certain window, you only have four backups, which sort of violates the policy. To avoid that, you can hard-code the rule to say: at any point in time, keep the five latest backups.

The third kind applies if you're doing incremental backups alongside full database backups; you can have separate policies for incrementals and fulls. The way I've seen this work with file-system-level backups is that you take a full backup, say, every ten days or once a week, let's say on Sundays, and the rest of the week each daily backup is only incremental. If something goes wrong on a Wednesday night, you'd typically restore by first taking the full backup from Sunday, then applying the incrementals from Monday, Tuesday, and Wednesday, and then you're good. All these retention rules are there for storage management, so when you're running out of space, another option is to split the retention period between full and incremental backups: retain fulls for a longer period, but keep incrementals only for the past week or ten days. Letting go of six or seven incrementals in one go can save a lot of space, so in crunch times, when you're very short on space, this can be your savior.

Here's an example. The first rule uses find with -mtime. The second, which ensures that at any time you have the x latest backups, is implemented here as a check that every full backup file is nine days old or newer, so the oldest one kept is nine days; that's the second kind of retention period I was talking about.
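In shell terms, the first two kinds of rules are one-liners; a sketch with made-up paths, assuming a GNU userland:

```bash
# Rule 1: remove everything older than 14 days.
find /backups/full -name '*.dump' -mtime +14 -delete

# Rule 2: always keep the five newest backups, regardless of age.
ls -1t /backups/full/*.dump | tail -n +6 | xargs -r rm --
```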
Security: the most obvious thing is to first ask your sysadmin whether the place you want to transfer your files to is allowed and within company policy; don't just set up passwordless SSH to just about anywhere. The second point is a little more complicated, for those of you with experience in shared hosting environments. The way it basically works is that multiple clients run on the same physical server, so security is a big issue. If you provide a single backup server as a service, you're storing all the client backups on that one server, and you don't want the individual client servers to be able to connect to it, because it holds many clients' backups and there are permission issues there. In those cases you want the backup script to run from the backup server, connect out to the client zones, make the backup, and bring it back; essentially, one-way communication between the backup server and the shared hosting environment. This is something I had to deal with, which is why I felt I needed to talk about it.

Off-server versus off-site: this is the location aspect I mentioned. There's a difference between storing backups off-server and storing them off-site. Off-server can be anywhere within your production environment: your production database runs on server A, and if server B in the same environment is empty, you can store backups there. But that doesn't mean the backup is off-site, or that it's somehow more secure; at the end of the day it's in the same physical location. Off-site specifically means your backups are stored in a different physical location, so that in an event of a disaster of a magnitude far beyond what we usually deal with, you still have a backup in a geographically different place. I've already given an example of off-server; examples of off-site backups are tapes, and Amazon S3, which is what I've used and found very convenient. If you run your own systems for your production databases, you can just ship backups off to S3; it has worked pretty well for me.

Alerts are another kind of management you have to keep in mind. What kind of alerts should you schedule on your backups? The first is to alert on the backup scripts and cron jobs themselves. The advantage of placing alerts there is that they're immediate: as soon as an error happens, you get emailed and notified. The kind of thing these alerts catch is a sudden permission issue; for example, you've moved your master database to a different server and that server doesn't have the same permissions the first one did, which is easy to miss. As soon as your script can't connect, it errors out and you're notified. The second is to alert on storage and retention, on errors in the retention logic: for example, if your retention rule is to keep the last month of backups and some backups have been failing so you only have the last three weeks, it should alert you; technically, it should alert you as soon as there are fewer than 30 backups for the month. The disadvantage is that such an alert is delayed, because it has to wait for the backup to land before it can check, but you can place these in pretty much the same places as the first kind. The third is to alert on backup validation, like the log I showed a while ago: as soon as a restore to the reporting database or the dev database has gone wrong, there should be a way to notify you.

I'm going to talk a bit about S3 for off-site backup management. It's ideal because it's cheap and there's plenty of space, and there's the option to transfer a backup in parts, so if you hit a network connectivity issue in between, you don't have to start from scratch; there are tools that split the backup into parts before they start sending it and then just continue after a network hiccup. It's secure, and it has a lot of policies, which can be a bit daunting if you're setting it up for the first time, but once you're done with the bucket definitions, permissions, and access definitions, it definitely pays off. For communication, the tools you can use are the AWS CLI and s3cmd. I've used s3cmd more, so I'm going to talk mainly about it, but the AWS CLI works very well too. Access keys make sure you can reach your bucket, and that you can get your backups back out of it when you need to restore. And there's encryption.
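The day-to-day s3cmd usage is just put and ls; a sketch with made-up bucket and file names, with the access keys coming from s3cmd's configuration file:

```bash
# Hypothetical bucket and file names; s3cmd reads its keys from ~/.s3cfg.
s3cmd put /backups/full/mydb-20150101.dump s3://client-a-backups/full/
s3cmd ls s3://client-a-backups/full/          # quick check that it actually arrived
```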
I'm going to talk in a bit of detail about bucket and user policies. When you go to your AWS console and click on users, you'll find a tab there for policies; what that does is let you define which buckets a user can access. For those of you who haven't used S3 before, buckets are essentially like directories, and you can have different permissions on each of them. So, for example, if you have three databases, three separate clients to take care of, you can create three buckets on S3, have separate users for each client, and set up permissions accordingly; there's ample security in place to ensure you're not violating anything.

Here's a sample bucket policy. The part in red is basically saying that this user can access these resources, and those resources are S3 buckets, or directories if that's easier to think about. Sorry, that's actually the bucket policy: there are two kinds of policies you need to set up. One is from the point of view of the bucket, defining what can access me, and the second is from the point of view of the user, defining what I can access. Usually, on Unix systems, this only works one way, with the permissions defining what a user can access, but in S3 the bucket itself can refuse access to any user. So this is the bucket policy saying this user can access me, and this is the user policy saying I can access these buckets. There are plenty of examples in the AWS documentation, but it can be confusing; all I intended here was to give you an idea of what your policy should look like, what exactly a user policy means, and how it differs from a bucket policy. It pays to set up policies on both ends, just to be sure.

Just like for the logical and physical backups, here is the main set of checks you'd want, in this order. First, check for a lock; the lock isn't a crucial factor here, but it's one of the checks. Then check when the latest backup was taken and alert if it's older than two days: if the latest backup wasn't taken today or yesterday, send me an email. It's that easy, and you should definitely go ahead and set this up.
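That freshness check fits in a few lines of shell; a sketch with a made-up path and alert address:

```bash
#!/bin/bash
# Sketch of the freshness check: complain if nothing newer than two days exists.
newest=$(find /backups/full -name '*.dump' -mtime -2 | head -n 1)
if [ -z "$newest" ]; then
    echo "No backup newer than two days found in /backups/full" \
        | mail -s "backup freshness alert" dba@example.com
fi
```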
Automation: why should you automate? So far we've talked about the kinds of backups, the backup scripts involved, the accesses, security keys, and cron jobs. It can be very difficult to manage all of this manually, especially when you have to deal with multiple databases and multiple machines, so, like most other things, automation pays off here too. Another scenario where automation really comes in handy is after a database failover: you can automate things like permissions and SSH keys so that the moment you move your master to another server, all of this is taken care of and you don't have to walk through a checklist of things to check or set up manually. And of course it ensures your systems are uniform and reliable. The main things to look into automating are your actual backup scripts, your cron jobs, your permissions and accesses, and your restores. The fourth one really depends on the environment: if you have reporting databases, you should definitely go ahead and automate the restores, but if you just have a dev database, you should probably discuss with the devs every few weeks or months to find the right time to do it.

An example of automation: this is a Chef cookbook, basically listing all the files involved in my S3 automation. You can see there are a number of backup scripts; the last one is the configuration file for s3cmd, and the second to last, the pg_dump upload script, is the one that actually uploads files to S3. Once you're set up with this cookbook, or with Ansible or Puppet, it doesn't matter which configuration manager you use, the overhead of moving things around and failing over really goes down.

pg_basebackup is fairly new and it's a great tool. It's included as part of core Postgres, so there are no external dependencies; it's just like using pg_dump and pg_dumpall, and it's a great alternative to implementing external file-system-based snapshots. It internally manages to take a consistent base backup without you needing to issue start and stop backup calls. Usually, when you're taking physical backups, you have to explicitly issue pg_start_backup and pg_stop_backup before and after taking the snapshot, but pg_basebackup does this internally, which means you can run multiple pg_basebackup commands at the same time. For example, you've failed over your primary to a secondary and now it's time to rebuild your slaves: if you have multiple slaves, you can run multiple pg_basebackup commands so every slave is being rebuilt at the same time. That's a huge advantage, though it comes with the side effect of not being terribly efficient when you're feeding multiple destinations at once; but you can do it, and that's only because there's no external start and stop backup requirement. Backups can be made from the master as well as from a standby.

The disadvantages: it's slower compared to snapshot-capable file systems, so if you have ZFS, definitely go ahead and schedule ZFS snapshot-based backups; but if you don't have a file system with those features, pg_basebackup is probably the way to go. And backups are always of the entire cluster; unlike pg_dump you cannot pick and choose, but that comes with any physical type of backup. There are a few requirements before you can use pg_basebackup. A superuser, or a user with the replication privilege, has to be used; if your user is not a superuser and does not have replication permissions, you cannot use it to connect and run pg_basebackup. Your archive_mode parameter isn't as much of a requirement, in that pg_basebackup works without it, but it's always good to have archive_mode on. And max_wal_senders should be at least one; if you already have streaming slaves, it should be one more than what you already have so pg_basebackup can work alongside them. In short, it needs one max_wal_senders slot just for itself.
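Run from the machine being rebuilt, it's essentially a one-liner; a sketch with made-up host and role names:

```bash
# Hypothetical host and role; the role needs the REPLICATION privilege and
# max_wal_senders on the server must have room for it.
pg_basebackup -h primary-db -U replication_user -D /var/pgdata_new -X stream -P
# -D: empty target directory for the new data directory
# -X stream: also stream the WAL needed to make the copy consistent
# -P: show progress
```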
Remember when I talked about ZFS rollback as a very different kind of backup? Here's another non-traditional kind: delayed replicas. Sometimes, again with reporting databases, or if you just have another box sitting there, it never hurts to set it up as a delayed replica. Why? Say you've set the delay to three hours, and something goes wrong on production: a big table vanishes, or somebody makes a bad change. You can always use the delayed replica to take a dump of that table and restore it. Of course, this comes with the disadvantage that there's only a short window before the replica, too, replays the mistake and becomes useless for this, but in production scenarios you tend to find out pretty quickly when things go wrong, so it definitely helps to have one.

Another non-traditional, not so common type of backup is having your DDL version controlled. If you've got a lot of functions in your Postgres database, it never hurts to make backups of them and keep committing them to SVN or Git or any other version control tool. Aside from having backups, it also helps with debugging: if a function that was working well yesterday has gone wrong today, all you need to do is diff the file and see what changed, which can be a huge boost to your productivity and debugging. pg_extractor is a good tool here; you can check it out. It uses pg_dump for its backups, but it separates each database object into a file of its own, so managing those in a version-controlled directory is much easier, and accessing them is very simple as well.

I'll just mention that documentation is important. We've talked about a lot of stuff: what kinds of backups we need, what our backup scripts look like, what our accesses should be, what our alerts should be. Once you've set it up, it all seems very obvious to you, but it might not be to a new person. So whatever you're setting up, if it's new, write down each and every thing: what access the backup server needs, what access a client needs to be able to send off its backups, and so on.

These are some of the references I've used in this talk. Let's go to questions now. I run Solaris and Linux; yes, not Windows. No, no, not on production. Yeah, yeah.