Thanks for the warm welcome, that's great. Welcome to Burning Down the House, a quick guide on how you can burn down your data center and hopefully put it all back again. More specifically, we're going to be talking about backup and restore of Cloud Foundry. I'm Henry. I'm Therese. We work on the Backup and Restore team at Pivotal, and in this talk we're going to cover the current approaches to backing up Cloud Foundry, what we've found people currently do, the problems with those approaches, our solution to backup and restore, and then what the future of backup and restore of Cloud Foundry and BOSH releases looks like. So why bother backing up Cloud Foundry at all? It seems like a reasonable question to ask. Well, one of the first things we find operators tend to do when they become customers, when they start to use Cloud Foundry, is ask about disaster recovery, ask about backup and restore. Because if you're running mission-critical apps on Cloud Foundry, then you need to have a very reliable way of backing up and restoring your CF. Fundamentally, we think that every company is a software company, and if you're running your mission-critical workloads on Cloud Foundry, you need to be able to recover not just your foundation but your apps, your data services and everything around them. We also find a lot of organizations use Cloud Foundry in their dev workflows. So if your Cloud Controller API is offline even for half an hour, that means that your developers, who are very expensive, are unable to push their apps, and your CI/CD pipelines aren't working. So what kind of things can go wrong with Cloud Foundry? The first set of issues we'll look at are data corruption problems, where you might want to roll back. That might be a failed upgrade of your Cloud Foundry deployment, user error (like, a few months ago, GitLab's sysadmin dropping the wrong database and finding the backups were all empty), or security issues.
More and more we're seeing ransomware as an attack vector, and the easiest way to get back from a ransomware attack is simply to repave your entire deployment and just roll back using a backup. And then there are more serious failures: hardware failures. So this is your storage area network dying, or your data center failing (say, it floods). And we consider these last ones, the hardware failures, a superset of all the rest, so that's what we're focusing on, and what we focused on in this project. So what does it actually mean to back up Cloud Foundry? It's a complex distributed system. What do I need to take out of it so that I can put that information back into a new deployment and have it all just work? That really begs the question: where does the state in Cloud Foundry live? Imagine Cloud Foundry as having just these two components. This is just a slice of Cloud Foundry looking at the Cloud Controller and the Diego subsystem: a Diego brain and some number of Diego cells. This segment of Cloud Foundry requires two data stores to function. You've got a relational database of some sort. It could be an internal MySQL, or an external service that you connect to remotely. And a blobstore, just a flat file store. That could be S3, it could be an internal NFS blobstore, whatever you like. These are the canonical places that Cloud Foundry stores state. So if we have a CF user, they cf push their app, the Spring app here. They push their app bits to the Cloud Controller, which will save some metadata down to the database about that app and then save the app source into the blobstore. Next, Diego sees that it needs to run an app. It receives a request for a long-running process. It pulls up the buildpacks and the app source and stages the app. So it essentially compiles it into a droplet.
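To make those two canonical stores concrete, here's a toy sketch in shell. Everything here is a stand-in: plain files play the roles of the CCDB and the blobstore, and none of the names or record formats come from the real Cloud Controller.

```shell
#!/bin/sh
# Toy model: a relational DB (one line per record) and a blobstore (flat files).
set -eu
workdir=$(mktemp -d)
db="$workdir/ccdb.txt"          # stand-in for the relational database
blobstore="$workdir/blobstore"  # stand-in for the flat file store
mkdir -p "$blobstore"

# "cf push": the Cloud Controller saves metadata to the DB
# and the app source into the blobstore.
echo "app=spring-music source_blob=pkg-0001" >> "$db"
echo "...spring-music source bits..." > "$blobstore/pkg-0001"

# Staging: Diego compiles the source into a droplet; the droplet goes into
# the blobstore and a record pointing at it goes into the DB.
echo "app=spring-music droplet_blob=drop-0001" >> "$db"
echo "...compiled droplet..." > "$blobstore/drop-0001"

# The DB now references blobs by name, so a backup must capture both
# stores consistently, or the references dangle.
grep droplet_blob "$db"
ls "$blobstore"
```

The point of the sketch is the cross-reference: a database row naming a blob that a consistent backup must not separate from the blob itself.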
And then once you have your droplet, details are saved into the database and the droplet itself is saved into the blobstore, and then Diego can run it: it pulls the droplet down and runs it. So if we want to back up and restore Cloud Foundry in a consistent way, we need to make sure that we have both of these data stores, and we have to make sure we take a backup of them at exactly the same time, or in some consistent way. Because if your database points to a compiled droplet that's actually not in the blobstore, Diego is going to try and start that app up and find that it's missing the droplet and can't run it. And crucial to this process is the fact that Diego is an eventually consistent system. So if you take this state and put it in a fresh Cloud Foundry that has no apps running in it, Diego will see, oh, I'm meant to have a whole bunch of apps running, and it will begin to spin them up very quickly. Obviously, Cloud Foundry is more complex than this. It has a whole lot of other components. Some are stateful, most are stateless and will just converge on the desired state. The only one I've not mentioned is the UAA DB, which has user auth information. So that's the state in Cloud Foundry. But what about Cloud Foundry itself? What about the VMs, the actual software running on those boxes? We need to go deeper. As I'm sure you will know, Cloud Foundry is deployed on a control plane provided by BOSH. BOSH is responsible for deploying the Cloud Foundry VMs. And BOSH actually looks quite similar in some ways: you have the BOSH Director, and then you have a relational database and a blobstore. So if you're an operator and you want to deploy something on BOSH, like a Cloud Foundry: first, you upload your release to BOSH (the release is essentially the source, just like the app bits were). BOSH saves metadata about that release in the database and then saves the entire release in the blobstore. Then you deploy.
Your deployment manifest gets saved down, and then it gets retrieved along with the release information, and then BOSH will compile your release. Sorry: BOSH will pull the release bits up from the blobstore, compile your release for your desired architecture, and then deploy your VMs. BOSH also then saves the compiled release down to the blobstore for faster redeploys later on. So again, we see a very similar picture. If we want to take a backup of a BOSH Director, we need to make sure we have both of these data stores, and they need to be in sync. It's no good having blobs in the blobstore without any corresponding records in the database, because your deploy is going to fail. BOSH is also an eventually consistent system, so if you put all this data back into a BOSH Director, it should stand everything up again. Although BOSH has a meltdown threshold, as some of you may have seen, at which point it just gives up if it thinks something too catastrophic has gone wrong. So this is sort of what the picture looks like, but there's actually more complexity, because you also have data services. So you have your BOSH Director and your Cloud Foundry and your data services, all of which need to be backed up, and probably all of which need referential integrity between each other. So actually backing all of these things up in a consistent way is quite tricky. When we started with backup and restore, we looked at what existing companies are doing. Because companies aren't just doing nothing; they all have their own solutions, often hand-rolled. The first one, and potentially the most common, is tools that reach under the hood, so to speak. These are often binaries or scripts that know a lot about Cloud Foundry. They know that this version of Cloud Foundry runs this version of MySQL internally, and therefore needs that particular version of mysqldump, and they will just reach into the appropriate VM, run a mysqldump, and pull the data out.
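A sketch of what such a reach-under-the-hood tool tends to look like. The version numbers and job names are invented for illustration; the point is the hardcoded coupling to one specific CF internals version, so the actual dump step is stubbed out here:

```shell
#!/bin/sh
# Hand-rolled backup tool: hardcodes knowledge about one specific CF version.
set -u
EXPECTED_MYSQL="10.1.24"   # invented: the version this tool was written against

backup_ccdb() {
  actual="$1"              # version found on the deployed VM
  if [ "$actual" != "$EXPECTED_MYSQL" ]; then
    echo "refusing to run: expected mysqldump $EXPECTED_MYSQL, found $actual"
    return 1
  fi
  # In real life: ssh to the database VM and run the matching mysqldump.
  echo "would ssh to the mysql VM and run mysqldump now"
}

backup_ccdb "10.1.24" && result_ok=yes          # works today...
backup_ccdb "10.2.7"  || result_broken=yes      # ...breaks after a CF upgrade
```

Every Cloud Foundry upgrade is a chance for the tool's baked-in assumptions to stop holding, which is exactly the fragility described next.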
So these will work, which is great, but they're very fragile. As soon as anything in Cloud Foundry changes, as soon as that version of MySQL changes, your restore tooling is not going to work. So you end up with a tool that has a lot of knowledge about Cloud Foundry, and you end up having to maintain that tool, and yourself needing a lot of knowledge about Cloud Foundry. Backwards compatibility is a problem: if those versions change, then you have to have additional migration layers on top so that you can restore to, say, a newer version of CF. And consistency is not fixed by this, right? You still might end up with a backup of your blobstore taken a few seconds before your backup of your MySQL, for example, and those records are then inconsistent. Some companies do replication. Active-active is where you have two foundations, and active-passive is where you have two, but one is in a passive, hot-standby failover mode. And, again, this works really well, but it is very difficult to get right. Speaking to someone on the Cloud Ops team, we found that you often end up with a lot of cruft in your IaaS layer, because when something goes wrong, operators will tend to jump on quickly and try to fix things, maybe not doing quite the right thing in the heat of the moment, and you end up with bits of state that shouldn't really be there. It obviously doubles your cost: if you have a 50-cell foundation, then you're now going to need to run 100 cells at least. And it doesn't mitigate against malware. It's not really a backup at all, because if some user pushes a privileged app that's malicious, it could damage your foundation, and then both foundations are damaged by it. We also looked at snapshotting.
So IaaS-level snapshots are IaaS primitives where you can just ask your IaaS to take a backup of the disk, and it's quite nice because the VM itself doesn't know anything about it; it just happens out of band. But that's also a weakness, because if the VM doesn't know a snapshot is being taken, then you might find that data has not been flushed to disk. Say you have a Redis VM with lots of stuff happening in RAM. You take a snapshot, and you've effectively just pulled the disk out from underneath the VM. All that data that was in RAM hasn't been saved down, because the VM didn't know it was about to be backed up. Also, you don't need all that state. If I'm backing up a Redis machine, I want the Redis RDB file. I don't want Redis itself, all the logs, the whole OS. All that stuff is just cruft, and I have to store it. It's also very slow on some IaaSes, and you still have problems with consistency. Another option is to just do nothing. Allow it all to burn down. If you scripted your CF deployment in the first place, then if everything does fall over, you can just bring it back up again automatically, recreate your orgs, users and spaces using scripts, and then CI pipelines can just repush all your apps. That's great if you can get it to work, but it requires a tremendous amount of discipline. Not all teams have that discipline, not all teams have pipelines, and Cloud Foundry is also a fun environment to play with. It's nice to just be able to push an app, but if you do that without a corresponding pipeline to push it, then that app is just gone if the CF fails. The time to recover is very long: you have to stand up CF and repush your apps, which necessitates recompiling them all. And if you use services, then you're out of luck: none of that data is backed up, because they're on separate BOSH releases.
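The "let it all burn down" approach amounts to something like the following. The cf CLI is stubbed out with a logging function so the sketch can run anywhere; the org, space, and app names are invented:

```shell
#!/bin/sh
# Sketch of scripted recovery: everything is recreated from declarative scripts.
# cf() is a stub that records what would run; in real life it is the cf CLI.
set -eu
recovery_log=$(mktemp)
cf() { echo "cf $*" >> "$recovery_log"; }

# 1. Recreate orgs and spaces from a list kept in version control.
for org in engineering data-science; do
  cf create-org "$org"
  cf create-space dev -o "$org"
done

# 2. CI repushes every app from source. This is the slow part: every app
#    gets restaged (recompiled) from scratch.
cf push spring-music -p ./spring-music

cat "$recovery_log"
```

Anything pushed by hand, outside these scripts, simply isn't recreated, which is the discipline problem described above.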
There are some other themes, other things people think about when looking at backups. How do you manage artifacts? How do you encrypt them at rest? How do you schedule backups? How do you deal with retention? Presumably you want to expire old backups at some point, because if your blobstore is a terabyte, you don't want to be keeping that forever. Do I take partial backups? Do I take complete backups? Do I want to just incrementally back up everything that's changed since the last backup, or do I have to back up the whole thing? So we end up with quite a large problem space, where we look at data consistency, how restore works, how we do encryption at rest, forwards and backwards compatibility, and it's really quite tricky. And Therese is going to talk about how we solved this. Therese? Thanks, Henry. So we took all the things we learned when looking at other solutions for backup and restore, and we came up with a new model. Starting from the problem space, we divided the issues into two categories. One category, how to orchestrate, includes all the concerns around the backup: things like encryption, scheduling, artifact management. The other category is how to do the actual backup and restore. So: what data needs to be backed up, how is the data backed up, and what does a restore look like? We've assigned these categories the roles of orchestrator and component, and we've crafted a contract between them. So we end up with a backup and restore framework based on the contract, which sets out the requirements for the orchestrator and the component. Now I'll talk through what the contract entails. The orchestrator expects the component to implement the hooks to do the backup and restore: lock, backup, unlock. The first requirement is for the orchestrator to trigger those hooks in a prescribed order. So for backup: lock, backup, unlock. And for restore: lock, restore, unlock.
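That prescribed ordering is the heart of the contract, and can be sketched as a tiny orchestrator loop over stub components. The directory layout and script names here are just illustrative, not BBR's actual on-disk layout:

```shell
#!/bin/sh
# Minimal orchestrator: run every component's hooks in the prescribed order:
# lock everything, then back up everything, then unlock everything.
set -eu
root=$(mktemp -d)
order_log="$root/order.log"

# Two stub components, each "implementing" the three hooks as tiny scripts
# that record when they were called.
for c in cloud-controller blobstore; do
  mkdir -p "$root/$c"
  for hook in lock backup unlock; do
    printf '#!/bin/sh\necho "%s %s" >> "%s"\n' "$c" "$hook" "$order_log" \
      > "$root/$c/$hook"
    chmod +x "$root/$c/$hook"
  done
done

# The contract: the lock pass completes across ALL components before any
# backup runs, which is what buys cross-component consistency.
for hook in lock backup unlock; do
  for c in cloud-controller blobstore; do
    "$root/$c/$hook"
  done
done

cat "$order_log"
```

Because every component is locked before the first backup script runs, the data stores cannot drift apart mid-backup.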
The orchestrator is also responsible for moving the backup artifacts: away after a backup is taken, and back before a restore. The other side of the contract is the component. If a component wants to be backed up and restored, it has to implement the backup and restore hooks. A component will implement the hooks that are appropriate to it; if it doesn't need locking, it can just implement backup, and the same goes for restore. When it takes a backup, it has to put the backup in a very particular place so that the orchestrator can find it, and on a restore it takes the artifacts from that same place in order to restore. It's also possible to specify the order of locking. So if multiple components implement locking, a component can say this thing needs to be locked before that thing. That part is optional. So those are all the things that are in the contract. The contract separates the backup orchestration from the actual backup and restore logic. And the backup and restore logic is written by the people who wrote the component. These people understand how the data is stored, what a correct backup looks like, and what a correct restore looks like. And because the backup logic is packaged with the component, it doesn't get out of sync: it's shipped with the component. So the contract addresses the fragility and compatibility issues that Henry described earlier by encapsulating the backup and restore logic within the component. There are other benefits to the contract as well. Firstly, correctness. By asking a component to do its own backup (say you're Redis), you can flush things to disk and then create a current backup with all of the data. Secondly, consistency. Because you have locking, because you can quiesce data changes while the backup is being taken across components, you can have the consistency that Henry talked about as being required between, for example, the database and the blobstore. Also, backup and restore can be smart.
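On the component side, the hooks are just scripts that agree with the orchestrator on an artifact location. A minimal sketch, with the artifact directory variable name and dump format invented for illustration (real BBR provides the directory to the scripts; the exact mechanism isn't shown here):

```shell
#!/bin/sh
# A component's backup and restore hooks, sharing one agreed artifact location.
set -eu
ARTIFACT_DIRECTORY=$(mktemp -d)   # in real life, provided by the orchestrator
state_file=$(mktemp)              # stand-in for the component's live data
echo "row1: app=spring-music" > "$state_file"

# backup hook: put the backup in the agreed place so the orchestrator finds it.
backup() { cp "$state_file" "$ARTIFACT_DIRECTORY/component.dump"; }

# restore hook: take the artifact from that same place and load it back.
restore() { cp "$ARTIFACT_DIRECTORY/component.dump" "$state_file"; }

backup
echo "row1: CORRUPTED" > "$state_file"   # disaster strikes
restore
cat "$state_file"                        # original data is back
```

The component authors own what goes inside the dump; the orchestrator only needs to know where to pick it up and drop it off.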
So it can choose to process the data before a backup is taken. It can choose to only back up some of the data. You can do a lot of really clever things. And the other benefit of the way we've designed the contract is that the backup artifact is transferred after the unlock script is called, and that minimizes the downtime: the time that the API is unavailable. So now I'm going to talk about how we translated the contract into a real-world implementation, and that's BOSH Backup and Restore. We call it BBR. The orchestrator is the BBR binary, and the components are the BOSH jobs in a BOSH deployment. So the unit of backup becomes a BOSH deployment, which makes a lot of sense, because that's how operators think about the software that they deploy. The BBR binary is a CLI; it runs on a jump box, and it knows how to trigger backup and restore hooks on both BOSH deployments and BOSH Directors. So BBR backs up state: it'll back up Elastic Runtime and Cloud Foundry software, and it can back up a BOSH Director. But when you do a restore, it's like MySQL, where you have to stand up MySQL first and then put the data back in. So you stand up Cloud Foundry and then you restore the data. Earlier, Henry showed you where state is stored in Cloud Foundry. Now I'm going to walk through what happens when an operator backs up Cloud Foundry using BBR. I'm only showing a few components here, but they cover the range of implementations of the scripts. Okay, so you trigger a backup, and the first thing is that all the lock scripts get called. In this case, the Cloud Controller implements the lock script, and what that does is stop requests to the CF API. Next, BBR will call all the backup scripts, in no particular order, which are implemented on the jobs in Cloud Foundry. So the Cloud Controller will generate backup artifacts for its data in the database, in the CCDB, and for the blobstore. The Gorouter is an interesting case.
It only backs up one table. And Diego, because it can regenerate its data, doesn't back up anything at all. Then BBR will call the unlock scripts, and in the unlock script the Cloud Controller will restart the CF API. And then finally, BBR will copy the backup artifacts back to the VM where the BBR binary was triggered, typically a jump box. And then your backup is finished. Okay, so I'm going to show you a demo, but it's a video. I sped it up, and also I have really bad luck with live demos. This is backing up Elastic Runtime, which is Pivotal's Cloud Foundry. In this case, we're using the Ops Manager VM as a jump box because it has access to the ERT network. So we've passed in the BOSH Director IP, the login credentials, and the BOSH name of the deployment. You can see that BBR outputs quite a lot of useful information, like what it's doing at every step. And you can also see that the artifact gets copied after the unlock scripts are called. And then after the backup is taken, BBR checks each of the artifacts and creates a metadata file with a list of artifacts and checksums, and the time that the backup was started and finished. So BBR will only work with BOSH releases that have implemented the backup and restore scripts. And here are the releases that currently have implemented those scripts: the BOSH Director, CredHub, UAA, Elastic Runtime, and the work I talked about, where the Gorouter backs up its own table and the CCDB is backed up. That work is in progress, and when it's done, BBR will work against open-source Cloud Foundry. We're also working on BBR support for data services. So, Henry talked about the different concerns around backup and restore. I think of BBR as a small, sharp tool: it's the very kernel of how to take a correct backup and restore. There's a much bigger ecosystem with many other problems to solve.
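That checksum metadata is what lets you sanity-check a backup later without restoring it. A minimal sketch of the idea (file names invented; real BBR writes its own metadata format, so this only shows the principle):

```shell
#!/bin/sh
# After a backup: record a checksum per artifact, then verify them later.
set -eu
backup_dir=$(mktemp -d)
echo "ccdb contents"      > "$backup_dir/ccdb.tar"
echo "blobstore contents" > "$backup_dir/blobstore.tar"

# Create the metadata: one checksum line per artifact.
( cd "$backup_dir" && sha256sum ./*.tar > metadata.sha256 )

# Later, e.g. before attempting a restore: verify the artifacts are intact.
( cd "$backup_dir" && sha256sum -c metadata.sha256 ) && verified=yes
```

A bit-rotted or truncated artifact fails the check immediately, long before you have gambled a restore on it.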
So Stark & Wayne have written an integration for SHIELD. And then, really excitingly, Dell EMC is announcing a white paper today where they have done a POC in which BoostFS is mounted into the jump box, so that the backup artifact gets written directly to a Data Domain appliance. There are several benefits to this. One is that it's a more secure, straightforward workflow for moving backup artifacts to long-term storage. Also, Data Domain dedupes the artifacts, so you get a really powerful reduction in the amount of data storage required. The white paper is at that URL. And we have Jessica here from Dell EMC, and she would love to talk to you after; do you want to stand up? Thanks. She would love to talk to anybody about this after the talk. So, going forward, we'll be looking at how we need to extend the BBR contract to handle different circumstances, for example external data stores. One really tricky problem we're trying to solve right now is how to validate backup artifacts without running a full restore. Full restores are really expensive and kind of risky, so that's something we're focused on. We may extend the contract to support partial backup and restore, or incremental backup and restore. It's not a static contract. And then for BBR itself: at the moment we use BOSH SSH to drive the contract, and we're looking at using the BOSH agent, for example, or the errands functionality that's new in BOSH. We're also looking at optimizations like calling all the backup scripts in parallel instead of serially. And we're currently working on writing scripts to support external blobstores and also external data stores: databases and data services. So going forward, it's good to think about how to make Cloud Foundry itself more backup-able.
So I know that some of the teams writing backup scripts are implementing a read-only mode instead of making their API completely unavailable. Minimizing data references between data stores, so there's less fragility, less consistency required. And also thinking about storing data in Cloud Foundry in a different way: storing data as a series of actions that can be replayed, instead of having a snapshot of state to recreate the world. So we've talked about what it means to back up and restore Cloud Foundry, how people are solving that problem today, the contract that we've come up with to solve the how-do-you-back-up-a-distributed-system problem, our particular implementation of it, which is BOSH Backup and Restore, and what we're looking at for the future. We're part of the open-source Extensions incubator, so all of our repos are public, and we're in Slack and we're super friendly, so come say hello. So thank you for listening, and we've got a little bit of time for questions. Yeah, many times. Come down here. Yeah, I understand the requirement for locking the database before backing it up, but what about in a production environment? I mean, I cannot stop my environment for more than a few minutes, and a backup can take a long time if it's a large environment. So to be clear, what gets locked is the CF API. It's not downtime for your apps; it's downtime for pushing new bits. Yeah, right, but it means that no one can get the status of a CF application, push a new application, restage, or whatever it is. That's true, that's true.
Certainly for the internal blobstore backup, rather than taking a full copy of the blobstore, we use hardlinks on the blobstore machine. So we create some hardlinks, unlock the Cloud Controller as soon as the MySQL backup is done, which only takes a minute or two, and then, out of band, download all of those blobs. That minimizes the downtime. We've seen downtime of less than 10 minutes for very, very large blobstores. And the strategy we're pursuing for backing up external blobstores is actually very similar: we just take references to the blobs while the lock is held, for consistency, and then do any copying afterward. And one more question: in your experience, what's the right frequency of backup? So that is something that depends on how much data you're comfortable losing. If your app developers are pushing, you know, a thousand times a day, then you're going to want to take really frequent backups so that you don't lose too much. But if your app developers are only pushing once a day, then you can back up less frequently. I think a starting point is every day, but that is a decision that needs to be taken by the operator. Okay, thank you. Is there a plan to offer the backup and restore features to the app developers with the CLI tool? That's a really good question. So you could back up before you cf push new code that migrates data structures, and if it fails, you can restore. Yeah, so the Service Broker API has a backup and restore primitive, and, for example, with on-demand services that are offered via Pivotal, the unit of a service instance is a BOSH deployment. So if you run BBR against that BOSH deployment, then you effectively get the backup that's tied to your app. Does that make sense? Exposing this through Cloud Foundry is probably out of scope.
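The hardlink trick can be sketched like this, assuming GNU cp for the -l (hardlink) flag; the real logic lives in the blobstore's backup scripts:

```shell
#!/bin/sh
# Hardlink "snapshot" of a blobstore: near-instant, no data copied while the
# lock is held; the blobs can then be downloaded out of band after unlock.
set -eu
blobstore=$(mktemp -d)
echo "droplet bits" > "$blobstore/drop-0001"

# While the lock is held: create hardlinks, not copies. This is fast because
# only directory entries are created, regardless of blob sizes.
snapshot="$blobstore.snapshot"
cp -rl "$blobstore" "$snapshot"     # GNU cp: -l makes hardlinks

# After unlock: blobstores typically replace blobs with new files, so the
# snapshot's links keep pointing at the bytes as they were at backup time.
echo "newer droplet bits" > "$blobstore/drop-0001.tmp"
mv "$blobstore/drop-0001.tmp" "$blobstore/drop-0001"

cat "$snapshot/drop-0001"           # still the original contents
```

This works because a hardlink pins the original inode: replacing the live blob unlinks the name, not the data the snapshot still references.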
It is BOSH Backup and Restore, after all, so that's somewhat outside its scope. The short answer is that it's on the roadmap. Yes, I would love for backup and restore to be exposed to users. I think there was mention of something in the contract and the locking, that there was an order that could be specified. Is that something that is not yet used, or is it actually being used somehow in the backups now and I overlooked it? I was curious about that. Yeah, it is in use at the moment. So depending on what you're backing up, you might want to have certain components locked first. A good example: if you care about backing up a particular database that's changed by a CF app, then you might want the app to be stopped before the Cloud Controller is locked, after which you can't make any changes to app state. So that's one case where we're using it. I'm not sure there are any others. No, that is the only one at the moment. But in essence, it's arbitrary: you can say I want this BOSH release's lock script to be called before that BOSH release's lock script. Thank you very much.
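As a sketch of that lock-ordering idea: components declare "A must be locked before B", and the orchestrator reorders its lock pass accordingly. The declaration format below is invented for illustration; BBR's real mechanism is metadata shipped with the release, and its exact shape may differ.

```shell
#!/bin/sh
# Components declare lock-ordering constraints; the orchestrator moves the
# constrained components to the front of the lock pass.
set -eu
lock_log=$(mktemp)
components="cloud-controller app-process blobstore"

# Invented constraint, in the spirit of the talk's example: the app must stop
# changing data before the Cloud Controller locks.
lock_before="app-process:cloud-controller"

# Build the lock order: first everything that must lock early, then the rest.
lock_order=""
for pair in $lock_before; do
  first=${pair%%:*}
  lock_order="$lock_order $first"
done
for c in $components; do
  case " $lock_order " in
    *" $c "*) ;;                          # already placed by a constraint
    *) lock_order="$lock_order $c" ;;     # keep default position
  esac
done

for c in $lock_order; do echo "lock $c" >> "$lock_log"; done
cat "$lock_log"
```

With the constraint in place, app-process is locked first even though it was listed second, which is exactly the guarantee the questioner asked about.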