Hello, and welcome to Burning Down the House. This will be a quick tutorial on how you burn down your data center and, hopefully, get it put back. This is Therese, I'm Jatin. We are from the backup and restore team at Pivotal, and we are here to talk about the backup and restore of Cloud Foundry. In this talk, we are gonna cover how you can back up and restore Cloud Foundry, the current approaches for doing that, the current issues, our solution to the problem, and the future of backup and restore for Cloud Foundry. So why back up and restore Cloud Foundry in the first place? Companies run mission-critical workloads on Cloud Foundry, and part of running mission-critical workloads is planning for when things go wrong. We have seen that one of the first things operators do when they install Cloud Foundry in their organization is install some kind of backup and restore solution. App downtime is a pretty big deal for companies, as it can lead to loss of revenue or customers. And in a lot of organizations, Cloud Foundry is actually embedded into the development life cycle, so when Cloud Foundry goes down, you actually lose the ability to develop your applications. So what can go wrong? What are people planning for? There can be things like recovering from a failed upgrade, where your upgrade has left your foundation in a bad place and you want to go back to a point in time. It can be things like user error, in which an administrator accidentally deletes critical information. An example of that is the GitLab meltdown, in which requests could not be served because parts of the needed data were accidentally deleted. There can be issues like data corruption, and nowadays there are security vectors like ransomware which can affect applications. By the way, the easiest fix for a ransomware situation is to do a repave and put your data back. And we also have failures like a SAN corruption, or when your data center floods.
We are gonna target the data center failure as our use case because it's the superset of all the use cases that we see over here. So what does it mean to back up Cloud Foundry? Cloud Foundry itself is a complicated distributed system, and to back it up we essentially need to extract the state that it has accumulated over its running life cycle, so that when we put the state back and flip the switch back on, it resumes operations. What it really means is: get enough information out of your installation so that when you put it back at restore time, your apps come back up. That begs the question, where does the state in Cloud Foundry live, and what state do we need to capture to create an effective backup of Cloud Foundry? To explore that, let's look at how Cloud Foundry stores its state during its life cycle. Imagine a Cloud Foundry with just two components, a Cloud Controller and a few Diego cells. When an app developer pushes their application, Cloud Foundry will insert metadata about the application in the database and save the source of the application itself in a blob store. As you can see, this subset of Cloud Foundry uses two data sources: a database to store records and a blob store to save files. When Cloud Foundry is deployed, it is pre-populated with buildpacks. Buildpacks are programs which convert source code into containers. After that, the Cloud Controller will ask one of the Diego cells to stage the application, which will fetch the buildpack and the source from the blob store and create a droplet. Then it will save the droplet metadata in the database and the actual droplet in the blob store. In the last stage, the Cloud Controller will ask one of the Diego cells to run the application, which will pull the droplet out of the blob store and start running it.
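Seen from the app developer's side, the whole flow above is driven by a couple of cf CLI commands. This is a minimal sketch; the app name and source path are made up, and the commands are guarded so the sketch is runnable even without a cf CLI installed.

```shell
#!/usr/bin/env bash
# Sketch of the push-and-stage flow described above, from the developer's
# side. App name and source path are placeholders.
set -euo pipefail

push_app() {
  # cf push uploads the source (saved in the blob store) and records app
  # metadata (saved in the database); staging then produces a droplet.
  cf push myapp -p ./myapp-src
  cf app myapp   # shows the app's state once staging completes
}

if command -v cf >/dev/null 2>&1; then
  push_app
else
  echo "cf CLI not installed; skipping push"
fi
```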
So what we can see here is that there is some referential integrity between the blob store and the database. There are references going out from the database to the blobs, and if we do a restore and these records don't line up, your apps might not come back. These are not the only two components in Cloud Foundry; there are other components, and they have their own data sources. And those data sources might have similar referential integrity concerns across data stores. So if we put back all the state that we saw before, do we have enough information to do a restore? Well, the state that we saw before is just part of the puzzle. There is also Cloud Foundry itself, the software which runs and deploys your applications. So where does that live? To find that out, we need to go deeper. Cloud Foundry is deployed and managed on a control plane provided by BOSH. BOSH is responsible for actually deploying Cloud Foundry, and all the information about the software is stored in BOSH. So let's take a look at what the data flow looks like in a BOSH director. The BOSH director also has two data sources: a record store and a blob store. When an operator uploads a release to the BOSH director, the director will save some metadata about the release in a database and save the actual software in a blob store. When the operator asks BOSH to deploy that software, it will save some metadata about the deployment in its record store. After that, it will fetch information about the releases and the deployment from the record store, get the actual software from the blob store, and create a VM out of it. So we see this pattern again, where we have software trying to get information which is stored across data stores, and there is some referential integrity which spans different components. BOSH is also used to deploy other things, like data services.
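The upload-and-deploy flow just described maps onto two BOSH CLI commands. This is a sketch using the bosh v2 CLI; the release tarball, deployment name, and manifest path are placeholders, and the commands are guarded so the sketch runs without a bosh CLI installed.

```shell
#!/usr/bin/env bash
# Sketch of the director data flow described above, from the operator's
# side. Release tarball, deployment name, and manifest are placeholders.
set -euo pipefail

deploy_release() {
  # Metadata about the release goes into the director's database; the
  # release tarball itself goes into the director's blob store.
  bosh upload-release my-release-1.0.0.tgz
  # The director combines the deployment record (record store) with the
  # uploaded software (blob store) to create VMs.
  bosh -d my-deployment deploy manifest.yml
}

if command -v bosh >/dev/null 2>&1; then
  deploy_release
else
  echo "bosh CLI not installed; skipping deploy"
fi
```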
Apps actually connect to these data services, which are deployed on BOSH, to save user data in them. So if you want to take an effective backup of the entire platform, the data landscape kind of looks like this. So what do we do when parts of this catch on fire? When we started the backup and restore initiative, we looked at what companies currently do to solve this problem, and we realized there are a lot of localized solutions. The first class of tools that we discovered was tools that know a lot about where state in Cloud Foundry lives. These are tools which know which databases are used by which components in Cloud Foundry; they'll do SQL dumps, or just copy the files which they think are required to create a good backup of Cloud Foundry. The primary problem with tools like this is fragility. Cloud Foundry, for software of its size, moves pretty frequently, and whenever it changes how it stores data internally, these tools start to break. This means that to create a tool which will always back up a Cloud Foundry, you need to teach the tool about a lot of versions of Cloud Foundry and how data was represented across those versions. The other thing is consistency. Tools like this cannot just back up individual databases; they need an effective snapshot across those databases. Which means that they somehow have to stop the Cloud Controller to take a consistent backup across these data stores, so that when the restore happens, the records match up. Which means that these tools are susceptible to downtime. The other class of solutions is running two Cloud Foundries, in an active-active or active-passive mode, in which you essentially build tooling around pushing applications to two foundations and trying to keep the state in them consistent.
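To make the fragility concrete, here is a sketch of what that first class of tools ends up looking like: a hard-coded map from CF version to internal database names. The version numbers and database names here are illustrative, not the real ones; the point is that the map has to track every release.

```shell
#!/usr/bin/env bash
# Sketch of a "reach under the hood" backup tool. The version-to-schema
# map is the fragile part: it must be updated for every CF release that
# changes how data is stored. Versions and DB names are illustrative.
set -euo pipefail

CF_VERSION="${CF_VERSION:-250}"

case "$CF_VERSION" in
  2[0-4]?) DBS="ccdb uaa_db" ;;   # older layout (made up)
  25?)     DBS="ccdb uaadb" ;;    # renamed in a later release (made up)
  *)       echo "unknown CF version $CF_VERSION; tool needs updating" >&2
           exit 1 ;;
esac

for db in $DBS; do
  # --single-transaction gives a consistent dump of one database, but
  # nothing keeps the dumps consistent with each other or the blob store.
  echo "would run: mysqldump --single-transaction $db > $db.sql"
done
```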
So the issue with this is it's really difficult to get right, because you now have a consistency problem across Cloud Foundries. And it essentially doubles your cost, because you are running two production-like environments. And it does not mitigate against things like the malware vectors that we saw before. The other class of solutions people are trying out is IaaS snapshots. Some IaaSes will give you functionality to snapshot the persistent disks attached to instances. The issue with this is it's like pulling data out from underneath the application: not everything might have been flushed to disk when the snapshot was taken. And the other interesting bit is that not all state is needed for a successful backup; only a subset of the information on disk might be required to do a successful restore. The other, strangely popular, option is to do nothing. A lot of operators script out Cloud Foundry configuration, like creating orgs, users, and spaces. In case of a disaster, you would create this structure again and ask your app developers to re-push their applications. The problem with this is that the recovery time might be really long, because not all teams have really good pipelines to push their apps in again. The other common themes we see across organizations are people worrying about things like: how do they do artifact management? How do they encrypt the backups that they get? When should they take backups? How long should they retain the artifacts? Should they do complete backups or partial backups, meaning, of all the components we saw before, should they back up all of them all the time, or can they get away with backing up some of them some of the time? Should they do incremental backups? Which leaves us with a problem space that looks kind of like this. Therese will take us through how we have been solving this problem.
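The "do nothing" strategy usually boils down to a script like the following, replayed after a disaster. The org, space, and user names here are made up; note that this recreates structure only, and still depends on developers re-pushing their apps.

```shell
#!/usr/bin/env bash
# Sketch of scripted Cloud Foundry structure, replayed after a disaster.
# Org, space, and user names are placeholders.
set -euo pipefail

recreate_structure() {
  cf create-org engineering
  cf create-space dev  -o engineering
  cf create-space prod -o engineering
  cf set-org-role alice@example.com engineering OrgManager
}

if command -v cf >/dev/null 2>&1; then
  recreate_structure
else
  echo "cf CLI not installed; structure would be recreated here"
fi
```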
So we took everything we learned, we did a lot of customer research and a lot of research into current solutions, and we came up with a new model of backup and restore to try to address the issues that we found. Starting from this problem space, we divided the issues into two categories. On the one hand, there's how to orchestrate the backup. That includes all the concerns around the backup: consistency across backups, encryption, scheduling, artifact management, things like that. The other category is how to actually do the backup and the restore: things like database versions, what data needs to be backed up, and what the correct procedure for a backup or a restore is. We assigned these two categories roles. The first role is the orchestrator, and the other role is the component. And we've crafted a contract between these two roles. So we have a backup and restore framework based on a contract that sets out the requirements for the orchestrator and the component, and now I'm going to go into the details of that contract. The orchestrator expects the component to implement hooks to do the backup and restore. The backup hooks are lock, backup, and unlock, and the first requirement on the orchestrator is that it has to trigger these hooks in this prescribed order. The only hook for restore is restore. The orchestrator is responsible for moving the backup artifacts away after a backup is taken, and for putting the artifacts in the right place before a restore. The other side of the contract is the component. If a component wants to be backed up and restored, it has to implement the backup and restore hooks: lock, backup, unlock, and restore. Each of those hooks is actually optional, so the component implements the hooks that are appropriate to it. If it's a database that needs to have writes stopped, it might implement lock.
Then backup would be, you know, pg_dump or something, and unlock would start the database up again. And if nothing needed to be backed up, it wouldn't implement any scripts at all, and that would be fine; nothing would happen. The component, on its side of the contract, has to put the backup artifact in a particular location so that the orchestrator can find it, and take the artifacts to be restored from a particular place, where the orchestrator will have put them. So here it is; it's a pretty simple contract. The contract separates backup orchestration from the actual backup and restore logic. The backup hooks are written by component authors, who are best placed to determine which parts of component state are relevant for a backup and what the correct behavior for a backup and restore is. Because the hooks are packaged with the component, they can change as the component changes, so they should never be out of sync with how the data is stored. You'll see that the contract addresses the fragility and compatibility issues that we saw with tools that reach under the hood, by encapsulating the backup and restore knowledge in the component. There are other benefits to this contract. Firstly, consistency. Asking the component to do its own backup means that it can flush data to disk, stop database writes, whatever it needs to do to take a consistent backup. And locking across components prevents state mutations during backup, enabling, again, consistent backups. Secondly, backup and restore can be smart, processing and filtering data in any way that makes sense. The backup script might do encryption, a restore script might regenerate credentials, and the orchestrator actually doesn't care what the script does, as long as it abides by the contract.
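As a concrete sketch of the component side, here is what the four hooks might look like for a hypothetical Postgres-backed component. The database name and the `BBR_ARTIFACT_DIRECTORY` variable are illustrative assumptions; under the contract, each hook is optional and the component implements only what it needs.

```shell
#!/usr/bin/env bash
# Sketch of the component side of the contract: the four optional hooks.
# Database name and BBR_ARTIFACT_DIRECTORY are illustrative assumptions.
set -euo pipefail

ARTIFACT_DIR="${BBR_ARTIFACT_DIRECTORY:-/tmp/bbr-artifact}"

lock() {
  # Stop writes so the dump below is consistent with everything else
  # being backed up at the same time.
  echo "stopping writes"
}

backup() {
  # Drop the artifact where the orchestrator expects to find it.
  mkdir -p "$ARTIFACT_DIR"
  pg_dump --clean exampledb > "$ARTIFACT_DIR/exampledb.sql"
}

unlock() {
  # Resume normal operation; the orchestrator copies artifacts off
  # afterwards, so downtime ends here.
  echo "resuming writes"
}

restore() {
  # Replay the artifact the orchestrator placed back for us.
  psql exampledb < "$ARTIFACT_DIR/exampledb.sql"
}
```

The orchestrator would call lock, backup, unlock in that order, and restore on its own during a restore.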
Another benefit of the contract is that the artifact transfer happens after the unlock step, which minimizes the amount of downtime that a lock might incur. So I've talked about the contract in the abstract. We have implemented the contract as BOSH Backup and Restore. In our implementation, the orchestrator is a binary called BBR, a component is a BOSH job, and the unit of backup is a BOSH deployment. You'll notice that we're tied to, and have leveraged, BOSH concepts. Why is that? All components in Cloud Foundry are BOSH deployments, the BOSH director is itself deployed from a BOSH release, and for an operator, a BOSH deployment is the logical unit of a backup. So the orchestrator is the BBR binary. It's a CLI that runs on a jump box, and it knows how to trigger the backup and restore hooks on both BOSH deployments and BOSH directors. BBR can back up the state and the software in Cloud Foundry, but a restore doesn't do a redeploy; it's kind of like a MySQL restore. And we are very excited to announce that we have submitted BBR to CF-Extensions to be open sourced. Okay, so earlier Jatin showed you where state is stored in Cloud Foundry. What I'm gonna do now is walk through how BBR backs that state up. So there's our operator, who calls backup, and then BBR will first identify any lock scripts. The Cloud Controller implements the lock script, and what it does is stop the CF API, to avoid mutating state during the backup. Then BBR will call all of the backup scripts. The backup scripts will generate backup artifacts for each job in the deployment that implements backup. For Cloud Foundry, we'll end up with the UAA database, CCDB, the blob store, and some other things. Then BBR will call the unlock scripts. And finally, BBR will copy all the artifacts back to the jump box, and the backup is finished. Okay, so I'm gonna give you a demo. Me and live demos, not so good, so it's a video. It's sped up, but it is the real thing.
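Driving a deployment backup from the jump box looks roughly like this. The director address, username, and deployment name are placeholders, and the flag names follow the BBR CLI as we understand it, so check `bbr --help` for your version.

```shell
#!/usr/bin/env bash
# Sketch of triggering a deployment backup from a jump box. Director URL,
# username, and deployment name are placeholders; flags may vary across
# BBR versions.
set -euo pipefail

cmd=(bbr deployment
  --target https://10.0.0.6:25555   # BOSH director API (placeholder)
  --username admin
  --deployment cf
  backup)

if command -v bbr >/dev/null 2>&1; then
  "${cmd[@]}"
else
  echo "would run: ${cmd[*]}"
fi
```

The artifacts end up on the jump box, where whatever artifact-management tooling you use can pick them up.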
Here we go. I hope it's running, great, okay, and I hope it's big enough to see. So we're backing up the BOSH Director here. You can see, I'm gonna come over here so I can see, it identifies which scripts are there. It's backing up CredHub, the Director, the Director database, the blob store, and UAA. Then it does checksums and validity checks, and then it creates a tar file. And if you look, it also creates a metadata file which lists each of the artifacts that it created, and also the start and finish time of the backup. And then you can see that we ran restore, and we've put all that data back into BOSH. Now, BBR can only back up and restore things that implement the backup and restore scripts, and here are the releases that we have at the moment: we can back up and restore the BOSH Director, CredHub, UAA, and Elastic Runtime, which is Pivotal Cloud Foundry. Coming soon we'll have backup and restore scripts for open source Cloud Foundry and for data services. The problem that BBR is trying to solve is the inner core of backup and restore: creating the backup artifact and putting it back. As we talked about in our earlier slide, there are a lot of other concerns around backup and restore: encryption, scheduling, artifact management, a lot of other things. Well, we have partners, and we're working with them to integrate BBR with external data stores and with automation that does things like scheduling, encryption, and artifact management. And we really do welcome collaboration, so please get in touch if you're interested in integrating BBR with your backup and restore product. We're looking very actively at how we will need to extend the BBR contract to support different backup and restore scenarios and to improve operability.
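The director backup and restore shown in the demo correspond to commands roughly like these. The host, SSH key path, and artifact path are placeholders, and as before the exact flags may vary by BBR version.

```shell
#!/usr/bin/env bash
# Sketch of the director backup and restore from the demo. Host, key
# path, and artifact path are placeholders.
set -euo pipefail

backup_director() {
  bbr director --host 10.0.0.6 --username jumpbox \
    --private-key-path ./jumpbox.key backup
}

restore_director() {
  # Restore takes the artifact produced by backup and puts the state
  # (CredHub, director DB, blob store, UAA) back into the director.
  bbr director --host 10.0.0.6 --username jumpbox \
    --private-key-path ./jumpbox.key \
    restore --artifact-path ./director-backup.tar
}

if command -v bbr >/dev/null 2>&1; then
  backup_director
else
  echo "bbr not installed; skipping director backup"
fi
```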
So we're looking at being able to run BBR against multiple deployments, and at supporting incremental backups, point-in-time recovery, and external data stores, and then at having a way to validate backup artifacts without running a restore, because that's a very expensive thing to do. As we extend the contract, we'll build BBR out to match the contract extensions. BBR currently uses SSH to tunnel into the jobs and run the scripts; what we're looking at is triggering those with a BOSH agent in the future. We're also looking at optimizations like calling all of the backup scripts in parallel instead of in serial. When we think about the future of backup and restore for Cloud Foundry, it's really about making Cloud Foundry more backup-able: doing things like adding read-only modes to minimize the downtime, and trying to minimize the number of references across data stores. And then in the future, and this is a pretty exciting vision, is to change the way that state is stored in Cloud Foundry so that it's an event stream: you can replay all the events that have happened to regenerate the state. So: we've talked about Cloud Foundry and what it means to back up Cloud Foundry, the state and the software. We've talked about the current approaches that we've seen and the issues with those approaches. We've talked about the BBR contract that we've created, our implementation, BBR, and what backup and restore might look like in the future. So I want to say thank you; I know it's the end of the day and you'll probably want some adult drinks soon. We're gonna answer questions now. I will also say that we'll be taking part in the CF Extensions office hours tomorrow at 12:10. And hot off the press, we have BBR stickers; I think every fashionable laptop wants one. Okay, so let's have questions.
I have to say, this is the best round of applause that we've had. Pretty good, all right. Can you state your name? Matt. I'm curious what sort of duration you're seeing on the locks for Cloud Foundry as you back it up; how long is it in a locked state during a backup? We can't hear you, use the mic. It depends upon the data that is there in MySQL and the blobstore. That's right. How many terabytes, he's asking. We have not tried terabytes yet. So it depends on how big the MySQL data store is. For the NFS blobstore, we can do stuff like hard links, where we just link the blobstore into another directory so that the backup utility can copy the blobs off. So the primary lock-time driver for us right now is the MySQL dump, and we are seeing times in minutes, like five to six minutes. Any other questions? So, kind of interesting: I know that down the line you're looking at doing the data services as well as the components. We have a lot of experience where I work with the MySQL data service, and I know that they actually changed the way that they're offering backups on that front: they used to do a MySQL dump, and now they're actually backing up the actual data files themselves. Is there some sort of cross-team collaboration there to see why they went down that path? Because you guys are still doing the MySQL dumps on Cloud Foundry, and obviously those could be much smaller than what is potentially on the data services, but I don't know if that could speed up times: instead of using mysqldump, actually backing up the data files themselves. So, the current implementation of the scripts for Pivotal Cloud Foundry don't use mysqldump; they use the MySQL backup mechanism provided by the MySQL team.
So the whole point of the contract is that it will be the MySQL team who writes the backup and restore scripts, so the decision about how to back it up will be their decision. More questions? Yeah. So when you implement a lock for a specific component, does that lock also propagate to things that may be posting data to those things? So if you lock a database as part of a component, will it also lock things that are trying to write to that component as well? No? No. Okay, one more? All right, last question; everybody wants to get that little drink you mentioned. Yeah, two questions actually: which version of PCF is it coming in, and is it replacing CF Ops? 1.11, and over time, yes. Thank you, everybody. Thank you, Therese. Thank you.