Hello. Oh, that worked. Good morning, good afternoon, hi everybody. Sorry for the slightly late start. We are the Platform Recovery team from Pivotal. Today we're going to talk about some exciting updates to our BOSH Backup and Restore framework. The first part of the talk will be a brief product overview, so please don't be scared if you have never heard of BBR or used BBR before.

Okay, so to start the conversation, let's talk about what data is in Cloud Foundry. What does it mean to back up the deployment and the BOSH Director? In Cloud Foundry, we have individual components, like the Cloud Controller or the UAA, that store their data in a database. It can be the internal MySQL, or it can be an external database that the operator has configured. We also have staged applications, what we call droplets, which are stored in the blobstore. Similarly, in BOSH, the Director stores its data in a database, along with compiled releases and packages.

So now you may ask, what can go wrong? Why do I need a plan to recover the platform? We can have failures at the software level. For example, you have skipped a bunch of versions of cf-deployment, you upgrade, and the upgrade has failed. Or it can be user error: somebody touched a database that they should not have, and something went wrong because of that. Or security issues: some kind of attack has polluted your data stores, and now you need to roll back.

Another set of issues is related to hardware. Under a true disaster, your data center has burned down. You can't just sit there; you have to bring the platform back again at a different site. Or it can be a planned hardware upgrade: you upgrade to faster machines, better CPUs. Actions like that can generally introduce platform downtime and risk as well.

So our solution, the BOSH Backup and Restore framework, is essentially a contract between the BOSH Backup and Restore binary and individual stateful BOSH releases.
So the BOSH Backup and Restore binary, what we call BBR for short, is a CLI tool that orchestrates the backup and restore workflow, and at the same time it provides hooks so that individual components can implement their own backup and restore scripts.

Let's go a bit deeper. What are the responsibilities of the BOSH Backup and Restore binary, the orchestrator? The BBR CLI expects the components to implement certain scripts to do the backup and restore, and as part of the contract, during a backup or a restore, BBR triggers the component scripts in a prescribed order. For a backup, the order is pre-backup-lock, which stops the component; backup, which backs up the component and creates a backup artifact; and then post-backup-unlock, which starts the component again. For a restore, the order is pre-restore-lock, restore, and then post-restore-unlock. During a backup, BBR is also responsible for collecting all the backup artifacts from the individual remote VMs down to your local machine, or the other way around during a restore.

Yes. So the other side of the contract is the components. If a component wants to be backed up and restored by BBR, it needs to implement the BBR scripts that we talked about on the previous slides. The component also has the freedom to implement scripts according to its needs. For example, if a component does not need to be locked during backup or restore, it can skip the locking scripts and only implement the backup scripts.

To enable individual components to write backup and restore scripts, our team, the Platform Recovery team, also develops the backup-and-restore SDK release. This is an open-source BOSH release that knows how to back up different databases and different external blobstores, and individual CF releases usually use our SDK release as a utility package inside their backup and restore scripts.
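To make the script contract concrete, here is a minimal, self-contained sketch of the three backup-phase scripts a hypothetical component might ship. The layout mirrors the convention of jobs shipping scripts under a `bin/bbr/` directory, and BBR passing the artifact location in the `BBR_ARTIFACT_DIRECTORY` environment variable; the component behavior, the echoed messages, and the `dump.sql` file are all invented for illustration.

```shell
set -eu
# Stand-in for a job directory on a deployment VM.
job=$(mktemp -d)
mkdir -p "$job/bin/bbr"

cat > "$job/bin/bbr/pre-backup-lock" <<'EOF'
#!/bin/sh
# Stop the component so its state cannot mutate mid-backup.
echo "locked"
EOF

cat > "$job/bin/bbr/backup" <<'EOF'
#!/bin/sh
# BBR supplies $BBR_ARTIFACT_DIRECTORY; write the backup artifact there.
echo "component data" > "$BBR_ARTIFACT_DIRECTORY/dump.sql"
EOF

cat > "$job/bin/bbr/post-backup-unlock" <<'EOF'
#!/bin/sh
# Start the component again.
echo "unlocked"
EOF

chmod +x "$job"/bin/bbr/*

# Drive the scripts in the prescribed order, as the orchestrator would.
BBR_ARTIFACT_DIRECTORY=$(mktemp -d)
export BBR_ARTIFACT_DIRECTORY
"$job/bin/bbr/pre-backup-lock"
"$job/bin/bbr/backup"
"$job/bin/bbr/post-backup-unlock"
ls "$BBR_ARTIFACT_DIRECTORY"
```

A component that needs no locking would simply omit the lock and unlock scripts and ship only `backup`.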
We also provide a testing framework, what we call the Disaster Recovery Acceptance Tests, DRATS for short, which exercises a full BBR backup and a BBR restore of a given CF deployment. This is to help individual release teams test their backup and restore scripts before shipping.

Yes, let's dive even deeper. What is happening when you use BBR to back up a given deployment? Let's use cf-deployment as an example. Imagine that you're an operator and you run a BBR backup of cf-deployment from your jumpbox. The BBR binary will then start an SSH session onto all of the VMs that belong to your CF deployment: the UAA, the Cloud Controller, the router, depending on your setup. It will then look for all of the backup-related scripts. After finding those scripts, it will execute them in the prescribed order that we talked about on the previous slides: all of the pre-backup-lock scripts first, then all of the backup scripts, which create the backup artifacts, and then all of the post-backup-unlock scripts. After those scripts finish, BBR drains the artifacts that were created on those remote individual VMs in your CF deployment down to the jumpbox.

Yes. So what does the contract give us? What are the benefits of separating the orchestration of the workflow from the actual backup and restore scripts? First of all, under this contract, release authors are the ones writing the backup and restore scripts. They are the experts in how their releases should be backed up and restored. And those backup and restore scripts are now packaged inside the releases, so they will always be compatible with whatever latest updates go into the releases. Another thing is that the backup is now taken at a logical level rather than a hardware level. We are not taking a snapshot of your RDS instance or your blobstore, so this workflow is consistent across different IaaSes.
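The run-the-scripts-then-drain workflow just described can be sketched locally. Everything here is simulated: the "VMs" are temp directories, the job names and artifact contents are invented, and there is no SSH, so this only illustrates the overall shape of a BBR backup, not the real tool.

```shell
set -eu
# Simulate three deployment "VMs", each shipping a backup script that
# takes an artifact directory as its argument.
deployment=$(mktemp -d)
for vm in uaa cloud_controller router; do
  mkdir -p "$deployment/$vm"
  printf '#!/bin/sh\necho "%s data" > "$1/blob"\n' "$vm" > "$deployment/$vm/backup"
  chmod +x "$deployment/$vm/backup"
done

# Orchestrator side: run every VM's backup script, then drain each
# resulting artifact down to the "jumpbox".
jumpbox=$(mktemp -d)
for vmdir in "$deployment"/*; do
  artifact=$(mktemp -d)
  "$vmdir/backup" "$artifact"
  mkdir -p "$jumpbox/$(basename "$vmdir")"
  cp "$artifact/blob" "$jumpbox/$(basename "$vmdir")/"
done
ls "$jumpbox"
```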
Operators don't have to learn new things when they deploy the foundation onto a new infrastructure. We also have this concept of locking, which stops stateful components before taking the backup, and that preserves consistency across different data stores. Release authors also have a lot of freedom in what they can do in their backup and restore scripts. They can make their scripts very smart: they don't have to back up all of the data, they can choose to ignore some of it, they can add metadata to their backup artifact, they can encrypt the data. The same goes for restore scripts.

So, as Chenyu mentioned, we implemented a test framework for backup and restore. It's a pattern that's applicable to any release that's writing scripts. The pattern is: you put some data into the data store, you run a backup, then you remove that data from the data store, then you run a restore, and then you make sure that the data you put in in the first step is put back. We've done this to help release teams who are building backup and restore scripts. For example, for cf-deployment there's DRATS, and all of the releases in cf-deployment that implement BBR scripts have also implemented test cases. Then we have the same framework for ERT/PAS, which is PDRATS, and again, all of the releases inside ERT have implemented those test cases.

Now I get to talk about what's new. The diagram on the right is the entire BBR framework. You've got the CLI; you've got the deployments that implement the scripts that BBR calls; then you have the SDK, which helps a lot of those releases implement their scripts. We have new things in each of those. In the BBR CLI, we've introduced calling the scripts in parallel: all of the lock scripts get called in parallel instead of in serial, and likewise all of the backup scripts and the unlock scripts. As Chenyu mentioned, while the lock is held, state can't be mutated.
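Before moving on: that put-data, back-up, delete, restore, verify pattern can be shown end to end with a toy "data store" (just a directory) and tar as the backup mechanism. This is not the actual DRATS code, only the shape of the test it runs.

```shell
set -eu
store=$(mktemp -d)    # stand-in for the data store under test
backup=$(mktemp -d)

echo "test-user" > "$store/record"             # 1. put some data in
tar -C "$store" -cf "$backup/artifact.tar" .   # 2. run a backup
rm "$store/record"                             # 3. remove the data
tar -C "$store" -xf "$backup/artifact.tar"     # 4. run a restore
grep -q "test-user" "$store/record" && echo "data is back"  # 5. verify
```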
What releases typically do is stop their API; for CAPI, that means you can't do a cf push. So when we say we reduce downtime, what we mean is that because of the parallelization, the CF API is locked for less time. The other thing: after the backup is run, we drain the backup artifacts back to the jumpbox, or to a Concourse worker, wherever the BBR CLI triggered the backup from. The draining is done in parallel as well, and that just reduces the overall time that the backup takes.

We've also added the ability for an operator to specify an artifact path. When you run BBR on the jumpbox, you can say, I want to put the backup artifacts in this particular place. What that enables you to do is perhaps mount a really big disk at that folder, or some long-term storage, so you can put your data exactly where you want to in one step.

The middle part of the BBR framework is the releases that have implemented support for BBR: implementing lock, backup, unlock, and so on. The really exciting thing is that cf-deployment can now be backed up and restored via BBR. Also, we've created ops files for external database and blobstore support, which means that if you set up cf-deployment with an S3 blobstore and an external database, you can still back up and restore cf-deployment. It's also very exciting to announce that the PKS API has built BBR scripts, and so has PCF Redis.

The SDK is the piece that a lot of releases that have built BBR scripts use to implement backup and restore. As we extend the functionality in the SDK, any release that uses the SDK gets that functionality for free. We've added support for external databases with TLS (those are the versions of the databases that we support), and also external S3 blobstore support.

Yes, so as was just mentioned, we have recently shipped our solution to back up and restore versioned S3 blobstores.
This is enabled by our backup-and-restore SDK release, and our solution leverages the versioning functionality of S3. When the operator runs a BBR backup, we take a list of the latest version IDs of all the blobs and record them in the backup artifact. During the restore, we simply put all the version IDs that we recorded in the backup artifact back as the latest versions. This way of backing up and restoring the blobstore does not involve downloading the actual content of the blobstore, and the backup artifact does not contain the content of the blobstore; instead it contains only a file with a list of version IDs. That means the backup artifact will be small, which is great for foundations with petabytes of blobs. The jumpbox or the Concourse worker that you're running the BBR command from can also stay small, and doesn't need to have a disk as big as your blobstore.

We also leverage the concept of replication in S3. When you set up bucket replication in S3, version IDs across the source bucket and the replicated bucket are kept in sync, so that under a true disaster where you can't access your original buckets at all, you still have the option to restore from the replicated buckets.

We are aware that a lot of our customers use S3-compatible storage that does not offer versioning, so we have recently added support for backing up and restoring unversioned S3 blobstores as well. Here we ask the operator to provide us with backup buckets. During a BBR backup, we copy blobs from the original buckets to the backup buckets that the operator provided, and during a BBR restore, we restore from the backup buckets to whatever buckets the operator wants to restore to. We are parallelizing the copying of the blobs to optimize for speed. So compared to the previous solution for versioned blobstores, yes, we are copying the blobs during backup and restore; however, we're not downloading the blobs.
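As a quick aside, the version-ID bookkeeping for the versioned-blobstore backup can be simulated with plain files. Here each blob is a directory whose `latest` file plays the role of S3's current-version pointer; the blob names and version IDs are invented, and no AWS API is involved.

```shell
set -eu
bucket=$(mktemp -d)
# Two blobs, each with a "latest version" pointer.
mkdir -p "$bucket/app.tgz"      && echo v1 > "$bucket/app.tgz/latest"
mkdir -p "$bucket/manifest.yml" && echo v7 > "$bucket/manifest.yml/latest"

# Backup: record the current latest version ID of every blob.
# No blob content is copied or downloaded.
artifact=$(mktemp)
for blob in "$bucket"/*; do
  echo "$(basename "$blob") $(cat "$blob/latest")" >> "$artifact"
done

# The blobstore mutates after the backup was taken...
echo v2 > "$bucket/app.tgz/latest"

# Restore: re-point each blob's latest version to the recorded ID.
while read -r name version; do
  echo "$version" > "$bucket/$name/latest"
done < "$artifact"
cat "$bucket/app.tgz/latest"   # back to v1
```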
We are copying blobs between buckets, so your blobs are not leaving your storage. And again, this approach means that your backup artifact stays small, your jumpbox and your Concourse worker can also stay small, and it minimizes network traffic.

The work that's in progress at the moment: we are building out the test framework for BOSH. bosh-deployment and the BOSH Director that ships in PCF are starting to use the SDK, and that means that when you configure them with an external database or an external blobstore, you can back them up and restore them using BBR. We are also building support for GCS and Azure external blobstores.

Yes, I get to talk about what's coming up. For the BBR CLI, we are looking at how we can support operators in backing up and restoring all of their on-demand service instances. I don't know if you know about on-demand services, but each service instance is a separate BOSH deployment. So we are adding functionality to the BBR CLI so that an operator can back up and restore the data in all of the on-demand service instances. We're also doing some very exciting work helping operators with backup artifact validation; I'll talk about that in a minute. We're adding artifact encryption, and also the possibility to run BBR through a proxy instead of SSHing in directly. As for the deployment scripts that are coming up: as I mentioned, bosh-deployment is going to be using the SDK, and there will be ops files for specifying an external blobstore or external database. And PKS on-demand clusters are also implementing BBR scripts.

Okay, so this is the backup artifact validation I mentioned. We've talked to a lot of operators, and the best way to validate that your backup is valid is to do a full restore. And that's a really expensive, time-consuming, risky operation.
So we have been thinking very hard about how to solve that problem: how to enable operators to have confidence in their backup artifacts without running a full-blown restore. We've got two solutions, and I think we'll do both of them.

This first one we just patented, and we'll donate that patent to the Cloud Foundry Foundation, but it's super exciting. The way it works is that the backup script puts out a set of metadata as well as the backup artifact, and the metadata describes the contents of the backup artifact. For example, if it was UAA, it could potentially say: there are 50 users, and the third one is called Max. Then we create a scaled-down PCF that doesn't have internet access, and we run the restore into that scaled-down PCF. But it's not a real restore; you're just putting the data back into the databases and writing a verify script. Each component, like UAA, would write a verify script that reads the metadata from the backup artifact and queries the MySQL database, or whatever, to make sure that there really are 50 users and the third one is Max. And then you can just destroy that environment afterwards. So it's completely risk-free, it's a lot less expensive, and it gives you much higher confidence in your backup artifact. Thank you.

Then there's the other approach that we're going to take. Sorry, the approach I just talked about will work for any deployment, so Redis or MySQL or whatever you're backing up can implement the metadata and the verify. This second solution is specific to Cloud Foundry. If you think about an app, an app running properly kind of threads through a lot of the interdependent components in Cloud Foundry.
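To make the metadata-plus-verify idea concrete before going on, here is a toy sketch. The "database" is a flat file of user names, and the `user_count` metadata key, the file names, and the verify check are all invented for illustration; a real verify script would query the restored database against whatever metadata the release team chose to record.

```shell
set -eu
# Toy "database": one user per line.
db=$(mktemp -d)
printf 'alice\nbob\nmax\n' > "$db/users"

# Backup script: emit the artifact plus metadata describing it.
artifact=$(mktemp -d)
cp "$db/users" "$artifact/users"
echo "user_count=$(wc -l < "$db/users")" > "$artifact/metadata"

# Verify: restore into a throwaway scratch environment, then check the
# restored store against the recorded metadata -- no full restore needed.
scratch=$(mktemp -d)
cp "$artifact/users" "$scratch/users"
restored=$(wc -l < "$scratch/users")
expected=$(grep -o '[0-9][0-9]*' "$artifact/metadata")
[ "$restored" -eq "$expected" ] && echo "backup artifact verified"
```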
And so our idea is that you create a canary app that uses as many of the components of Cloud Foundry as possible, and then you do a similar thing: you take a backup, you restore into this scaled-down, no-internet environment, you stop all the other apps from running, and you only let your canary app run. It running gives you an indication that all of the systems will work properly in the restored PCF. So we're super excited about both of those, and that work will be starting in Q2, probably, I think.

If we look at the longer-term roadmap, we will look at supporting larger databases. There are several changes to the BBR contract that will be required for that: incremental backups, point-in-time restore, and then, for super-big databases, you don't want to be copying the data back to the jumpbox, so being able to stream it directly to storage.

BBR sits in the Cloud Foundry ecosystem. We're open source, we're part of the incubator, and we can't and don't want to solve all the problems that need solving. So I want to give a shout-out to our partners. Redis Labs is building BBR scripts for their BOSH-deployed clustered Redis. Stark & Wayne has integrated BBR into SHIELD, and they've also built BBR scripts for Concourse. Dell EMC has done something really clever integrating Data Domain into the jumpbox, so you can put your artifacts directly into long-term storage and have them de-duped. And Crunchy Data is about to start working on BBR scripts. So, you know, we operate in an important ecosystem.

If I think about it, today is about an update. We went GA in August last year, and I'm pretty excited about the kind of adoption that we've had: all the different releases and deployments that have implemented BBR scripts. And if I think about the ways that we've been building out BBR, I can bucket them into increased coverage.
That means being able to use BBR in more hardware scenarios, for different types of products, that sort of thing, and also helping operators have a better experience backing up and restoring all the things that they manage. Okay. That's it. Do y'all have questions? Yes.

About the artifact validation: what's the purpose of that validation? Is it to check whether the restore worked properly, or is it to check the integrity of the backup?

I think both of those things are the same thing. So, if I were an operator, what I would do is... here, I'm going to go back up here. We recommend that people use Concourse to automate their backups. So what I would do is, at the end of the Concourse pipeline, add a step that uses this validate-artifact command on the backup artifact that was just created, run it through verify, and then you know, or you have more confidence, that that backup artifact will work in case of a disaster or whatever restore scenario you need it for. The operators we've talked to will run a restore, I don't know, once a month, once a year, really not very often. This means that with every single backup you can have confidence that it will work. We do MD5s already; this is an extra layer of validation.

Does this tool provide the ability, after restoring everything, to leave applications in a stopped state instead of the state they were in when the backup happened? Are you talking about the validate thing, the verify? No, it's more like: when you restore the Cloud Controller database, does it restore it in the state that it was? Or do you have the ability to bring it back up and modify the database so that all apps are in a stopped state, just to give the application teams more control over when they want to bring apps back up after disaster recovery?
The way you would do that is you would scale your Diego cells down to zero so the apps can't come back up, then do your restore, then use the CF API to change the state to stopped, and then scale your Diego cells back up.

So I know you mentioned backing up the blobstores. Our disaster recovery plan is actually not to bring back any of the stuff in the blobstore, right? Our developer teams already have repos that they back up, so our plan is: we'll bring the environment up and you can push your app, because our blobstores have more than a terabyte of data. Is there an option to exclude stuff that you don't want to back up? Say I want to get UAA, everything except the blobstore. Because before BBR, I think there was CFOps, which gave you the option of doing a light backup that does not include any of your blobstore data and just backs up your buildpack stuff. I don't think that option is there currently, because we did take a look. Is that something on your roadmap that might be available?

We could do that. I suppose the problem we wanted to solve was being able to get back to an entirely working system without having to wait for app developers to re-push after a restore, because it has to do with your RTO, your recovery time objective. If you are in control of all the steps, so for example you don't need anyone to re-push, then that is a bounded amount of time: you know how long a restore takes, and at the end of that restore everything is back as it was. If you have a strategy where you are relying on your app developers to re-push things, that is an unbounded amount of time, and it may be that your app developers' automation isn't up to date or isn't complete. You just get a better guarantee on the state of the world, both in terms of time and correctness, with the backup.
I think, in addition to what he said: as people move to CF, there is more than one foundation, because the strategy is to push our app to as many foundations as possible and have the ability to take a hit if one foundation is not available. We found it really helpful to give teams all of that: you want to enable them to do everything they want with their application, and we bring up the platform. Not having to restore a complete blobstore actually gives us a better RTO. We just bring up the platform as quickly as possible, and the app owners have the ability to say, okay, let's push, make sure our thing is okay, before we enable our GSLB or LTMs and things like that. It would be nice to see whether that is an option you can build into BBR.

We will consider that. Thank you.

For backups that have large amounts of data, you said that when you're doing the validation you're just checking some metadata instead of the whole data, right? What is the strategy there?

Yeah, so the thing is that it's the people who are creating the backup script, so the UAA team for example, who will decide what that metadata is and what they feel is enough of a check. The metadata could be the entire contents of the database. We're actually going to start that exploration with the UAA team next month, so we will know more then about what the metadata needs to look like.

Also, I was looking at the documentation, and it said that BBR doesn't save the service data. Is that referring to, let's say, you're running a Redis and there's data there?

In order for a service to be backed up with BBR, there have to be scripts packaged in the service that implement the BBR contract. That's just started to happen: PCF Redis has just shipped with BBR scripts. So as adoption increases, that answer will change. So it's like that for dynamic services? Yeah.

Thanks for the presentation. We run a fleet of backing services with SAP on Cloud Foundry.
My question is that today we do disk-based snapshots using the infrastructure capabilities. How do you see your solution fitting into this game? Primarily, like you talked about, having an agent running on the deployment itself, right? We don't actually copy the snapshot to S3 ourselves; it's done by the infrastructure. So will your solution solve some things here? The question is that we are relying on the infrastructure capabilities to do the backup or snapshot of the disk itself. Now your tool says that you have to either download the data, which you explained earlier, or do a copy between buckets. Take the example of AWS, where the copy of the snapshot into S3 is done by AWS; we don't control that. So how does this solution fit?

The kind of backup that you're talking about goes through the back door. It assumes that it knows everything about what's running and that you can get a consistent backup that way. Now, in a distributed system, you cannot guarantee that if you're backing up this disk and that disk, it will happen at the same time and the data will be consistent. The reason BBR goes through the front door and says lock, stop your API, is so that you can get a consistent backup across a distributed system. That's one thing. The other thing is that if you're running an in-memory data cache like Redis, going through the front door means that the data gets flushed to disk before the backup is taken. If you're just going in and taking disk snapshots, you're not going to get an accurate backup. Okay. All right. Yeah. Okay. Thank you.

Looking at the AWS S3 external blobstore versioning: is there a plan on the roadmap for Azure blobstore? It's about halfway done right now. So right now, my understanding of the external blobstore backup using S3 versioning is that it's kind of tied to the PAS tile, or the ERT tile, or whatever. Yeah.
But we're also using external storage in our Director tile. Is it on the roadmap to get that integrated? Because as far as I know, right now I have to manually copy that up, tar it up.

Do you remember we were talking about the SDK, and how any release or deployment that uses the SDK gets backup and restore of external blobstores? The BOSH team is actually just about to start the work to change their scripts to use the SDK. Once that work is complete, you will be able to use BBR to back up and restore BOSH Directors configured with external blobstores and external databases with TLS. Yeah. Cool.

Any more questions? You had a question. Did you have a question?

I just had a question around the backup validation. Will we be able to view the metadata for that backup? Because I had a situation where a MySQL service instance was deleted, so I had to build out every single instance, because I didn't know the name of it, and then view the schema for each of those MySQL instances to determine which one we needed to restore. So being able to see what's in that metadata would be super helpful.

Cool, that's a nice use case.

Will you support encryption for backups? Yeah, that's on our roadmap as well. I mean, our experience is that that's a problem our customers can solve, because they can use GPG and put that at the end of their Concourse pipeline. But it is on our roadmap.

Any? Okay. One more. Yes.

Let's say your whole foundation goes away for some reason. If you wanted to use the restore, would you have to at least bring up part of the system, the vanilla version, and then you could restore over it? Yeah. The way that you restore, that you get your Cloud Foundry back: the very first step is to install Cloud Foundry. Yeah. Okay.

I think we'll call it. Thank you so much.