Hello, everyone. Welcome to Reviving the Platform Every Day. My name is Emmanuel Kias, and this is Josh, and we are members of the Platform Recovery team at Pivotal.

So let's look at our agenda for today. Oh, it works. First, we're going to talk about what disaster recovery is, how we test it, and what the challenge behind it is. Then we're going to talk about the first steps we took as a team to address this problem. Then we're going to see how we distributed the ownership and how we integrated everything back together. That will bring into the discussion the DRATs framework, the disaster recovery acceptance tests, which is the meaty part of the talk. Then we're going to see how we created some shared CI tasks. Then we're going to talk about the ongoing development, because this is a living project and it keeps going, and about the next steps in our journey.

So now let's talk about disaster recovery: what it means and how we test it. We're about to frame the problem. What is disaster recovery? If you think about it, it is actually a plan for how you recover in case of a disaster. Imagine you have a system running in production and something goes wrong. When you freak out and say, "I have to bring my production system back, what do I do?", those exact steps are your disaster recovery strategy, assuming that you take backups, of course. Then you can recover.

Then we have a question: can Cloud Foundry really recover from a disaster? And the answer is yes, because we have BBR, BOSH Backup and Restore. What BBR gives you is an automated way to take backups, and a way to take those backups and restore them in case of a disaster. So the thought that now comes to mind is: good, I now have BBR, I have the technical foundation I can use to take backups and do restores. But can I actually feel safe now? Having a backup doesn't necessarily mean that you can recover from a disaster. How do I know that my platform can recover? And how do I test that BBR actually works? I have a backup, but how do I test that it actually works?

Before answering those questions, let's look at a very minimal disaster recovery strategy. You take a backup, everything is fine, then suddenly disaster happens. You go and follow your instructions: you have your backup, you restore it, and things should be back to normal. But how do we test that? We have a bunch of instructions, but we need to test-drive them like any other feature we build in our code. If we start thinking about it, this is roughly how a test case would look. I'm going to set up a fresh component, if I'm testing DR for that component. Once I've set it up, I'm going to create some state and take a backup. I will try to simulate some disaster, probably by destroying some things, then I'm going to restore from my backup. And finally, I'm going to assert: hey, is my state A back after my restore? That looks like a simple test case. To visualize it, imagine that we're responsible for only one thing. Use your imagination: this is my component, and that component has some state. If you look at it that way, I only have to take care of this state, and then I'm fine: I take a backup and then I can restore it.
But things, unfortunately, aren't that straightforward in distributed systems. This is what a picture like that looks like in Cloud Foundry: these are some of the stateful components. Suddenly the previous image explodes. Instead of a single component where I have to take care of one thing, there are many things I have to take care of, and all of these stateful things make up a bigger thing, which is called Cloud Foundry.

So now let's talk about the first steps we took towards this challenge. We know how we want our disaster recovery to work. We know how we want to test it. But how do we actually make it happen? The challenge was given to our team, Platform Recovery, to drive this feature across all these teams. But now we have a bunch of questions. We don't really know how each component in Cloud Foundry works. We have an idea about how to test the DR strategy for each individual component, but how do we then make it work for the whole platform? And most importantly, how do we coordinate with all these people, because we need to go and ask questions. These are the teams in Cloud Foundry: we have CAPI, UAA, Networking, CredHub and our team, and they are all distributed across different time zones. You can imagine a map now, the US and Europe, all in different time zones. So suddenly we have a technical problem, which is that I have all these things to test-drive, but we also have a process problem, because we have all these people we have to talk with. We have to start from somewhere; we have all these questions, but we haven't yet started on them. We have a big box, and this big box is our big problem.

OK, we have a big problem. The first step: let's break it down into smaller problems. If we do a mental mapping of that, we have the big problem called CF, and then the smaller problems can be the individual components. At that point, to be honest, we just cheated, because we got inspired by how Cloud Foundry was built in a distributed way in the first place. We have individual teams working on individual components independently and then coming together into a whole system. We got inspired by that and said: OK, let's break down our big problem into smaller problems. And that's what happened. We said each component should have its own test case, and then we wrote some first sample test cases. At this point the samples were very important, because we didn't really know. We were in the discovery process, trying to figure out how to test-drive a component. We didn't really know exactly what it means to test-drive a component, because we didn't really know whether a component had been properly restored; the teams who actually develop the thing know how to check that. So we wrote the first test cases. And if we go back a little bit, this is what a test case looks like, just to remind ourselves. We have all these steps. It's not a straightforward thing where I have a feature, I put in some input, I get some output and I assert on it. We need some kind of orchestration. So as soon as we have these test cases, what do we do now? We need something to run them. And this is what we did: we need a test suite, a framework that orchestrates this lifecycle, one that creates something, pushes some state, takes a backup, destroys things, and all the things we talked about.
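To make that lifecycle concrete before getting to DRATs itself, here is a minimal, hypothetical sketch in Go of what a single-component disaster recovery test could look like. Every type and helper in it is invented for illustration; it is not DRATs code, just the shape of the five steps described above.

```go
package dr

import "testing"

// All of these are invented stand-ins for "whatever your component needs";
// a real test would shell out to deployment tooling, the backup tool and
// the component's own API instead of mutating an in-memory struct.
type state struct{ users []string }
type backupArtifact struct{ snapshot state }

func deployFreshComponent() state                  { return state{} }
func createSomeState(s *state)                     { s.users = append(s.users, "alice") }
func takeBackup(s state) backupArtifact            { return backupArtifact{snapshot: s} }
func simulateDisaster(s *state)                    { s.users = nil }
func restoreFromBackup(s *state, b backupArtifact) { *s = b.snapshot }

func TestSingleComponentRecovery(t *testing.T) {
	s := deployFreshComponent() // set up a fresh component
	createSomeState(&s)         // push some state into it
	b := takeBackup(s)          // take a backup
	simulateDisaster(&s)        // destroy things
	restoreFromBackup(&s, b)    // restore from the backup

	// Finally, assert that state A is back after the restore.
	if len(s.users) != 1 || s.users[0] != "alice" {
		t.Fatal("state was not recovered after restore")
	}
}
```

The framework's job, then, is to run that same sequence for every component's test case against a real Cloud Foundry, which is what DRATs does.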
And this is how DRATs actually came to life. DRATs is the Disaster Recovery Acceptance Tests, a framework we wrote that all these test cases run on.

Now I'm going to talk about ownership and integration. We had a bunch of things in our hands: a test suite and some first samples. But we were still on our own, still Platform Recovery, trying to figure out how to make the thing work end to end. So for us, it was time to divide and conquer. We have a distributed system, so why not distribute the ownership? We went and reached out to the teams and said: hey, here's a tool to run your test cases, and if you want to write a test case, this is what a test case should look like. Why don't you copy that, and please contribute? And that's what they did. Each individual Cloud Foundry team started developing their own test cases, independently of each other, using the framework we provided.

So OK, now things are in motion. There are pieces being developed here and there, and each team was test-driving the disaster recovery for their own feature. But DR is not really useful unless it works for the whole thing, right? Imagine that you have ten things you care about, something happens, you restore, and only seven of them work and three of them do not. You cannot really say that the platform is healthy at that point. You need to make sure that everything comes back together. So we distributed them, and now we had to put everything back together. And that's what we did: imagine that every team finished their work, and then we put everything together in a single place and treated it as a single test. Having all these test cases run against a single CF deployment as a single suite, that's how we integrated everything.

And then finally: OK, we have all these tests, now who runs them? Many teams run them, for different reasons. First of all, our team runs them, because we own them and we want to make sure that the framework and the test cases wire up and work together. Then the individual Cloud Foundry teams run them, because they want to make sure their test cases and their components work in this end-to-end disaster recovery strategy. And finally, you have teams like Release Integration, which runs the tests before they cut a new release of cf-deployment, to make sure that the whole platform works. So different teams run them for different reasons. Now I'm going to hand over to Josh. He's going to talk about the actual framework, about DRATs.

Thank you, Emmanuel. I can just about see you at the back. Hi. Has anyone here run BBR? Oh, actually quite a few. Cool. Good to know. So I'm going to talk to you about DRATs and about this shared tooling that we created in order to drive out this cross-cutting feature, and what it came to be in concrete form. DRATs is this acceptance test suite that automates a full backup and restore. Each test case has different hooks it can use for its setup and for confirming that things work the way it thinks they should. And it allows teams to run just their test case, to focus on what's important to them. The code is open source; you can go and find it on GitHub. So what do you need to run DRATs? You need Cloud Foundry Application Runtime and the BBR CLI, essentially.

But what is an acceptance test? This is not the first kind of acceptance test for Cloud Foundry.
Before it there were things like CATS and WATS and NATS and RATS; there are a lot of things that end in ATS, and DRATs is just one of the newer ones. An acceptance test is something that hits the platform from the outside. We're talking about running CF CLI commands. And that means you end up exercising the platform: you're not just exercising a component on its own, you're exercising the platform and a feature of the whole platform. So I just want to make it clear what level of testing we're talking about.

So what does DRATs do? In essence, it runs two commands: bbr deployment backup and bbr deployment restore. That in itself would just be an in-place restore of something that hasn't changed. Not so exciting, okay? We've got to do more than that. Let's make it harder. So we created these hooks, and they are about setting up some state in the platform and backing it up. Then we have a hook that allows you to mess around with that state: maybe you delete one of the things, or create a new thing. Then we run restore and make sure we're back to where we were, to that initial state we created, to prove that the restore has taken us back to the snapshot that the backup captured. And then we need to clean up that state afterwards, so that we can keep running DRATs again and again.

And this is the concrete implementation it comes down to. This is in Golang; as you know, we're fans of Ginkgo and Gomega, so a lot of our test suites end up in Golang. This is the interface for a test case. Apart from those hooks I just described, there's a name, which is important too, so you know which one you're running. And these are the hooks: before backup, after backup, after restore and cleanup.

So I'm going to step you through one of the genuine test cases. There's one for UAA, and that's user authentication and authorization. What does it do? Before backup, we create a user and we verify that we can log in. So we know, as the UAA team, that we've created an entry in UAA's database, but we've done it at a platform level: we've run cf create-user and cf login. After backup, we delete it and make sure that it's gone: login does not work. Then we run restore and we want to go, huh, I should be able to log in again. Does that make sense? The state has come back: that user's entry in UAA's database has come back. And then afterwards we just delete that user, so we can run this test again and again.
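Putting the interface and that UAA walkthrough together, a sketch of what such a test case can look like in Go is below. The interface shape follows what's described here, a name plus the four hooks, but it is simplified (the real open-source DRATs interface also takes a config argument), and the UAA-style implementation, including the user name and password, is purely illustrative rather than the actual DRATs code.

```go
package testcases

import (
	"fmt"
	"os/exec"
)

// TestCase is the shape described in the talk: a name plus four hooks.
type TestCase interface {
	Name() string
	BeforeBackup() error
	AfterBackup() error
	AfterRestore() error
	Cleanup() error
}

// cf shells out to the CF CLI, the "platform level" the talk is about.
func cf(args ...string) error {
	out, err := exec.Command("cf", args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("cf %v failed: %s", args, out)
	}
	return nil
}

// uaaTestCase is an illustrative UAA-style case.
type uaaTestCase struct{}

func (uaaTestCase) Name() string { return "cf-uaa" }

func (uaaTestCase) BeforeBackup() error {
	// Create a user and verify we can log in, i.e. an entry now exists in UAA's database.
	if err := cf("create-user", "drats-user", "drats-password"); err != nil {
		return err
	}
	return cf("auth", "drats-user", "drats-password")
}

func (uaaTestCase) AfterBackup() error {
	// Simulate the disaster: delete the user and make sure login no longer works.
	if err := cf("delete-user", "drats-user", "-f"); err != nil {
		return err
	}
	if cf("auth", "drats-user", "drats-password") == nil {
		return fmt.Errorf("expected login to fail after deleting the user")
	}
	return nil
}

func (uaaTestCase) AfterRestore() error {
	// After restore, the entry in UAA's database should be back and login should work again.
	return cf("auth", "drats-user", "drats-password")
}

func (uaaTestCase) Cleanup() error {
	// Delete the user so the test can run again and again.
	return cf("delete-user", "drats-user", "-f")
}
```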
So there are lots of test cases; at the moment, I think this is most of them. There's one for testing apps, which covers CF orgs, spaces and pushed apps. There's an interesting one about app uptime, about apps staying up during backup. There's UAA, CredHub, networking, router groups, the NFS broker. There's a bunch of test cases focused on the behavior that each component provides to the platform, and they map fairly closely to the state in the platform that the component owns.

So let's look at app uptime. I'm picking on this one because it's a little bit different from a lot of the others, and it's interesting because this is a feature of backup: when you run a BBR backup, you want all your apps to keep running, right? You don't want everything to go down just because you're taking a backup. A core feature of the BBR implementation is that apps all keep running; the platform keeps running. Before backup, we push an app and we start polling it, I think every second. Then after backup, we stop and we verify that every poll was successful. This is one of the ways we prove that right through the BBR backup process, the core functionality of the platform, which is running apps, has survived. That is something that should not be interrupted by running a backup. Then after-restore is a no-op, and in cleanup we delete the app so we're ready to run it again.
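As a rough illustration of that polling idea, again a hypothetical sketch rather than the real cf-app-uptime test case, the hooks could look something like this, assuming the app has already been pushed and its URL is known:

```go
package testcases

import (
	"fmt"
	"net/http"
	"time"
)

// appUptimeTestCase sketches the idea: poll the pushed app once a second for
// the whole backup and fail if any poll failed. The fields and details are
// invented for illustration; the real test case pushes the app via the cf CLI.
type appUptimeTestCase struct {
	appURL   string
	stop     chan struct{}
	failures chan error
}

func (tc *appUptimeTestCase) Name() string { return "cf-app-uptime" }

func (tc *appUptimeTestCase) BeforeBackup() error {
	// (Push the app with the cf CLI here, then start polling it every second.)
	tc.stop = make(chan struct{})
	tc.failures = make(chan error, 1000)
	go func() {
		ticker := time.NewTicker(time.Second)
		defer ticker.Stop()
		for {
			select {
			case <-tc.stop:
				return
			case <-ticker.C:
				resp, err := http.Get(tc.appURL)
				if err != nil {
					tc.failures <- err
					break
				}
				resp.Body.Close()
				if resp.StatusCode != http.StatusOK {
					tc.failures <- fmt.Errorf("unexpected status %d", resp.StatusCode)
				}
			}
		}
	}()
	return nil
}

func (tc *appUptimeTestCase) AfterBackup() error {
	// Stop polling and verify that every poll during the backup succeeded.
	close(tc.stop)
	select {
	case err := <-tc.failures:
		return fmt.Errorf("app was unavailable during backup: %v", err)
	default:
		return nil
	}
}

// The restore step does not need to prove anything extra here, so it is a no-op.
func (tc *appUptimeTestCase) AfterRestore() error { return nil }

func (tc *appUptimeTestCase) Cleanup() error {
	// (Delete the app with the cf CLI so the test can run again.)
	return nil
}
```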
Now, DRATs also has a hard mode. If you've been astute, you might have noticed that what we've really done so far is just an in-place restore, and you can run into scenarios where you're actually relying on state that hasn't quite been cleaned up. So we have a flag that allows you to run DRATs with a full delete of the Cloud Foundry deployment: all the VMs and all the disks are deleted, then we bring up a clean deployment and restore into that clean one. And this is to absolutely prove that your backup artifact is a complete set of the things you need to restore.

So I talked about focus. If I'm in the UAA team, I just want to run my test case in my pipeline. I don't want to run everyone else's test cases, because that's just too much noise, and especially at a platform level, this acceptance-test level, you can get slow feedback, and you really want to minimize the impact of running these tests as much as possible. So in the config that DRATs takes, there's basically a bunch of Boolean flags to say which ones you want to run. It means that every team can work independently on their contribution to disaster recovery in the platform without interrupting the other teams. And as Emmanuel said earlier, when we're spread across many time zones, it is vital that we can iterate independently.

Cool, so I've hinted now that people are going to run this in their own pipelines. This again becomes a cross-cutting thing. We don't want everyone to reinvent the wheel of how to run DRATs; people are just like, give me DRATs as a product. So the other thing we help them with is shared CI tasks that they can just drop into their pipelines. We're really lucky in CF that we have a homogeneous CI landscape: we all use Concourse, and that's super useful for driving out this cross-cutting feature. It meant that we could just produce something to put into their Concourse pipelines. A thing we've learned in the history of developing CF is that pipelines tend to belong to teams and tend to change a lot; a lot of the time, stuff is changing in pipelines. So we couldn't just give people pipelines. They're very brittle things to maintain, and I do not recommend it. What I do recommend is finding the pieces in the pipeline that you can reuse, and we found that those are tasks. In Concourse, a task is something that you can actually give to people. It's got a small API. It's a thing we're going to maintain for you: use it and reuse it across many pipelines.

So this is one of the tasks, the task to run DRATs with this thing we call the integration config, which is just a piece of JSON with some config of what you want to run. And who uses Concourse? We've got Concourse users here, yeah, quite a few. Great. So I've stripped away a lot of the noise in the Concourse task definition, like what Linux container we're running in, because it doesn't really matter. The params maybe matter a bit more, but I took them out because they're a bit noisy. What really matters is that I've got a set of inputs: I've got the DRATs test suite, I've got the BBR binary, and I've got some config. That's all I need to run DRATs. You want to keep this API as small as possible so that you can maintain the task long term and make it as reusable as possible.

So you can see there, I've got this integration config thing. How do I get one of those? One of the things that comes up is that people don't want to be told how to deploy a Cloud Foundry. We want every team to be able to deploy Cloud Foundry however suits them, and many teams have different kinds of tooling. So we focused on the one that most CF teams are using, and that's BBL, the BOSH Bootloader, which is used a lot within the R&D teams for deploying CF. We provided a task that takes some base integration config, takes your vars for your CF deployment and your BBL state, which is the stuff around your BOSH director, and just populates all the credentials into your integration config. Because what you need in there is a bunch of CF credentials and a bunch of BOSH credentials, and that kind of activity is something that can be reused across many, many teams if they're using BBL and cf-deployment. So this task became a nice, reusable component.
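For a feel of what that integration config carries, here is a hedged illustration of loading it in Go: a handful of CF and BOSH credentials plus the per-test-case switches mentioned earlier. The actual key names live in the DRATs repo; the field names below are invented for the example.

```go
package config

import (
	"encoding/json"
	"os"
)

// IntegrationConfig sketches the kind of JSON the talk describes: CF and BOSH
// credentials plus Boolean switches saying which test cases to run. The field
// names here are illustrative, not the real DRATs config keys.
type IntegrationConfig struct {
	CFAPIURL         string `json:"cf_api_url"`
	CFAdminUsername  string `json:"cf_admin_username"`
	CFAdminPassword  string `json:"cf_admin_password"`
	BOSHEnvironment  string `json:"bosh_environment"`
	BOSHClient       string `json:"bosh_client"`
	BOSHClientSecret string `json:"bosh_client_secret"`

	// Per-test-case switches, so a component team can run only its own case.
	IncludeUAA       bool `json:"include_uaa"`
	IncludeCredHub   bool `json:"include_credhub"`
	IncludeAppUptime bool `json:"include_app_uptime"`
}

// Load reads the integration config from the path handed to the CI task.
func Load(path string) (IntegrationConfig, error) {
	var cfg IntegrationConfig
	raw, err := os.ReadFile(path)
	if err != nil {
		return cfg, err
	}
	err = json.Unmarshal(raw, &cfg)
	return cfg, err
}
```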
So here we go. Here are all these teams, and they've all got pipelines, and I'm going to show you some. Who loves a pipeline? Yeah, okay, cool. It's basically a graph, right? Everyone likes graphs. So, Emmanuel said earlier: why are we all running this? What's my purpose in running DRATs? The purpose, if you're a component team, is to check that my component's disaster recovery works, and that's all I really want to care about. So those teams tend to run DRATs focused on one test case, maybe two, and they don't want to know about the others. Then we've got the two teams that end up running all of DRATs, and that's Release Integration and us, Platform Recovery, but we run it for two different reasons. Release Integration have the job of producing cf-deployment, the stable CF deployment that can be relied upon, so they run all of DRATs to make sure that the entire platform can be restored from disaster. Platform Recovery, we run all of DRATs so that we know that every test case is working and the test suite is working, which is a slightly subtle and different nuance. We're trying to take the pressure off Release Integration by helping and talking to all these teams, understanding the business of disaster recovery and having some expertise around it, because Release Integration have to integrate all the things and their job is hard. So we had this specialist role around DRATs, which I think just helps Release Integration.

This one was perfect: the day I went to grab the CredHub CI, the job where DRATs runs was actually failing. There it is, boom. Interestingly, this is CredHub, their component, and this is their edge build. This job is actually called "test CF integration", because CredHub is used in a variety of cases. That's why their system tests are fanning out all over the shop here: different back ends, different use cases, is it with CF, is it with BOSH, all sorts of stuff going on. But there it is in CredHub.

This is CAPI, the Cloud Controller API. Would anybody like to take a guess: where is DRATs? Anybody? Come on, have a guess, just shout it out. You wanna have a go? Come on, yeah? In the big, tall one? I like where you're going with this. There are two; they're actually running DRATs twice, which is super cool. This is great. The reason is two different backing databases: they're testing it when CAPI is deployed with MySQL and when CAPI is deployed with Postgres, which is really nice. It's a real full integration.

This is Release Integration, just to show you that they actually cut up their pipelines a bit more, because they have so much work to do, as I mentioned earlier. You can see here that they take every release candidate of cf-deployment and it has to run through DRATs. And here's our team's DRATs pipeline. We take the latest cf-deployment, the latest BBR, the latest DRATs, and we run everything. We also do the hard-mode one, run-with-destroy, where we just destroy the platform multiple times every day. This thing has run in the thousands of times; it's pretty cool, the build history is very long. And you can also see here that we're running pull requests. As maintainers of the test suite, we are the point at which changes to test cases and changes to the suite are made. Other teams can submit pull requests, and they have to go through here; this is our CI for every pull request that changes DRATs.

So I want to talk about the ongoing development. There are lots of things coming up, lots of things that have happened even in the last year since BBR for CF went GA. We're maintaining this thing; it's a living, breathing thing, and DRATs continues to evolve. I showed you those tasks; those are not the only tasks there have been, and there are other tasks that we're now trying to get rid of and deprecate as we learn more and make this thing more efficient. And we're helping CF teams as they move forward.

One of the things that's come up that's quite interesting is versioning. Do I need a version of DRATs? At the moment, we're trying to resist that. You may have heard that in Pivotal we really push for trunk-based development as much as possible; we don't like versions of things if we can avoid it. But one case here is CredHub, where they've shifted from a V1 API to a V2 API. It's a breaking change for how CredHub works, and it also broke their DRATs test case: it just didn't work anymore. So what we've pushed for is making every test case responsible for its own versioning. There's no way we could know, in Platform Recovery, the particulars of this CredHub V1 versus V2 stuff, and we're not the best people in the organization to find out about it. So we get them to do it for us; we push this out into the teams that have the most knowledge. Their test case does a preliminary check to find out what the API is, and then runs either the V1 test case or the V2 test case. And that's how we do it. So it shouldn't matter what version of CF you're running, you can just run DRATs against it. That's what we're aiming for at the moment. We'll see how we go; maybe it will evolve.
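That "every test case owns its own versioning" pattern could be sketched roughly as below. This is a hypothetical probe, not the real CredHub test case or API; the endpoint and field names are invented for illustration.

```go
package testcases

import (
	"encoding/json"
	"net/http"
	"strings"
)

// credhubTestCase sketches the pattern: the test case itself probes which API
// the deployed CredHub speaks, then runs the matching flow. The /info endpoint
// and the "version" field are stand-ins, not the exact CredHub API.
type credhubTestCase struct {
	credhubURL string
	useV2      bool
}

func (tc *credhubTestCase) Name() string { return "cf-credhub" }

// detectAPIVersion is the preliminary check mentioned in the talk.
func (tc *credhubTestCase) detectAPIVersion() error {
	resp, err := http.Get(tc.credhubURL + "/info")
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	var info struct {
		Version string `json:"version"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&info); err != nil {
		return err
	}
	tc.useV2 = !strings.HasPrefix(info.Version, "1.")
	return nil
}

func (tc *credhubTestCase) BeforeBackup() error {
	if err := tc.detectAPIVersion(); err != nil {
		return err
	}
	if tc.useV2 {
		return tc.setCredentialV2() // V2-style flow
	}
	return tc.setCredentialV1() // V1-style flow
}

// The two flows would differ in how they talk to CredHub; elided here.
func (tc *credhubTestCase) setCredentialV1() error { return nil }
func (tc *credhubTestCase) setCredentialV2() error { return nil }
```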
The other thing we've done is expand the open-source CF support for all the different ways you can configure CF. You can give it different types of external database, not just the default internal one, and you can give it different types of external blobstore. We've been building out support for these things because they became cross-cutting concerns that every release author needed to deal with. So we have an SDK release, the backup and restore SDK release, which knows how to back up these kinds of things. Google Cloud is actually the last one; we're in progress on that right now. And this allowed every release author to just reuse it, because when I'm implementing this, I don't want to implement mysqldump and all the scaffolding around it every time. I just want a job I can put in my BOSH release that does that bit for me and integrates with BBR. And that's the model we have pursued.

Just so you know, we do use DRATs for Pivotal Application Service, the Pivotal-vended version of Cloud Foundry. And it's really simple: we run all the open-source test cases, and then we run some extra ones that exercise the proprietary components that come with PAS. And that's really nice. I guess that's how the CF certification process works: you have the base platform, which should have all this disaster recovery feature set, plus some extra bits, and those also need disaster recovery, because when we sell to our customers, we want to be sure that the whole of PAS and all its optional components can be recovered from disaster. This is what PDRATs looks like. This was actually a little while ago, so a couple of extra versions have come up since. We do end up maintaining a lot of versions, because this goes back to 1.11; we've got many versions of PAS. And interestingly, this brings the versioning stuff up again: we want one version of PDRATs that works for all of them.

So what's next? Well, we've found that this way of working, cutting up the test cases, pushing responsibility out to different teams, and then integrating together, has been very successful. It worked for DRATs, it worked for PDRATs. So we've got some more DRATs, because who doesn't like DRATs? We've got BDRATs for the BOSH Director. Oh yeah, because BOSH is getting complicated now: BOSH can have a CredHub, a UAA, BOSH DNS, and the list of features keeps increasing. BDRATs is trying to cover all of that. And we've now got the first version of KDRATs; the K is for Kubernetes, or for Kubo, whichever one you like. That's doing backup and restore acceptance tests for CFCR clusters. So BBR can now back up and restore etcd clusters in multi-master CFCR, which is pretty cool. I'm going to hand back to Emmanuel to sum up.

Thank you, Josh. Now let's recap what we've seen. We started our talk asking ourselves what disaster recovery is; we framed the problem and we saw how we can test-drive it. Then we looked at the first steps we took in that direction, then at how we distributed the ownership to multiple teams and then integrated everything back together. Then we examined how the actual framework works, how DRATs works, bit by bit. And we saw what CI tasks we distributed for teams to reuse in their own pipelines, so they wouldn't have to reinvent the wheel. Finally, we saw that this is a living project: it's still going on, it still gets pull requests, and we keep trying to maintain it and make it better. And then we examined what's coming next:
how we can take this idea of test-driving disaster recovery for a system, for a platform, and apply it to different things like BOSH, like Kubernetes, and so forth. That's our talk, so thank you for listening. Do we have any questions? Yes.

So the question is whether we will provide encryption. Good question. This is one of the things on our roadmap that we have deliberately avoided so far, as we believe there are existing solutions for encrypting artifacts and for storing artifacts. So things like scheduling, encryption and storage are things we have not pursued, primarily to limit our scope, and also because we believe there are products out there that do this kind of thing. It would be interesting to know, and maybe we can talk afterwards, if that's something you want. If we get enough people saying they really want this: of those three things, which one do you want first? That would be useful feedback. Yeah, sorry, you both put your hands up at the same time. So okay, yes please.

Okay, so I think there are a couple of things in what you're saying. Sorry, for those at the back, I'm not going to be able to repeat everything that has just been said. I think what you were talking about is the case of pushing an app and then also having some app state in a service instance, like a MongoDB. Right, so is it possible to back up service instance data with BBR? That's one of the questions, I think. And you also brought up the idea that when I do things like creating orgs and users, I'm creating state in multiple places in the platform, in multiple components.

So to deal with that one first, because that one is covered: the whole approach in BBR is that you take a deployment and you lock it. Lock means: do not change your state. I run the backup, I unlock, and you're allowed to change state again. How you implement lock and unlock is up to the component. For example, UAA is quite clever: it goes into a limited-functionality mode, so tokens can still be refreshed and issued, but you can't create a user in that period, because you don't want the platform to just stop because authentication has stopped. Other components like the Cloud Controller still actually have to go down, because there are just so many things to deal with. That's unfortunate: we have API downtime, and we're trying to mitigate and reduce that as much as possible. In that lock window, we get consistency. The purpose is that all the state across UAA, Cloud Controller and CredHub is captured in that lock window; it is consistent and can come back together. That is the point of the lock. So it's very valuable that you have a consistent state across the whole system.
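As a very rough illustration of why that lock window gives consistency, here is a sketch of the ordering only; it is not the actual BBR orchestrator, whose lock, backup and unlock behavior is implemented by scripts in each BOSH release.

```go
package bbrsketch

import "fmt"

// Lockable sketches the contract described in the answer: each component
// decides for itself what "lock" means (UAA drops to limited functionality,
// the Cloud Controller stops its API, and so on).
type Lockable interface {
	Lock() error   // stop changing state
	Backup() error // capture state
	Unlock() error // allow changes again
}

// backupDeployment shows the ordering: every component is locked before any
// backup starts, so the state captured across UAA, Cloud Controller and
// CredHub all belongs to the same lock window and fits back together.
func backupDeployment(components []Lockable) error {
	for _, c := range components {
		if err := c.Lock(); err != nil {
			return fmt.Errorf("lock failed: %w", err)
		}
	}
	// Unlock everything again, even if a backup step fails.
	defer func() {
		for _, c := range components {
			_ = c.Unlock()
		}
	}()
	for _, c := range components {
		if err := c.Backup(); err != nil {
			return fmt.Errorf("backup failed: %w", err)
		}
	}
	return nil
}
```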
You then talked about service instances. BBR is a framework for any BOSH release, and it's up to the BOSH release authors to implement BBR. I know that some have and some haven't, so it's a mixed bag. The question of app-centric backups is something we have not explored as much. We have focused first on the, should we say, platform-operator-centric backup and restore of a platform, and not so much on the app layer, as a lot of different things come up there about where the state lives. But it's something that we're interested in and are exploring at the moment. Yeah, it's a big challenge actually, a really big challenge. Most apps don't have their state in just one simple place; it can be more complicated. Maybe it's a set of apps, maybe several things and several backing services. There's a lot of interesting stuff to do there.

I think we have time for one more question. One more question? Right, should we integrate it into CATS? Should DRATs become part of the great big beast of the Cloud Foundry acceptance tests proper? We hadn't really considered that, to be honest; this is the first time that has come up, so nice observation. It's maybe something we can consider. But we don't really know how people actually use it; we just know that teams are running it. So maybe that's a way to make sure it always passes, at the same point where you're already running acceptance tests anyway. But at the moment BBR is an optional feature, it started out as experimental, so you have to enable it with an ops file, and I can imagine you want your CATS to work with only the default, non-optional components. So maybe if BBR becomes a thing for everyone, so it just comes included, then yes, maybe doing it as part of CATS is a way to go.

Yeah, I think that's a good point. BBR is now in non-experimental ops files in cf-deployment: you can enable backup and restore for the different database and blobstore configurations with the ops files provided. But it's still optional; you don't have to do it. If all your apps are stateless, maybe you don't need it. So there we have to go back to value, and we focus on the recovery time objective and the recovery point objective. I think if you know what the answers to those questions are, then you know what you need to back up and restore: how can you get back, how can you recover your system? And think about the different scenarios. In DRATs, we just do one, which is: everything's gone. There are a lot more disaster scenarios, everything from manual error, to losing disks, to losing an AZ, to losing some storage device that cuts across many things. There are lots of different flavors. So in some ways this is simplistic, and there are a lot more scenarios to think about when you think about this for your system and the system you're providing to your developers.

Thank you very much. If you have any further questions, we'll be out here to continue the discussion, because that was an interesting observation. Thank you.