Hello, hello everybody. How's everybody doing? I think I'll start now. My name is Trine E., and that's George over here. We're from the Platform Recovery team at Pivotal, and today we're going to talk about how to design a disaster recovery plan for your CF foundation.

Before we move into the talk, let's go over today's agenda. We'll first talk about why you need a disaster recovery plan. Then we'll talk about disaster recovery as part of business continuity. We'll move on to cover high availability, and lastly we'll talk about common approaches for designing a disaster recovery plan. As part of that, we'll cover BOSH Backup and Restore, or BBR, the CLI that the Platform Recovery team maintains.

First you may ask: why do I have to think about disaster recovery? Is this something I really have to think about? Is it going to benefit me? The answer to all of those questions is yes. You should always have a plan, because everything can fail, and when things do fail, they fail in unexpected, nonlinear ways. And when things do go into a failure mode, you don't have the luxury of running away; you have to go in and actually fix your platform.

Okay, so let's talk about what kinds of failures we're dealing with here. The first set of issues can be software related. We all know that upgrades are hard, and if you're upgrading your CF deployment, that upgrade can fail and incur platform downtime. Or, unfortunately, your platform comes under some kind of security attack that has polluted your data stores, and now you need to roll back. The second set of issues can be hardware related. Data center failures are very uncommon, but we should still acknowledge that they're a possibility.
Or it can even be a planned hardware upgrade: you're switching to faster machines, better CPUs, and actions like that can incur downtime and all kinds of risks as well. And let's not forget about user errors. There are loads of things that operators and application developers can just accidentally delete. Operators can accidentally delete availability zones, which is really scary, or delete disks or VMs that they shouldn't have had the privileges to touch. Application developers can accidentally delete mission-critical apps, or CF orgs and spaces.

And of course, all of the failure modes we were just talking about will have an adverse impact on the availability of your platform and the services running on it. Failures may impact the ability to push applications: you cannot do cf push, so you cannot roll out any updates for your applications, and maybe that's exactly when you need to ship a security update, which is really bad. And if the failure mode means your applications are no longer routable, that means downtime for the applications, which is also really bad and we want to avoid it too. As the platform operator, you may have service level agreements with the users and consumers of your platform, so downtime may mean you cannot meet the service level agreement you agreed upon. That's really what we're talking about here today: some approaches and ideas to mitigate that.

Cool. So before we go into figuring out how to plan for disaster recovery, let's talk a little bit about disaster recovery and business continuity. I'm going to start with definitions and try to differentiate these two terms.
Disaster recovery is about the tools and procedures to recover your vital technology infrastructure after a disaster, after an incident. Business continuity, on the other hand, is the ability to minimize the business impact during, as well as after, a disaster. I'm going to explain this in a bit more detail.

But first, let's talk about business impact. What is business impact and how do we measure it? One of the frameworks I like the most comes from enterprise IT and is called business impact analysis, or BIA. The idea of business impact analysis is that you can measure business impact, and the main takeaway is that it categorizes impact into a few different areas. Business impact could be financial: your workloads are down, so you're losing money. It could be reputational, and reputation is sometimes something you can't recover once your company loses it. It could be regulatory, especially if we're talking about the financial industry. It could be life or safety, in the healthcare industry. Or it could be legal. So these are the areas we can measure business impact in.

We mentioned that business continuity aims to minimize business impact. I'm going to give you one more example to differentiate between disaster recovery and business continuity, and in this example I'm going to use trucks. Imagine you have a truck carrying a very time-sensitive payload, and that truck needs to deliver the payload from origin to destination. The incident is that the truck gets a flat tire. Your disaster recovery plan is to carry a spare tire: when you get a flat, you replace it with the spare, and the truck continues its journey to deliver the time-sensitive payload to the destination. This is recovering from a disaster: we lose a tire, we replace it, we finish the job. How about surviving a disaster? That's about business continuity.
In this case we have two trucks, because the payload is so time-sensitive that we can't afford to wait for the tire to be replaced. When the incident happens and we lose the tire on truck A, all we have to do is move the payload to truck B, and truck B delivers it to the destination. This is minimizing the business impact; in this case the impact is probably financial: if I don't deliver the time-sensitive payload on time, I get fined. Combining both is actually a full business continuity plan. Not only do you need to minimize your business impact and deliver the payload on time, you also need to recover your previously broken-down truck. Combining the two is how you deal with disaster. I hope this makes the distinction between business continuity and disaster recovery a little clearer.

Another way to approach this is to think about time. When the disaster hits, our first priority is to minimize business impact, and then to return to a previous good state. Today we're going to talk about minimizing business impact in high availability terms, that is, designing a platform for high availability so that in the face of a disaster we minimize downtime. Recovering to a previous good state is the second part of the talk, which is about disaster recovery.

So let's start with high availability. First things first, we need to identify our availability needs, and we're going to use availability service level objectives to measure those requirements. Availability SLOs are basically the agreements between the consumer and the provider of a platform about the minimum level of availability for that platform or workload.
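An availability SLO translates directly into a downtime budget. Here's a quick sketch of the arithmetic behind the "nines" terminology; the figures are illustrative:

```python
# Convert an availability SLO ("number of nines") into an annual downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_budget_minutes(availability: float) -> float:
    """Allowed downtime per year, in minutes, for a given availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR

for label, availability in [("three nines", 0.999),
                            ("four nines", 0.9999),
                            ("five nines", 0.99999)]:
    print(f"{label} ({availability}): "
          f"{downtime_budget_minutes(availability):.1f} minutes/year")
```

Five nines leaves you only about five minutes of downtime per year, which is why it is usually reserved for the most sensitive workloads.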
If we talk about SLOs for workloads, the applications running on your platform, we need to identify the business impact of a particular workload being down. Again, remember the business impact analysis framework I mentioned before, with its five types of impact you can measure. We also need to keep in mind that not every application or workload has the same SLO. You may have some workloads that are more sensitive, and therefore need a higher SLO, five or six nines as people like to say, and other workloads that are less sensitive.

Now let's talk about the platform. It's very important to differentiate availability requirements for the platform from those for the workloads. For the platform, again we're measuring business impact, but what is the business impact of the platform being down? Not the workloads, just the platform, just the control plane. Examples here could be loss of productivity: my platform is down, I cannot cf push, so I have 100 developers just sitting around doing nothing. Not great. Auditing and observability: my workloads are running, but I really don't know what's going on; I've lost my platform control plane and I have no metrics. Security patching, that's another big one. So measuring the impact of the platform being down, and coming up with an SLO, is really important.

Once we have these SLOs, we can start thinking about how to implement them. There are obvious things like: don't run BOSH jobs with one instance, always use multiple VMs, always use multiple availability zones. But the one I want to emphasize is the idea of using multiple independent foundations, possibly geographically distributed: different deployments, basically. There are challenges in that. It's the most expensive way of achieving HA, because you obviously have to duplicate foundations and also keep them in sync.
There are multiple patterns there. Active-active is a pattern where both foundations are serving traffic; active-passive is one where you have two foundations, one serving traffic and the other just waiting there for a failover. There are challenges, as I mentioned, in keeping them in sync and generally maintaining and managing them. There are challenges around data services: how do two different foundations access the same data services? And there are challenges outside the platform, such as traffic: how do we properly route traffic, or fail traffic over, in the case of a disaster? But this is an interesting area to look into for high availability.

Now let's talk about disaster recovery, as the second part. Again, we need to start by measuring our needs, our disaster recovery requirements, and here I'm going to introduce two metrics. There is RPO, or Recovery Point Objective, which is deciding how far back we want to be able to recover. That has to do with how granular you want your recovery to be. Can you afford to recover to last week's state? Can you afford to recover to yesterday's state? How frequently is your platform changing? And then there is RTO, Recovery Time Objective, which is how long the recovery takes.

So what does recovering workloads mean? What do we need to recover when we talk about recovering workloads? There is, of course, application code, which you may or may not be able to just repush. There is application configuration, environment variables, and service bindings. And there is application data. When we talk about recovering the platform, the things we need to consider start with the infrastructure itself.
That's any VMs, any networks, any storage devices; the configuration of the platform, things like orgs and spaces and security groups; and any installed buildpacks, custom buildpacks, anything you have on the platform.

So how do we define RPO and RTO? I don't have great answers for RPO, but for RTO, one thought I can give you is to think about what you've done for your availability solution. What is your topology? Do you have one foundation? Do you have multiple foundations? If you have multiple, how do they work together? Then base your RTO on that. If you have a single foundation, obviously you need a very low RTO: if you lose that foundation, or the workloads on it, you need to recover really fast. But if you have invested in multiple foundations serving traffic, then you can afford a higher RTO. Maybe it takes you eight hours or a day to recover, and maybe that's fine.

Once you have thought about the recovery objectives, we can talk about solutions, and here there are two main schools of thought. There is the backup and restore approach, which we're going to talk about mostly, and there is the automate and recreate approach. These are not exclusive; they can be used together. But one thought I want to leave you with is to think of both of them as a tax, because any investment and any cost you pay towards these approaches, whether taking backups or automating, doesn't necessarily have an immediate return. It's a tax you pay that you may or may not get value from in the future.

So when we talk about automate and recreate, what tools are there? There is bosh-bootloader, a very good tool to bootstrap your infrastructure and recreate your BOSH director. There is Concourse CI to actually redeploy your foundation, and any CI/CD tool to repush applications.
And there is a tool called cf-mgmt, "CF management" without any vowels, which you can use to declaratively define and push configuration to your foundation: orgs and spaces and any other sort of configuration. The idea is that if you can find a repeatable way to recreate your foundation, you can then ask your developers to repush their workloads. That's a bit challenging, because it means the developers have to agree on a CI/CD solution so they can all repush their workloads in a seamless and consistent way. But that would be the theoretical approach. The other approach is backup and restore, and Chun is going to talk to us about BBR.

So let's talk about BOSH Backup and Restore. BOSH Backup and Restore, what we call BBR for short, is a CLI tool for backing up and restoring BOSH deployments and BOSH directors. The BBR CLI is responsible for orchestrating the backup and restore workflow, and at the same time it provides hooks so that stateful BOSH releases can implement their own backup and restore scripts. The important bit here is that BBR, as the backup and restore tool, does not actually have full knowledge of how to back up a BOSH deployment or the BOSH director. It only provides the hooks and the orchestrated workflow; it relies on the release authors themselves to write the backup and restore scripts. The reasoning is that release authors are the experts on their own releases: they know how to back up and restore them, and as the Platform Recovery team we do not want to dictate that.

Let's briefly talk about what the data in Cloud Foundry and in BOSH directors actually is, so we know what we're trying to back up. In CF, we have individual components, like the Cloud Controller, UAA, and the router, and they all have state, which they store in a database. That can be an internal MySQL deployed alongside your CF.
It can also be an external database, like RDS or Google Cloud SQL. On the other hand, we also have staged applications, what we call droplets, and those are stored in CF's blob store. Similarly in BOSH, the director stores its data in a database, and compiled releases and packages in its blob store. So in short, when you use BBR to back up a BOSH deployment or a BOSH director, you are backing up some sort of SQL database plus a blob store.

Okay, let's dive a little deeper. Imagine you're a CF operator, you want to try out BBR, and you type a `bbr backup` command for your deployment called cf. What that does is: BBR finds all the VMs associated with this deployment, SSHes onto all of those VMs, and tries to find and execute all of the backup scripts there. Because those are backup scripts, backup artifacts get created on those remote VMs, and BBR is then responsible for transferring all of those remote artifacts back to wherever you're running your BBR command from.

And of course, if the goal is to be able to restore your entire platform, backing up only your CF deployment is not the end of the story. You also need to back up your BOSH director, which is covered by BBR, and, in addition to CF itself, any data services your applications may be relying on. If the data services are also deployed as BOSH deployments and have BBR scripts implemented, you can use BBR to back those up too. If not, you have to seek alternative solutions to back them up yourself.

So those are the basics. You're like: okay, I've started using BBR to back up my foundation, what else? Let's talk about some good practices to keep in mind when you're backing up your foundation. First is the frequency of the backup.
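Before getting to those practices, here's a rough sketch of what the full-platform workflow just described might look like on the command line. The director address, credentials, key paths, and deployment names are all placeholders, and the exact flags may differ between BBR versions, so treat this as illustrative rather than a recipe:

```shell
# 1. Back up the BOSH director itself (SSH key auth against the director VM).
bbr director \
  --host 10.0.0.6 \
  --username bbr \
  --private-key-path ./bbr-ssh.key \
  backup

# 2. Check the CF deployment is backupable, then back it up.
export BOSH_CLIENT_SECRET=some-client-secret   # placeholder credential
bbr deployment \
  --target https://10.0.0.6:25555 \
  --username bbr-client \
  --ca-cert ./bosh-ca.crt \
  --deployment cf \
  pre-backup-check

bbr deployment \
  --target https://10.0.0.6:25555 \
  --username bbr-client \
  --ca-cert ./bosh-ca.crt \
  --deployment cf \
  backup

# 3. Repeat step 2 for any BOSH-deployed data services that ship BBR scripts,
#    then move the resulting artifact directories somewhere off-platform.
```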
You should always, always align the frequency of your backups with the desired recovery point objective for the platform. As George said before, the recovery point objective is how far back in time you go at a given restore. For example, if you're restoring from a backup taken three days ago, the recovery point here is three days. If your desired recovery point objective is three or four days, you're good; if your desired recovery point objective is two days, that's not good enough. So you have to align the frequency of the backups with what you want.

Second, maybe consider using an external blob store or an external database. That way you're diversifying where you put your state, and hopefully that mitigates the risk when things go wrong.

And always put the job that takes your backups in a pipeline. It's pretty basic knowledge that you can't rely on one single person to take a backup every three days. That's not going to work: the person will be sick, or will forget. So put it in some kind of CI/CD solution. The good news is, if you're already on Concourse, there's a set of tasks maintained by the Platform Recovery team that can take a backup of your foundation. You can go check it out at that link.

And lastly, definitely try out the restore process at least once before you have to do it for real. The restore process can differ based on how your platform and deployments are deployed, and it can also differ based on what exactly is failing. Getting familiar with it ahead of time is definitely going to help at that once-in-a-lifetime restore.

Cool. So, just to close: backups are a tax. Backups can be large, they can be slow, and they can take a lot of storage to keep around. Automation is also a tax.
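On the pipeline point from a moment ago, a scheduled backup job might be sketched roughly like this in Concourse. The resource name, interval, and task file below are placeholders, not the actual tasks the team publishes:

```yaml
# Illustrative Concourse pipeline: trigger a BBR backup on a schedule.
resources:
- name: daily
  type: time
  source:
    interval: 24h   # align this interval with your target RPO

jobs:
- name: backup-cf
  plan:
  - get: daily
    trigger: true
  - task: bbr-backup
    file: ci/tasks/bbr-backup.yml   # hypothetical task that runs `bbr ... backup`
```

The key idea is simply that the `time` resource, not a human, decides when backups happen, and the interval is chosen from the RPO rather than the other way around.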
Agreeing on a common CI/CD setup with your developers may not be a very easy thing to do, and maintaining your own automation for your platform may also be expensive; it's a continuous investment. These are taxes worth paying, though, and it's up to you to choose which taxes you pay and how much of each you pay. I mentioned before that automate-and-recreate versus backup-and-restore are not mutually exclusive; you can choose a hybrid approach where you do some automation and recreation alongside backup and restore.

That's all from us. We are in the BBR Slack channel in the open source Cloud Foundry Slack, and you can find BBR on GitHub. Are there any questions? All right, let's call it.

Wait, wait. If there are no questions, I've got to ask one. Sorry, no, you can't escape this. I liked the part about recovery point objectives. In your experience, what is, let's say, the minimal RPO I can get to for a reasonably sized Cloud Foundry installation or foundation? Is it in the order of days, hours, minutes? Where can I get to?

So, we've seen people taking daily backups, which means an RPO of up to a day. We've seen others try to take hourly backups. I think the minimum really depends on how much you can spend on backups; it's a trade-off. If you can afford to take hourly backups, then yes: if you have an external blob store, taking a backup of the blob store is very cheap, and in a relatively small foundation you can take hourly backups, and that would be great. It also depends on how frequently the foundation is changing. We've seen some foundations with a very specific set of workloads that aren't necessarily updated on a daily basis, so a daily backup may make sense there. I think the most common frequency is daily, so an RPO of a day, if that answers your question.

Sure, thanks. Yeah. So the next question is about active-active foundations.
I'm sorry, I'm repeating it for the recording: active-active foundations, and what are we seeing in terms of data and data replication for active-active foundations? Unfortunately, I'm not the best person to answer that question. We've seen deployments of MySQL, for instance, or variations of MySQL, with the ability to be multi-site. Synchronization is a very expensive thing to do. So I don't have very specific answers to that, but I can definitely connect you with some people who might help. So come talk to me. Marko? Awesome, thank you.