There we go. Look at that. Technology. Welcome to this. I'm very thankful to Google for putting on this wonderful room, because this has just been an incredible technology track. So I'm very thankful I get to share something that, to us at Stark & Wayne, has been super important, which is both backing up and the perhaps 90% more important task of restoring data across many, many data sources. Because Cloud Foundry itself doesn't have a lot of opinions about backups. Like, it's a good idea. So perhaps, I don't know, read up on it. And certainly, when it comes to our users, if you're a Cloud Foundry user, you might do cf help -a. Because first you'll type cf, and you'll realize you get some commands. You'll do cf help, and that gives you the same set. And eventually, you find the -a flag and it gives you everything. And you'll scroll through it, and you'll be looking. And then you'll go, oh, I'll just grep. I'll try to find the command I'm looking for. And you grep backup. And there's nothing. All right, all right. What about grep restore? Grep archive. Grep help. The Cloud Foundry CLI, the Cloud Foundry Service Broker API, nothing really has any opinion about backups, which is a pretty bad place to start. And it doesn't really matter even if you could back up, because if you don't know how to restore, you're nowhere near having a system you can love. And so, yeah, in the previous talk, we talked about monitoring being the base. It's the same when it comes to production systems: if you can't restore, it's nothing. It's not production. That's for sure. It's a toy. You can be excited that it works at all, but don't share it with your friends or people you hope to continue to be employed with. So let's go back to that. So from your user's perspective, when I was doing my honors year, I'll tell you my quick disaster recovery story.
We all have these stories, and perhaps this is why I care about this so much. When I was finishing my honors year, you have a thesis to write, and it might be so many pages. You do not write them progressively throughout the year. You write them all at the end. And in 1996, Microsoft Word 95 was the thing to use. Okay, some people used LaTeX, but I had no idea how that worked. But Word 95 really couldn't deal with big files. So even though my thesis was only 80 pages or so, I still broke it up into lots of chapters. And I was working happily on chapter three. This is how well I remember this. This is the pain that's about to come. It has burned into my skull that I was on chapter three. And I went to move to work on chapter eight. And chapter eight was not there anymore. And so I looked in Windows Explorer, and all the chapters weren't there anymore in their separate little Word files, except chapter three, which was happily still there. And so I was on the assumption that we had backups. I certainly knew that we did, because I'd been told, put everything on the Z drive. That's the shared departmental volume that's backed up. So I'd done my job. It's a platform that's run by the IT department. And so I walked down the stairs to the IT department, the IT support group, and I said, I can't believe it. I've done so much work in the last 24 hours. And I can't believe I've lost all my files. Can you please recover them? And that's when he broke the first piece of news. He said, well, actually, we do backups every two days. Oh, my Lord. I mean, I was already heartbroken for the amount of work I'd lost in the last 24 hours. Because you do know that the phrase daily backup is polite code for 23 hours of data loss. It sounds like a real phrase. You know, you hear it all the time. Daily backups, this man knows what he's doing. I'm going to another meeting. So yeah, so two days. So we're at 47 to 48 hours' worth of data loss.
This is pretty exciting because literally a large amount of work was done. And then he looked at the backup drive, because it was a thing, a physical thing. And he said, oh, that's interesting. Nothing good's about to come here. He says, this hasn't worked for the last six times. I'm down 14 days' worth of work, which to be fair, nothing good had been written 14 days earlier, which would be the basis for continuing my professional career as a human. And this has taken some time. And at this point, one of my peers had skulked into the room. He had a whole demeanor about him that suggested he was intimately aware of what was going on. And he said, I'm really sorry, but I thought this would be a lot funnier. He'd found my password and moved all the files to another folder. So anyway, it turned out really well. I've never spoken to him ever again. It's nice to know where all my enemies are in one go. So everyone assumes you've got backups until you find out that, oh, we don't? Oh, this is excellent. This is really good. I delegated this all to you and you didn't do your thing. And no one knows whether they work (that's monitoring), and no one knows how to do the job at the time you actually have a disaster. Because the time you need to use the restore function isn't a great day. It's not really the day you want to be looking up how restores work. So there's a whole culture around the whole restoration process: anything you introduce, you should know how to do disaster recovery on before you share it. And you should train and practice. But then the bigger question is how much you do it. So we've been working on a thing for the last year and a half called SHIELD. Obviously, being Cloud Foundry people, we use it in a Cloud Foundry setting. It is completely generic. You can use it for anything, to back up anything, run it any way you like, and have anything you like at the back end.
And it's really interesting and it's working really well for us and customers and a growing number of people who we find out are using it because they hang out on the Slack channel. So I thought we'd share it. Just to quickly answer the question of how do you run it, and then we'll get into perhaps how it works and that sort of stuff. Really, it's that. Thanks to the wonders of BOSH 2, we now have this. It's on a branch that we'll merge soon, as that's the next sort of major version. But this is as simple as we can get it to run SHIELD. There are a couple of little patch files you might add in if you wanna pre-configure things. But there's also a CLI, which we'll talk about, for configuring Google, for example, as the back end. And then you've got SHIELD running on a server, because SHIELD itself is the scheduling engine. So it just needs to be there so it can start the jobs at 3 a.m., or whenever it is you wanna do backups. And so we'll just quickly visualize it as a box. So it's both a web view, which I'll show, and also a running sort of daemon for running the scheduled jobs, which you or systems can set up. So the dashboard looks a bit like this, and you can sort of see what's been happening. It's super important that you can see whether things are working, again, monitoring, and you might wanna hook this in to other systems so you can be alerted. At the core is this idea of jobs. A core high-level idea of what we're trying to do with SHIELD is, yes, you might wanna trigger backups, and you can do that, we'll show you that. But it's a fire-and-forget thing. You wanna know that it's gonna keep working. The dashboard and the events are gonna show you that it has been working. Not just the whole system, but that each job is working. You wanna know when it's not. And so each job is sort of like this join table of different ideas. So a target is a thing that we wanna back up.
Whether it's a system component like the BOSH database or Cloud Foundry's database, or a service instance database. So there's nothing system-level-only about this. We've been using the SHIELD agent inside of Habitat containers and Docker containers and all sorts of things, dynamically, so the moment a service instance is created it will register itself to be backed up. And it will just pop up in the SHIELD dashboard. The store is where we're gonna send things. So Google Compute, Amazon S3, Azure's bucket system. And we'll talk about plugins, but you can write those for yourself. Then the schedule: it's visible as to how often and when. That's a text string, so to speak. But I mean, there is a language for describing the schedule. Retention: most companies will have some sort of policies around retention. It's one thing to have a policy. It's another thing that people know and can see that it's the right policy, eyes on this stuff. So it's right up there. One day probably isn't anyone's retention policy, it just entertained me. And again, visibility: is this job fundamentally working? The last task that this job tried to run, did it work? You don't wanna be the IT guy that gets me turning up and has to tell me that the last six haven't worked. Please don't ever be that person. Now, running these things, you can actually just run them. There's a whole bunch of different ways. Obviously the schedule works, but there is a little button you can press, and there's a CLI. So for the most part I'm gonna talk about the CLI, because it's the most expressive thing. And there's an API. I remember when people used to learn about BOSH, the BOSH CLI, and they assumed there was no API. I don't get it, I never got it. Like, if there's a CLI and the thing's over there, it has to talk to it somehow. Guess that there's an API. So similarly, the SHIELD daemon's over there, your CLI's here, it's talking to it somehow.
There is an API and you could look at that. But the shield CLI is really good and will probably do most of the things you want. So we can look at the jobs and we can trigger a job to run. And, I mean, you could sort of do this programmatically. At the time that you're shutting down a service instance, or draining your service instance, or draining a machine, you could automatically trigger it to do a backup. Obviously we have our schedule, we can visibly see what's going on there, and it's doing a good job. None of that's important in the slightest if you don't know how to find the backup that best serves you at this point. And so restoration is, I guess, SHIELD's most important feature: after you do the restoration, you want that system to be off and running again. So different plugins work differently, but here you can visibly see it's interactive. I'm a fan of interactive CLIs, because if a CLI's not interactive, it kind of is anyway, because you go through this iterative process of being wrong. It's interactive in that you get it wrong, go back to help, you get it wrong, go back to help, you get it wrong, and eventually it works. So it's sort of still interactive. So why not just offer a helpful interactive menu? I'm not sure if anyone from the BOSH 2 CLI team is here, but I'd like that to be interactive. Hands up who likes interactive CLIs? It's completely off the topic, I just, you know. Hands up who likes to have no interactive features whatsoever, with all the negative connotations I've already given that make you wrong. Cool, all right, that was good. Far larger group of people. It's all about how you frame the question. The CLI's great, but really this is super helpful. Now at this point, SHIELD is only a sort of baker or so.
It is a global thing, so future work might be to make it sort of specific to service instances, but at the moment, if you had service instances, only sort of an admin would want to look at this. But yeah, you press the restore button and it doesn't matter how the thing works, it will go and restore it. All right, so let's have a look at the basic architecture. And there's this pattern you can imagine for any service. Now the agent and the database don't need to be co-located necessarily. It's a useful pattern, because where else are you gonna put it? I mean, the agent has to run somewhere, and putting it next to the database, or one of the nodes of the database, is a place. For some of the plugins it has to be local. There's a file system plugin. You need to access the file system, so you'd want to be on that machine. The Postgres agent, so the agent talking to Postgres, theoretically could be on a different machine, because it just needs to talk to Postgres. But in this example, and in many of the setups I use, I put them on the same machine, or the same container if you're running inside Docker, or in the same sort of Habitat plan if you're using Habitat, or I guess a Kubernetes pod. In this case, yeah, what's noteworthy in this diagram is that the agent is the thing that talks to Google Compute. The backups are not all coming back to the daemon and then being shipped off in this picture. The daemon is merely the thing that triggers the request to do the backup or triggers the request to do the restore. So a quick look at the details of this sort of join table, the job. Again, the two parts. We have the target, so each target we configure is going to be a specific thing to be backed up. And it's going to be relative to where this agent is running. Since this agent's running locally, I probably don't, in this case, need as many credentials. Often you can run locally without credentials and just say vcap or whatever.
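The data path described above, where the daemon only triggers the work and the agent ships the archive to the store itself, can be sketched roughly as follows. This is a minimal in-memory illustration, not SHIELD's implementation: the `Target`, `Store`, `Agent`, and `Daemon` types, their methods, and the fake database and store are all hypothetical names invented for the example.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// Target and Store stand in for plugins; both interfaces are hypothetical.
type Target interface {
	Backup(w io.Writer) error
}

type Store interface {
	Put(r io.Reader) (key string, err error)
}

// Agent runs next to the data and owns the whole data path.
type Agent struct {
	Target Target
	Store  Store
}

// RunBackup streams the target's dump straight into the store.
// Only the archive key is returned to the caller.
func (a *Agent) RunBackup() (string, error) {
	pr, pw := io.Pipe()
	defer pr.Close() // unblocks the writer goroutine if Put returns early
	go func() {
		// A nil error closes the pipe cleanly, so the reader sees EOF.
		pw.CloseWithError(a.Target.Backup(pw))
	}()
	return a.Store.Put(pr)
}

// Daemon only triggers work and records keys; archive bytes never pass through it.
type Daemon struct {
	Agents   map[string]*Agent // target name -> agent
	Archives map[string]string // target name -> latest archive key
}

func (d *Daemon) TriggerBackup(target string) error {
	agent, ok := d.Agents[target]
	if !ok {
		return fmt.Errorf("no agent registered for target %q", target)
	}
	key, err := agent.RunBackup()
	if err != nil {
		return err
	}
	d.Archives[target] = key
	return nil
}

// fakeDB and memStore are in-memory stand-ins so the sketch runs anywhere.
type fakeDB struct{ dump string }

func (f fakeDB) Backup(w io.Writer) error {
	_, err := io.Copy(w, strings.NewReader(f.dump))
	return err
}

type memStore struct{ blobs map[string][]byte }

func (m *memStore) Put(r io.Reader) (string, error) {
	var buf bytes.Buffer
	if _, err := io.Copy(&buf, r); err != nil {
		return "", err
	}
	key := fmt.Sprintf("archive-%d", len(m.blobs)+1)
	m.blobs[key] = buf.Bytes()
	return key, nil
}

func main() {
	store := &memStore{blobs: map[string][]byte{}}
	d := &Daemon{
		Agents:   map[string]*Agent{"cf-db": {Target: fakeDB{dump: "-- sql dump --"}, Store: store}},
		Archives: map[string]string{},
	}
	if err := d.TriggerBackup("cf-db"); err != nil {
		fmt.Println("backup failed:", err)
		return
	}
	fmt.Println("daemon recorded key:", d.Archives["cf-db"])
	fmt.Println("bytes held by store:", len(store.blobs[d.Archives["cf-db"]]))
}
```

The design point is that `Daemon` holds a map of keys and nothing else, so a large archive costs the scheduling machine no bandwidth or disk.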
For anyone that looks long enough, you'll see ngrok. That was me playing on my laptop. So I literally had the SHIELD agent running on my laptop. I had ngrok, which is this remote tunneling thing, coming into my laptop. I had the SHIELD daemon running somewhere else on the internet, and I was backing up my laptop's database because I thought that was entertaining. Sometimes, you know, because you can deploy this really easily, as I said, with BOSH, and it just works, you go, this is awesome. And then you're about to give a talk on it and you realize, oh, I don't know if I know how it works exactly the way I think it does. And it turned out I was right. I didn't know. And so my approach was more painful for me, and for all the staff who had to help point out all the ways in which I got it wrong. So I learned a lot. And the little demo I put together that ended up being these slides, I do want to put together as a little sort of walk-through tutorial. Even though that's not necessarily how you're going to run SHIELD, you're probably just going to use the BOSH jobs to run it, because it's really simple. But understanding how it works in every way is going to make you more confident, and then you can really expand it and use it in all the ways that it's pluggable. This job also describes where we're going to back up to. Whilst you're going to have a lot of targets for all the different data things you want to back up, BOSH's database, the Cloud Foundry database, and all the service instances, you might only have one store, being the one bucket where all backups go. If you have policy reasons, you might need more. But that's sort of the join table. And so, these two things: the reason this all can work for an arbitrary set of data sources and an arbitrary set of stores is the notion of plugins. So here we have the Postgres one. The reason that SHIELD knows how to talk to Postgres is it doesn't.
It just calls out to a little CLI app. So SHIELD needs to know nothing. But the Postgres plugin knows how to talk to Postgres. It knows how to call out to pg_dump. It knows how to call out to pg_restore. And the Google plugin, it knows how to talk to the Google API. SHIELD doesn't. So for example, when the request comes through, the agent calls out to the local postgres plugin CLI, which is unfortunately named similarly to Postgres' own commands, but fortunately, it knows which one to use. And then it calls out to pg_dump. And then it tarballs it all up. Does it tarball or bzip? Bzip. It bzips it all up, just in case you're specific on that. And then it ships it off to Google. And this is what it looks like. I now get to talk at the Google track because I showed Google's product. Thank you very much. I really like their dashboard experience. It's actually really nice. The credentials thing is kind of a little quirky. Like, you get this JSON file that you sort of walk around with rather than keys and credentials. But again, that's only if you're outside. If you're inside, the machines can sort of access it. We give them a different set of credentials that they just pass straight through. All right, so when it comes to doing the restore, why did I show this picture? Oh yeah, so restore, really important. Otherwise, you wouldn't need to do backups. You know, it depends how long you want your job. If you're okay with your job just lasting until you need to do a restore, it's probably fine. What are you working on? I'm working on a couple of secret things. Well, one of them, restore. No, no, it wasn't. Sorry, it's good to have this. So it works in reverse. The request comes from the daemon. So the CLI talks to the API, the daemon. The daemon calls out to the agent, because it knows which agent to talk to based on the target. And the agent calls out to the Google plugin CLI, which knows how to talk to Google.
Again, plugins. And then it's the reverse of that: it unpacks the bzip file and then uses pg_restore. And you can imagine writing a plugin for anything. There are a whole bunch of them that already exist. I'm not 100% sure why they're all clumped together, but I guess some of them do have an overlap. Anyway, it's a little messy. So I've put the icons in for your benefit. And mine. Pretty much most things I do are for my benefit. Hopefully my benefit is your benefit. This helped me. So you can sort of see the different places that you can store to. If the one you want isn't there, make one. SCP? Make a plugin. It doesn't have to go in here. You can put it in your own repo. As long as by the end it becomes a binary executable, and I'll show you what it needs to look like. And similarly for the data sources. There's no way that we're close to finishing this list, because when you look at the complete contract of what you wanna be able to do, you wanna be able to do the restore where the system is left running correctly at the end. So we're playing around with using the file system plugin. The file system plugin copies files. Yeah, that's what it does. It copies files into a bzip file. And then the restore function unpacks them. It's pretty special. And we thought we could use this for Redis, because Redis writes things to the disk. All the data's there. But unfortunately, when we do the restore in reverse, you press that restore button, you unpack, and Redis, the process, does not know this. And it will carry on. And so that's not ideal. So we may need to write a Redis plugin, or perhaps a Redis Habitat plugin, because the plugin is gonna need to know how to talk to Redis, tell it to reload the data or restart, or manage that process lifecycle. And that's fine. As long as it's the end-to-end function, knowing that pressing restore is gonna end up with your system working again, that's the contract.
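The Redis problem described above can be made concrete with a small simulation. It assumes a toy `Service` that loads its data from "disk" once and then serves from memory; the `Service` type and the two restore functions are hypothetical and exist only to illustrate why a files-only restore does not satisfy the contract, while a restore that also reloads the process does.

```go
package main

import "fmt"

// Service models a process that loads its data from disk at start
// and then serves reads from memory. Both maps are stand-ins.
type Service struct {
	disk   map[string]string // pretend files on disk
	memory map[string]string // what the running process actually serves
}

func NewService(disk map[string]string) *Service {
	s := &Service{disk: disk}
	s.Reload()
	return s
}

// Reload re-reads disk into memory, as a restart would.
func (s *Service) Reload() {
	s.memory = make(map[string]string, len(s.disk))
	for k, v := range s.disk {
		s.memory[k] = v
	}
}

func (s *Service) Get(k string) string { return s.memory[k] }

// Set updates the live value and persists it.
func (s *Service) Set(k, v string) {
	s.memory[k] = v
	s.disk[k] = v
}

// FilesOnlyRestore unpacks the archive onto disk and stops there.
// The running process never hears about it.
func FilesOnlyRestore(s *Service, archive map[string]string) {
	for k := range s.disk {
		delete(s.disk, k)
	}
	for k, v := range archive {
		s.disk[k] = v
	}
}

// ServiceAwareRestore also tells the process to reload, which is the
// extra lifecycle step a service-specific plugin would own.
func ServiceAwareRestore(s *Service, archive map[string]string) {
	FilesOnlyRestore(s, archive)
	s.Reload()
}

func main() {
	svc := NewService(map[string]string{"greeting": "hello"})
	archive := map[string]string{"greeting": "hello"} // backup taken here
	svc.Set("greeting", "corrupted")                  // the disaster

	FilesOnlyRestore(svc, archive)
	fmt.Println("after files-only restore:", svc.Get("greeting"))

	ServiceAwareRestore(svc, archive)
	fmt.Println("after service-aware restore:", svc.Get("greeting"))
}
```

The contract is judged by what `Get` returns after the restore, not by what is on disk.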
That's more important than whether plugins are easy or hard or numerous. And the plugins are pretty easy to write. Here they're written in Go, but fundamentally, you can write them in anything as long as they fit this CLI contract of these functions. So the Postgres one, the Google one, see how they look exactly the same. You start to see that they have different configuration. And finally, you know, when I wrote these slides, I could have gone into a deep dive on how you write one and everything. But you don't need to know that right now. What you need to know is that you should not go to any more talks after this, except Kevin's, which is next, wherever Kevin is talking. You should go to see Kevin. And then you should come and see my Docker talk. So forget what I just said. You should wait patiently to the end of the conference. And then go and look at SHIELD and how you're gonna start backing up and restoring stuff. Because really, it shouldn't take that long. You should have success with it far sooner than with any cobbled-together system you might have done. And I've seen some quite horrendous attempts. Horrendous attempts at backup and restore where they really were only focused on backup. They didn't have great ideas of how they were gonna restore things. Like just taking snapshots. Snapshots are a lot like the file system plugin. It's like saying, yeah, it's over there. How are you gonna restore it? I don't know, I won't be here that hour. I will be looking for a job over there. Yeah, here's a trick. Look through your ticket queue. And if you see two tickets, one called set up backups, and the other one called set up restores, you should find somewhere else to work. They are not two tickets. There is no such thing as a backup ticket. There is a backup and restore ticket. It's like, how do you know the backups are working? I just saw them over there. The restore is the most important part.
And the corollary to that is your team knowing how to do restores. That is really the use case. It's not that there is backup and restore. It's that in the event that we need this, all the appropriate people know how to do it, and with confidence. That's the story. It's as much a teaching story as it is a functional story. There are lots of different ways to run it. It's just an agent and just plugins. So you could write Chef if you wanted to. You could manually do it like I did on my laptop. A couple of common ones: there's the BOSH job, which you could use to run it, co-locate it, or put it anywhere. And a growing number of examples. And then there's also a Terraform plugin. It's not yet been merged, but with Terraform 0.10, there's a huge backlog of tickets for Terraform, and I think they just decided that they wanted to just close them all, because they're not our problem. So they're gonna have lots of different repos. So this sort of thing should become more accessible. But there's another way to set up SHIELD. And obviously, I mentioned Habitat. Habitat's this thing that came out from the people that did Chef. Chef is very old and we make fun of it. That hasn't changed. I mean, you can't fix certain things. It is what it is. But the people that did Chef have seen the world that we want to solve, and they want to make it really easy to build sort of self-organizing peering systems. So you want to do Postgres clusters? It shouldn't be hard. It shouldn't be hard, except if you go to the Postgres Docker image, the canonical one, the underscore slash one. The one that's the standard. It has all the important description of what Postgres is. Highly available, continuous streaming archives. Sounds wonderful. That Docker image can't do any of that. Because it's hard. We do a lot of stuff with etcd or Consul to form clusters. I've gone off the track, but I have time.
Then you start to make up stuff, and thank God for etcd, ZooKeeper, and Consul for the ability to coordinate locking and leader election, those sorts of things. Then you've still got a lot of bespoke code for templating and the whole process of changing the role of, say, Postgres from leader to master to no longer master. That's a very special role. That's called deleting everything. You can't go from was-master back to master. That's not a thing. If someone else is master now, you need to forget whatever you were. You need to delete everything. The whole looking after of a cluster through the 10 years of its life is non-trivial. And so Habitat is an attempt to provide hooks and plugins to make that interesting. So we find that really interesting. So SHIELD is sort of an agent that you can use and just run inside the Habitat plan. So it just runs. Basically, a Postgres will wake up and just start backing itself up. That's really interesting. Please come and hang out with us in the SHIELD channel and ask questions, get started, add plugins. To finish off, if you think, no, I don't need any more conference, I need some personal time, I just want me and 250 blog posts, that's what I want: there's 250 blog posts. I wouldn't read them all. Skip all the ones I wrote. No, I'd read all of mine. They're awesome. But yeah, just flick through and see if there's anything that we had to do that was just not obvious, where we just know everyone's gonna have the same problem. So we just try to write blog posts to share these ideas. So a lot of them are little tidbits, little tutorials, little stories. If you want help with SHIELD or anything else, please ask us for help. If we can't help you, we'll find someone who can. We'll tell you where to look. And yeah, if you ever have any thoughts whatsoever on SHIELD, backups, the future, please come to the Slack channel and hang out. Thank you very much.
The Slack channel is on the Cloud Foundry Slack, so you go to slack.cloudfoundry.com and sign up. Many, many good Slack channels there. Questions, I forgot that I should ask for questions. To service brokers. Like I said before, the whole Cloud Foundry thing is not really designed for this. So yes, in the Open Service Broker API world, there's an Open Service Broker channel, and we could talk about it from there. Me and a few of us believe, let's just embed it. So have a restore function. So the Dingo PostgreSQL we did included a few of these ideas, like allowing things to be recreated. And then when they wake up, they just automatically restore themselves. It doesn't really have a good idea for point in time. But no, so there are some ideas, and they're the sort of ideas a person would come up with if they didn't want to change Cloud Foundry. Because you want to help everyone who's already got Cloud Foundry. And some of them are Pivotal customers or whatever, and they have older versions of Cloud Foundry. So to fix Cloud Foundry kind of says, none of you get to be helped. So I'm a little bit tortured about that. But I would like the service broker API to be expanded to have first-class ideas about backup and restore. All right, so multi-tenancy. So the other part of it is, if you owned a service instance, it was yours, you had access to it, can we do a filtered view on the SHIELD dashboard that, through the UAA, said, okay, you get to see this one, but none of the others? Right, so BOSH Backup and Restore, I saw the talk, and BOSH Backup and Restore has this idea of, all right, let's sort of snapshot Cloud Foundry right now. And so it's synchronized across everything. The implementation of each of those little things might be, let's call the SHIELD daemon to tell it to do something. Yeah, someone's doing something. I mean, James is one of the SHIELD people, and he said, yeah, he knows about it, he's looking at it. All right, you're awesome.
Thank you very much.