So let's get started. My name is Carlos Santana, I'm with AWS, and I'm an EKS specialist solutions architect. I work with a lot of customers, mainly around EKS but really around building platforms, and I'm happy to be here in Vancouver. This is my first time in Canada. Well, I came once before and went to Niagara Falls for one day, so I don't think that counts; the real time I've spent here is these two weeks. I'm happy to see a lot of familiar folks that I usually see online, and now I can meet them in person. Before we start, this is what I found from people following me on Twitter. I was asking, what do Canadian people eat? And like Nikola said, Ruffles, and then another person said Ruffles too, so I went and found some Ruffles, and then another person suggested this candy. If you have more tips like that, tell me. The other thing is, who's on the Bluesky app? Who wants to be on Bluesky? I got my invite this morning. The only thing I will ask is, please take a picture, that money shot of me presenting, so I can show my boss that I was actually here. And if not, just come and talk to me; I'm very introverted, so come talk to me about GitOps, EKS, platform engineering, Crossplane, Backstage, all those things, and at the end of tomorrow post something on Twitter and ping me, and I'll pick one person. So let's get started. My title was too big, I think I had the longest title in the agenda, and I feel ashamed now; I should be better at submitting CFPs. So I shortened it: this talk is going to be about disaster recovery, which is a topic that comes up when people are using Crossplane, and it will also teach you a little bit about Crossplane. The new hotness now is platform engineering, like it was DevOps before that, and ITIL before that. Who remembers ITIL? Okay, so what's the hotness? In terms of platform engineering, the thing that I usually use is GitOps as a practice.
And that's the one thing that actually clicks in people's heads, where they say, that one I can start with. So if everything starts with GitOps, then platform engineering follows as a better conversation about how people can get started building platforms for their internal developers or internal teams. And then the thing is, this morning in the keynote people were saying that GitOps is not only for Kubernetes, but GitOps runs on Kubernetes. The controllers that we have today in open source are based on Kubernetes, reconciling your desired state with your current state, but they run on Kubernetes. The thing is, a lot of the people I talk to want to deploy more things than Kubernetes. Some of the end users I talk to actually don't run any workloads on Kubernetes at all; for example, they run serverless functions, or some other type of compute. But what they are looking for is guidance on how to deploy the dependencies that the workloads need, like a database, like an S3 bucket or object storage, and the things associated with them. It's not just an S3 bucket or an RDS instance or some other cloud resource: those things need to be configured with IAM, they need to be configured in a VPC, they need to be configured with their networking. So the way they configure all of that becomes very entangled, but they want the GitOps experience of a simple YAML, a high-level abstraction: just give me a database, just give me an object store, and make it available to my app, to my pod. And this is where things like Crossplane and ACK come in. Crossplane is a CNCF project in incubation with multiple companies contributing to it, and ACK, AWS Controllers for Kubernetes, is another open source project.

They do similar things: reconciling that YAML you have in Git, from GitOps, so that the infrastructure living outside Kubernetes gets reconciled, meaning the RDS databases, the S3 buckets, the VPCs. Actually, we have folks deploying Kubernetes from Kubernetes; I think that was mentioned in another talk today. Look at that: you have a YAML that says I want a Kubernetes cluster with these characteristics, and you give it to a controller which happens to be running on Kubernetes, but the idea is that GitOps is the main way of deploying that infrastructure. The next thing is the YAML itself, and everybody can laugh and say whatever they have to say about YAML. I started working with Terraform deeply in production three months ago, and I like YAML, that's the only thing I will say. But a lot of people use Terraform, and that's something I'm also working on; if you happen to be here and want to talk, I want to create a pattern in the Argo CD community for how we bridge Terraform to Argo. But anyway, YAML is kind of the declarative state that everybody uses to get to those cloud resources.
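That high-level "just give me a database" experience could look something like the following Crossplane claim. This is a sketch: the `PostgreSQLInstance` kind, its API group, and the parameter names are illustrative things a platform team would define in a composite resource definition (XRD), not built-in Crossplane APIs.

```yaml
# Hypothetical claim against an XRD a platform team would publish;
# the group/kind and parameter names are illustrative, not built in.
apiVersion: database.example.org/v1alpha1
kind: PostgreSQLInstance
metadata:
  name: orders-db
  namespace: team-a
spec:
  parameters:
    storageGB: 20          # the only knob the developer has to care about
  compositionSelector:
    matchLabels:
      provider: aws        # the platform team maps this to RDS, IAM, VPC wiring
  writeConnectionSecretToRef:
    name: orders-db-conn   # credentials land in this Secret for the app's pods
```

The composition behind the claim is where the entangled parts live (IAM, subnets, security groups); the developer only ever sees this small YAML.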
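The "GitOps deploys the infrastructure" flow described above is just an ordinary Argo CD Application pointing at the repo that holds those Crossplane YAMLs. The repo URL, path, and namespace here are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: infra-claims
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/infra-claims.git  # placeholder repo
    targetRevision: main
    path: claims/prod
  destination:
    server: https://kubernetes.default.svc   # the cluster running Crossplane
    namespace: team-a
  syncPolicy:
    automated:
      prune: true       # removing a claim from Git removes it from the cluster
      selfHeal: true    # drift in the cluster is reconciled back to Git
```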
But also, talking about YAML, whether you like it or not, there are other abstractions coming up, and there's another CNCF project called Backstage, which is used as a kind of developer portal. It's written in React, and it allows you to bootstrap things and also surface information for developers. The pattern that I'm seeing, at least the one I want to work on, is creating those Git repos with the YAML templatized, so Backstage can ask you for the inputs: what type of app do you want, Java or Python, what region, how big is the cluster, how big is the database. Just with that metadata, in a nice GUI, you can tell Backstage to go ahead and get me a Git repo, which you then manage with your pull requests and collaboration, and which reconciles your app in Kubernetes if the app is in Kubernetes. It could also be the app running in something like Lambda or ECS, or another cloud provider, or on premises; we've seen projects running HPC clusters on premises deployed from YAML.

Then comes the other part, which is the reason I did this talk. Everything is about platform engineering until the day the RDS database gets deleted because somebody clicked something the wrong way, or there are connectivity problems or brownout networking in a certain region. That usually doesn't happen, and we have multiple availability zones in the different cloud providers, at least in my cloud provider. But let's say you're migrating from one region to another; I'll use that use case. Then it's not my problem anymore if I'm on the platform team or the DevOps team, it's SRE engineering. But there are companies and end users that are not big enough to have the luxury of a development team, a platform team, and an SRE team.

So these tools and open source projects give you the abstraction, and end users running in production come together in open source projects to say, let's solve it once and everybody benefits. Small shops and big shops can benefit from it. So this becomes the SRE engineering problem, but at the end of the day it becomes an IT problem: you have to deal with it. Or, if you do have the two teams, the platform team needs to collaborate with the SRE team, saying, hey, we're doing this Crossplane thing, we're doing this YAML thing, and you need to be aware that if something goes wrong we need to work together on how we recover. What are the SLAs we need to be aware of? If we were deploying with Terraform, we knew how to recover with Terraform; now how do we recover with Crossplane? And by the way, you can run Terraform inside Kubernetes with something like the Flux Terraform controller or the Crossplane Terraform provider, so Terraform can also be enabled for GitOps. So let's talk, for those not familiar with SRE teams, about thinking about resiliency. We have two aspects of resiliency. One is making an app resilient through disaster recovery, the things that happen as one-time events, where it's okay to have tasks or runbooks you run manually because it's a one-time event.
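The templatized-repo flow maps to a Backstage software template, roughly like this sketch. The parameter names and target repo are made up; the action names reflect common scaffolder built-ins, and the exact set available depends on your Backstage installation.

```yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: database-claim
  title: Provision a database via GitOps
spec:
  owner: platform-team
  type: service
  parameters:                      # rendered as a GUI form in Backstage
    - title: Database settings
      required: [appName, region]
      properties:
        appName:
          type: string
        region:
          type: string
          enum: [us-west-2, us-east-1]
        sizeGB:
          type: integer
          default: 20
  steps:
    - id: render
      name: Render templated YAML
      action: fetch:template       # fills the claim YAML with the form inputs
      input:
        url: ./skeleton
        values:
          appName: ${{ parameters.appName }}
          region: ${{ parameters.region }}
          sizeGB: ${{ parameters.sizeGB }}
    - id: publish
      name: Create the Git repo
      action: publish:github       # the repo Argo or Flux will then reconcile
      input:
        repoUrl: github.com?owner=example-org&repo=${{ parameters.appName }}-infra
```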
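Running Terraform the GitOps way, as mentioned, can look like this with the Flux Terraform controller (tf-controller). The API version has changed across releases, so treat this as a sketch; the repository and path names are placeholders.

```yaml
apiVersion: infra.contrib.fluxcd.io/v1alpha2   # check your tf-controller release
kind: Terraform
metadata:
  name: vpc-prod
  namespace: flux-system
spec:
  interval: 10m
  approvePlan: auto          # auto-apply; drop this to require manual plan approval
  path: ./terraform/vpc      # module path inside the Git source
  sourceRef:
    kind: GitRepository
    name: infra-repo         # a Flux GitRepository pointing at your Terraform code
    namespace: flux-system
```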
And then we have the other aspect of resiliency, which is high availability: for example, in Kubernetes you have an HPA configured so you have HA. Or, if you watched Dan's talk, he was talking about Argo CD; in Argo CD we have a scalability SIG to make sure we have better documentation around running Argo in production, and one aspect of that is resiliency, running Argo in HA. It can be configured, and one of the easiest ways to deploy Argo, I think, is Helm; the chart has all the values in there to enable HA and to set the requests on the pods. By the way, a lot of people get this confused, including myself: HPAs are configured based on the requests, not the limits. You can set those values, and also Argo CD sharding, which is the Argo CD application controller configured with multiple replicas if you have multiple clusters, so that you can scale, with sharded controllers managing the different Kubernetes clusters. Those are the kinds of things we're working on, just pitching the scalability SIG we're starting in Argo. We meet regularly and we're going to do some benchmarking, so if you want to contribute, you're welcome to join.

Going back to disaster recovery: these are things that happen that are not expected. If you're on a cloud provider, they have all these aspects of resiliency in place for you to use. We use the shared responsibility model: the cloud provider gives you all the tools, with resiliency built across them, but you have to configure them properly; when you deploy something, make sure you configure it with anti-affinity so the pods land in different availability zones. But disaster recovery is about the things that are unexpected, and based on my experience with CVE events and large-scale events, and not just at AWS but at other places I've worked over the past years, everything fails all the time. And the thing with disaster recovery is that when things fail like that, it's not zero and one. Usually it works and then it doesn't, or it works halfway, or it works for these nodes and not for those other ones, and the time you spend figuring out why it's not failing completely, because that's not the way we designed it, is where you lose time to recover. That's one thing to learn if you have not done this before, and you have a better chance of success if you talk to the folks involved with it, the SRE engineering team.

So, recovery objectives. This is what we were mentioning with ITIL and DevOps practices, the two aspects you have. One is how much data am I going to lose when I recover, meaning that when you recover in the new place you don't have that data or that transaction. For a database, it depends on how often you take backups: if you configure backups in one-hour increments and something happens, you may have lost one hour of data. That aspect is called RPO, the recovery point objective. The other aspect is how fast you recover, meaning the downtime until the system is back up; that's the RTO, the recovery time objective. Those are the two metrics you have to be aware of when we talk about objectives and building your SLOs, service level objectives, or agreements. The next one is a spectrum. When people bring this up, and it comes into conversations around Crossplane, the questions seem generic, but a lot of folks are looking for specific answers when working with things like Argo or Flux or GitOps or Crossplane on disaster recovery. I go back to them, we have a conversation, and it turns out we're talking about the same things we would talk about if we weren't using Kubernetes at all: how much are you willing to pay
to get to one side of the spectrum. On the right side, multi-site active-active, I'm willing to spend a lot of resources on duplicates, with everything active, to reduce that RTO to almost nothing. Versus: I don't have the luxury of spending that many resources on duplicates. The cloud provider may also have avenues for active-active, and I'll mention one of them with the S3 bucket, but for example for a database, or depending on the type of database you're using, you can make different choices for different environments. It could be dev, or staging, or a new environment for a project that is not in production, where it's okay to take an hour or 30 minutes to create a new database, take the snapshot, and restore it. Creating the database can take 10 to 12 minutes; applying the backup can take maybe five minutes or 30 minutes, depending on how much data you have. But it's okay that the RTO is higher, because it's a lower environment, or it's not being used, or it's in a window, or you're just migrating from one region to another.

So let's talk a little bit about Crossplane. This is something I've been involved in with Crossplane, as a cloud provider: providing guidance on how to do backup and restore for end users, and hopefully bringing that into the documentation. One aspect of Crossplane is that when you install it, it's a stateless controller. You apply the Helm chart and there's no need for PVs or PVCs; they're just controllers, and any stateful information is saved in etcd. For these types of applications, even Argo CD or Flux, you would think that if you upgrade to a new version and find a problem, where everything works except one thing and you have to go back, you could just change the Helm chart version, or go to the Git repo Argo is watching and change the version, for example from 1.10.2 to 1.11.0. Well, for Crossplane there's an issue: you cannot roll back, and it's related to CRDs. When CRDs add new API versions, you cannot roll back. There's an issue open around that, and there have been other issues the community is working on, but while the community works on them, who maintains your things in production? You have to take backups for these kinds of things. Even for development I do it, because I don't want to waste my time waiting another 15 minutes to create a new cluster and delete the old one. There's an easy way to say: I'm going to update to a new version of Crossplane today to test something, to test a pattern and make sure it works with the new version, and I'll take a backup just in case, because I don't want to have a problem. So take a snapshot of everything in Crossplane, including everything inside the namespace, but also the CRDs and the cluster roles and cluster role bindings; that's easy with, for example, Velero, and I'll show that. Another one is provider upgrades and rollbacks. Crossplane has the concept of Crossplane core, and then you have providers that implement the different CRDs. For example, there's a provider from Upbound, I'll be talking about that one, for AWS; there's the AWS contrib provider; there's one for Helm, one for Kubernetes. What these providers do is reconcile the different types of objects they maintain, so they add the CRDs. When you upgrade those CRDs, which are managed by an owner, let's say version A of the provider, and you upgrade to version B, the ownership has to change over. There have been issues with different providers not handling that well, because that code doesn't come from the Crossplane community; every provider implements its own, and they're responsible for that code. So one issue out there is who has ownership of the CRD, and you can
find yourself in a deadlock. That's another reason to have backups. And then the other one is configuration updates. Crossplane has a concept of packaging your compositions: you put them in an image with some metadata, and you can also declare what version of a provider you need, for example the Upbound one or the contrib AWS one, the minimal version, and Crossplane will go fetch that provider and do you the favor of upgrading to the new provider. But what happens if there's a problem? It upgraded to a version that maybe I have not tested, or I have not done the diligence of taking backups for. So that's another piece of guidance, a tip to consider if you're thinking of doing packages: make sure you know what version you're going to get depending on what versions are available. I prefer to put those things in a Git repo and have a GitOps controller like Argo or Flux just deploy my compositions. If I make changes, it's easier to make them in Git: I push to Git and the composition gets deployed. That's my personal preference; you can use both, but I prefer to have control over what gets deployed, using GitOps.

And the last one was Velero. Velero is software used to back up Kubernetes resources, basically the things stored in etcd, using the Kube API. It also has support for PVs and PVCs, but in this context I'm focusing on the CRDs. One thing you want to do is enable the EnableAPIGroupVersions feature, because as you create CRDs in Crossplane, and there are a bunch of CRDs when you go to Crossplane, there are multiple API versions in the OpenAPI spec for each CRD. You'll have to enable this to make sure you back up not just the latest version but all the versions associated with the CRD, so you don't end up in a situation where, when you recover, some API versions are missing and then you have a problem. That's something to look into, whether you're using Velero or an equivalent from a partner like Veeam or CloudCasa or any other backup solution for Kubernetes.

I added this slide because in the hallway I was having a conversation with Greg, who has a talk today about Crossplane, on this aspect of observe-only. This is new in version 1.11 of Crossplane, and you have to enable it with a flag. What it does: it used to be that by default, when you create a resource in Crossplane, it goes and creates the S3 bucket or the VPC. Some users were asking: I want to create the resource in Kubernetes, but I just want to observe the thing that is already created. It could be that another team was in charge of creating the VPC, the subnets, the IAM policies, maybe using something like CloudFormation or Terraform, and this other team is going to deploy the cluster inside that already-created VPC and subnets, so they want to observe only. What the Crossplane community did was deprecate the deletion policy, the setting for whether deleting something from Crossplane deletes the real bucket or the RDS database, and combine it with observe-only into management policies. This is the new table; I don't know why it's not in the docs, so I'll ask after today, but it's there in the design doc. You can set this field on the Kubernetes resources for your S3 bucket, for your database, and configure it to say: if it gets deleted from Kubernetes, leave it orphaned, do not delete the real thing, because it has data, right? That's something you want in production; maybe in staging or dev you do want it deleted, to clean up. But you can also have observe-only, or full control, and this is a pattern you can use to say: I'm going to move workloads from dev to staging to prod, and maybe I want two prods, or maybe I'm moving one
cluster to another cluster, and I want to test things to see whether everything in Crossplane can observe everything, and then change the knob to make it full control. So it allows you to go halfway, to observe, and then to full control. That option is there; it's in the docs in the Crossplane repo.

So in terms of disaster recovery, this is what most people, well, I don't want to say most people, but the folks I talk to say: if you're going to do disaster recovery, why not have the second cluster, the new cluster, just install Argo, point it at the same YAML the other one is pointing at, and let it configure everything? Well, with Crossplane you may have two issues, one Crossplane-specific and one general. Let's talk about the Crossplane one first. If the things you configured have an external name, that's a field in Crossplane, that doesn't match the existing resource, and you have, let's say, a thousand databases, a thousand S3 buckets, and five VPCs, you will duplicate and create everything again, because you have a different UID. So you have to configure your claims and external names so that they land on the same name; that's one aspect of making sure Crossplane handles it. The second one is based on experience, and based on experience, when you think stuff will work, it doesn't. We've all heard that, right? I'm confident, I've stood up Argo a dozen times, and then Sunday hits, the thing hits the fan, and your wife is asking why you're awake, it's Sunday night, you're not supposed to work on Sundays. So this is about risk. If you're going to have prod, let's say prod one, and something happens, and maybe you have a second cluster or you're going to create one, prod two, you have to look for the fastest and easiest way to recover, not to have all the automation working on the second one. At least that's my personal experience: do not focus on getting all the automation up, the Argo that comes up and connects to the Git server. For example, things that have happened to me, and that people have told me about: they do backups, they put them in an S3 bucket, everything is fine, they have tested everything. Sunday hits, they want to restore, and guess what, the backup is in the other region, the one that is down; they forgot to put the bucket in the other region. You have to back up into the region you will recover in, and that one into the other one. That's one thing. The other thing is they use a Git server that is not GitHub; they might have a Git service from a cloud provider that is in the other region. Guess what. Another one: stuff happens with the thing that fails all the time. What's the thing that fails all the time? DNS. Oh hi, DNS. For some reason DNS decided not to work. So anyway, I don't take chances: look for the fastest way to have a backup somewhere close to that new environment, and snapshot as fast as you can, so you can get everything working, and on Monday you'll figure out what happened.

So here's the scenario. I'm working on a tutorial, and we have a repo posted at the end with all these examples people can go through. This is one example you can configure: you have the Kubernetes cluster on the West Coast, you have Velero on a schedule, and it backs up into an S3 bucket, which can be in that same region. But what you can do is configure replication of object storage buckets, which says anything you put in bucket one goes into bucket two, so you don't have to worry about the second region. Everything in bucket one, on a schedule, maybe nightly, monthly, hourly, Velero will put in that bucket, and the cloud provider, I guess the one I work for, will put it in the second one, and it will be there. What I would focus on is, when this thing
happens and you have to recover, just focus on getting that S3 bucket's metadata into etcd, into the cluster you're trying to recover, and just do that.

I have two minutes for the last scenario, which is the backup of a database. The first scenario was backing up Crossplane; this one is backing up a database. You have your database in one region, you have Crossplane deploying it, and you have Argo stating the YAML files. The first thing you do is back up, for example with Velero; you can use a paid or partner service, but anything that supports Kubernetes backups. In my case I'm using the open source project Velero, which backs up to the second bucket located in the other region, or you can do the bucket replication. And then for the RDS database, using the cloud provider tools, you back that up to an object storage backup. When something happens, the first region doesn't work, or there are network connectivity problems, or for whatever reason you decide you want Crossplane installed in a different region, you work in the second region, and the buckets are there. That was kind of the trick: create the second cluster, install Velero in that destination cluster, and then it will restore. That restore is a one-time operation, and it's okay, when you're restoring and doing disaster recovery, to run a runbook manually, because it's a one-time event, like I said. So you restore, and maybe you'll need some patching; there is a mutating webhook for that, because some fields in Crossplane will state the region, and you don't want to say west-2, you want to say east-1. With a webhook you can change that, to make sure everything gets created again correctly in the second region; you're recreating now, and the webhook can mutate that and get it working. That's another tip. Crossplane will create your new database, which is empty, with no data, taking maybe 10 minutes or 30 minutes. If you want active-active, you would need to create it beforehand, but in this case I'm okay creating it and spending the 10 minutes, and then restoring what you have in S3, and then you can reattach Argo. We're still using Argo, but reattaching is maybe a task you do on Monday. On Monday you come in and say, well, let's reattach Argo so it goes back to reconciling. By the way, we can use the mutating webhook for all of this because it could be at a scale you cannot deal with by hand: you can have multiple Git repos, or monorepos that are very complex, or a lot of Git repos that would need a lot of people pitching in with a bunch of pull requests. That's something you will not do on that Sunday while recovering, but you can do it the next week. The webhook is a way to mutate those resources, and eventually all the Git repos will match the desired state in the new region, and then the webhooks may no longer be necessary, or you can leave them there.

So, as a summary. Everything fails all the time, and like I said, it doesn't fail consistently; that's from experience. Short path to recovery: you want the fastest way to recover, and you don't want to configure the whole automation while you're trying to recover. Different failure domains: always make sure you back up to a place other than the one that can fail; for example, if you're backing up to an object storage bucket, back up to a different region. Another one, which I forgot to mention because it was pretty funny: there was an end user that put in detection for when something goes wrong, so it can alert them. Well, guess what: if you have a cluster in the West Coast region and you're going to put in some type of detection that will alert you that something's wrong, in which region do you put that detection logic? Who wants to guess, the same region or a different region? A different region, because if the region goes down, the detection goes down, and then nobody calls you. It seems obvious, but it's something that comes up; you'd be surprised. Crossplane rollbacks: Crossplane deals with a lot of CRDs, CRDs are complicated and have different API versions, so it's good to back them up. The auto-replication of object storage is good because you back up to one bucket and don't have to worry about the other one; it's just replicated. And then lower cost to recover: it has a high RTO, but you can accept that in certain environments. And with that, here are some resources: a Git repo my team works on with Crossplane examples for AWS, Velero, and GitOps, and the EKS Blueprints, which is from my team, all these blueprints we use to deploy Kubernetes and EKS using Terraform, CDK, and now Crossplane, with best practices and patterns. I think that's it. Thank you.
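A minimal sketch of the Velero piece of the first scenario: a nightly schedule capturing the Crossplane namespace plus the cluster-scoped CRDs, assuming Velero was installed with the EnableAPIGroupVersions feature flag and a backup location whose bucket is replicated cross-region. The bucket, namespace, and schedule names are made up.

```yaml
# Assumes Velero was installed with --features=EnableAPIGroupVersions
# so all CRD API versions are captured, not just the preferred one.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: crossplane-nightly
  namespace: velero
spec:
  schedule: "0 2 * * *"              # nightly, cron syntax
  template:
    includedNamespaces:
      - crossplane-system
    includeClusterResources: true    # CRDs, ClusterRoles, ClusterRoleBindings
    storageLocation: default         # BackupStorageLocation whose bucket is
                                     # replicated to the recovery region
```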
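And the observe-only and external-name knobs from the talk could look roughly like this on a managed resource. This is a sketch against the Upbound AWS provider; in Crossplane 1.11 the management policy was an alpha feature behind the flag mentioned in the talk, spelled `managementPolicy: ObserveOnly` (later releases evolved it into a `managementPolicies` list), and the bucket names are made up.

```yaml
# Sketch against the Upbound AWS provider; requires Crossplane >= 1.11
# with the alpha management-policies feature flag enabled.
apiVersion: s3.aws.upbound.io/v1beta1
kind: Bucket
metadata:
  name: orders-backups
  annotations:
    # Match the real bucket's name so a rebuilt cluster adopts the
    # existing bucket instead of creating a duplicate.
    crossplane.io/external-name: orders-backups-prod
spec:
  managementPolicy: ObserveOnly   # watch the existing bucket, change nothing;
                                  # flip to FullControl once you trust the new cluster
  deletionPolicy: Orphan          # deleting the k8s object keeps the real bucket
  forProvider:
    region: us-east-1
```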