Welcome everyone, and first of all thanks to the DevCon sponsors for getting us here. Today I'm here with my colleague Alessandro. We are part of Red Hat's site reliability engineering team, working on managed services with a twist, which we'll come to later. Our talk is about how we managed to retain some of our sanity in what we do, and to show what we've built over the last few months.

So, my name is Alessandro, I'm a senior SRE. Let's start our talk with an example application that will help us through the whole thing. We have the three standard layers: UI, API, and DB. Here we have an example of the resources we might deploy on a Kubernetes cluster. This is just an example, there would be more, so let's focus on the three green ones: two Deployments for the UI and the API, and one StatefulSet for the DB. It will be helpful later.

Let's assume you already have this beautiful application running on your Kubernetes cluster. Everything works fine, life is good. But then at some point the dev team wants to deploy an update for your application. You, as an SRE, don't have big processes in place yet, you just started, and from the dev team you get the good old folder with some YAML files containing the Deployments and the StatefulSet for the database. You look at it, it's all fine, so you accept the challenge and you deploy it.

So what do you do? You use your beloved kubectl command and you apply the whole folder, and after a few seconds everything seems fine. You're very happy about how cool Kubernetes is, what a beautiful tool you're using. But you're smart, so you go and check: did this actually work? You check the Deployments and the StatefulSet with your beloved kubectl, and you see that everything is running fine. Now you're super happy and you go grab your coffee.

A few moments later, when you go back to your laptop, you look at your Slack icon and, well, it's lit up. You check one of the chats and an annoyed dev tells you: "Hey, the new UI is down. What have you done?" And the panic starts. You go and look, and what the hell, the pods for the UI are all crash-looping. What's going on? The panic keeps going, but then you have a moment of sanity and you remind yourself that you can undo what you just did. You regain your calm and you undo: basically, you roll back the Deployment for the UI, and you think that everything is okay. But after a few seconds you're panicking again, because it's still crash-looping. What is going on?

You write to the same dev in the Slack chat, and the dev tells you: "Well, the old UI doesn't work with the new API, because we made some breaking changes there." You're still panicking, but you think: I can do that again. So you roll back the API Deployment as well, hoping that the database had no changes. You try that with the same command as before, and it actually rolls back. At this point you regain your calm, because you see that everything is running fine. Luckily the database had no changes, so rolling back everything solved the problem, temporarily. This is a typical day in the life of an SRE: things keep breaking randomly and you don't know why. But let's try to see what happened, the events that brought us here.
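As a quick recap, the whole episode boils down to a handful of kubectl commands. A minimal sketch of that afternoon, with made-up folder and resource names:

```sh
# Apply the whole folder the dev team handed over (folder name is made up):
kubectl apply -f ./update/

# Check that the rollout looks fine:
kubectl get deployments,statefulsets
kubectl get pods --watch        # ...until the UI pods start crash-looping

# Roll back the UI Deployment to its previous revision:
kubectl rollout undo deployment/ui
# Still crash-looping, so roll back the API as well:
kubectl rollout undo deployment/api
```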
So, what happened? We received some YAML files from our dev team, we trusted them, and then we used our kubectl tool, though in the end we could also have used Kustomize or Helm to do the same thing. What those tools did was simply push resources to Kubernetes and let Kubernetes do the reconciliation. I say "simply" because that's what they are designed for, so they worked as expected. And now I have a question for you: which step is to blame for the failed update? (Someone says: the developer.) Well, no, the YAML files were fine, but things kept breaking for some reason. We don't know why; the content of the containers may be broken. The reality is that in this picture none of these steps is to blame for the failed update.

So from an SRE point of view we actually have a problem, because our tools essentially say "we did our job, don't look at us for the failed update, it's not our fault." But as SREs, the update still failed, and the update could have caused our SLOs to fail as well. And the SLOs are the thing that actually matters in the end for an SRE. So we have a problem, and our mindset as SREs is to analyze the problems we face and try to turn them into ideas to improve our workflow or tooling, and so on. So let's try to analyze the problem we just had.

First of all, we have loosely coupled objects. Every object that we push to the Kubernetes cluster is loosely coupled. What if instead we had something that allowed us to bundle those resources together in one package? Spoiler!

Second, the status of all these loosely coupled objects is distributed and heterogeneous, so it's hard, and takes time, to see whether the Deployments and all the other resources succeeded. What if instead we had an aggregated status for this bundle of objects that tells me whether the update actually worked?

The third point is the all-at-once rollout. You push all the resources together at the same time and then you let Kubernetes do the reconciliation, but sometimes this is not enough. What if instead we could have incremental, controlled rollouts of those resources? For example: we deploy the DB, and if it works we deploy the API, and if that works we deploy the UI. (See the sketch after this list.)

The fourth and last point for this slide is complex rollbacks. We saw that Deployments have an easy mechanism for rollbacks, but what if we could have the same concept for any type of resource that we deploy to the Kubernetes cluster? That would be very nice, right?
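Written down as a single made-up object, the whole wish list might look like this. To be clear, this is a thought experiment, not a real API: just the four ideas (one bundle, one aggregated status, an ordered rollout, a rollback history) expressed as YAML:

```yaml
# Purely hypothetical resource, only to illustrate the wish list:
apiVersion: example.org/v1
kind: ApplicationBundle
metadata:
  name: my-app
spec:
  rolloutOrder:            # incremental rollout: db first, then api, then ui
    - db
    - api
    - ui
  revisionHistoryLimit: 5  # keep old revisions around for easy rollbacks
status:
  phase: Available         # one aggregated status instead of many object statuses
```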
And now I'll hand over to Nico for the second part.

Yeah, so I'm now here to tell you why we can't have nice things in what we do: because of scale and compliance. Scale! Everyone loves scale. And compliance, boo. If we take the ideas that Alessandro just introduced and try to turn them into solutions, well, we had some excellent talks during the last days about exactly this. You need to bundle your resources together? You want something like GitOps: you put all your stuff into a repository, and if you need to roll something back you change something in the repository, magic happens, and it appears in production. You have a problem with status, you don't know what's going on? There are solutions for this: you have monitoring, at least I hope you have monitoring. Even for more complex processes like incremental rollouts you can rely on Argo CD, or, essentially, pick your poison; there are plenty of CI/CD tools out there with plenty of instructions on how to set up sensible rollout strategies. And in the end, for rollbacks, you can make backups; Velero is a nice tool for Kubernetes backups. Or, there was a nice talk here a few days ago about Argo Rollouts, which also includes automatic rollbacks. Super nice.

So, can we have nice things? Well, yes, but actually no. All of these projects are super nice, I highly recommend them, but I can't use them. First, because of the scale we're operating at. My team is not operating tens of clusters or a few hundred clusters; we are tasked with operating across thousands of clusters, closing in on ten thousand. Admittedly, not all of them have managed services deployed that we take care of, but still, all of these clusters need to have the ability to install managed services at any given point in time. Because in the end our customers don't care how we made the magic happen; they just want their push-button deployments. They swipe their credit card, get their quota, and then they start installing stuff. They don't care that our back-end infrastructure has a hard time coping with thousands of clusters. But this is where stuff becomes tricky, because all of the tools that are commonly available don't quite work at this scale.

And then there's an even bigger problem: those are not our clusters. Those clusters belong to our customers. They run in their AWS, in their cloud accounts. And the music stops right there, because this is a big one: one does not simply send data out of the cluster. We can't just take whatever data we want and ship it outside of the cluster, because even such simple things as the names of namespaces in the Kubernetes cluster (it's just like a folder name) might contain sensitive information about a customer's next big project. You can't just take that and put it into our management systems. We also don't want to install random open source projects there. Not that Argo CD is random, but we can't install it on the customer cluster, because the customer might be using Argo already. We also can't grant ourselves arbitrary permissions, because some of our customers get really upset if you hand over the keys to the kingdom, if you will. And we can't go around proxies either, so we have only a few ways to communicate with those clusters, and otherwise they might be really isolated. So in the end: one does not simply walk into Mordor if you don't own the place. And that's essentially why we couldn't use the available open source tooling, and instead set off on the adventure of, well, building our own thing. Alessandro will now show how we deconstructed the problem into smaller chunks.
Okay, thank you, Nico. So, if you carefully read the title of this talk, we already spoiled the name of the thing we built for this: it's called Package Operator. Now let's look at a very high-level overview of what's inside Package Operator and how it could help us solve the problems we highlighted before.

As I said before, Deployments are nice because they have some characteristics that help us deal with Pods. But what if we could have Deployments for whatever we want to get onto the cluster? (Sorry, I lost focus on the window there.)

What we came up with in Package Operator is a concept similar to ReplicaSets. ReplicaSets handle Pods; what we would like to have is a ReplicaSet for anything. So we came up with the ObjectSet resource, which is part of Package Operator and which is able to reconcile a bunch of arbitrary objects and aggregate their status via probes. For every ObjectSet you can watch the status and see whether those objects are working as expected or not. Importantly, it's immutable and can be scaled to zero and archived, for rollback. This already tries to solve some of the problems we saw before.

One of the important things I want to point out is phase reconciliation, so let's see it in detail. In every ObjectSet we can define different phases and assign each resource belonging to the bundle to one of these phases, in such a way that each phase starts only if the previous ones have successfully completed. How do we know whether they completed successfully? We define probes for each phase that check whether the phase is actually done.

In this example we have two CRDs that need to be deployed. If some of you are not familiar with Kubernetes CRDs: they are basically a way to extend the Kubernetes API by defining extra entities; a very simple parallel is creating new entity types in a database. So you can define your own objects to work with and extend the Kubernetes API. Here we have two CRDs and a probe that checks that they are established. The CRDs get deployed, and if the probe succeeds for both of them, the phase is considered complete. Only at that point do we move to the next phase, which contains a Deployment that relies on those CRDs. If we pushed everything to the cluster at the same time, the Deployment might try to become available before the CRDs are established, and something might fail because the new entities are not yet known to the Kubernetes cluster. So after the first phase is done, the Deployment in the second phase gets deployed, the probes check whether it worked, and then the second phase is marked as successful.
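To make this concrete, here is a rough skeleton of such a two-phase set. Since CRDs are cluster-scoped, this uses the cluster-scoped variant; the field names follow the project's documented examples as far as we know, but treat the exact schema as an assumption and check the docs. The actual object bodies are compressed into placeholder comments:

```yaml
apiVersion: package-operator.run/v1alpha1
kind: ClusterObjectSet
metadata:
  name: my-app
spec:
  phases:
    - name: crds
      objects:
        - object:
            apiVersion: apiextensions.k8s.io/v1
            kind: CustomResourceDefinition
            # ...first CRD body elided...
        - object:
            apiVersion: apiextensions.k8s.io/v1
            kind: CustomResourceDefinition
            # ...second CRD body elided...
    - name: deploy
      objects:
        - object:
            apiVersion: apps/v1
            kind: Deployment
            # ...the Deployment that relies on the CRDs, body elided...
  availabilityProbes:
    # Phase "crds" completes once both CRDs report Established=True:
    - selector:
        kind:
          group: apiextensions.k8s.io
          kind: CustomResourceDefinition
      probes:
        - condition:
            type: Established
            status: "True"
    # Phase "deploy" completes once the Deployment reports Available=True:
    - selector:
        kind:
          group: apps
          kind: Deployment
      probes:
        - condition:
            type: Available
            status: "True"
```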
So, ObjectSets are as cool as ReplicaSets are, but Deployments are more helpful for us, because they manage some of the ReplicaSet internals in an easy way. So we created the same concept for ObjectSets as well, and we called it ObjectDeployment. The ObjectDeployment coordinates the transition between ObjectSets and keeps their history, so that you can roll back (a limited history, but it can do rollbacks). When updated, it creates a new ObjectSet and keeps the old one alive until the new one has succeeded according to the probes you define. This is what is actually helpful and solves most of the problems we saw earlier.

Then, finally, we introduced another concept that I spoiled earlier, called packages, which helps people create those ObjectDeployments. A package, as I wrote on the slide, is a single artifact that contains all the manifests, configuration, and metadata needed to run an application. As with RPMs or deb packages, you have a build phase: you basically take the YAML definitions of your resources and put them in a folder, with whatever structure you want, and add a manifest that contains some metadata about the package itself. You can also optionally add a README file and an icon, as you can see at the top of the list. Most importantly, on top of plain YAML files you can also use Go templating capabilities to create resource templates. Then you build the package, and everything gets packed into a non-runnable container image that you can store in whatever registry you want.

When you want to deploy that package, you create a custom resource, either a Package or a ClusterPackage depending on the scope you want to give those resources. You specify the image you built before in the spec.image field, and then Package Operator picks that up and creates all the ObjectDeployments and everything else for you (there is a small sketch of this below). So, that was a brief description of the internals of Package Operator, and now back to Nico for demo time.

Thank you. Essentially, our goal here was to take all the smarts that you find in Argo and other open source projects and ship them, in one single operator, into those customer Kubernetes clusters that we manage. We can then instruct the on-cluster component to do those smart things on the cluster, without requiring outside systems that exfiltrate data or have to work at that scale. At this point you can abstract a lot of stuff away and offload it to the cluster directly.

So, hope this works. What you're seeing here is a local kind cluster. (Is the font okay? Okay.) kind is "Kubernetes in Docker", so it's super nice for getting a Kubernetes cluster up and running quickly. I have Package Operator running here, and now I'm deploying something. This is the ObjectDeployment API that Alessandro just talked about, and what you see here is the nginx example deployment. We have one phase, the deploy phase, so we keep it super simple for the beginning, and in it we have two objects: a ConfigMap, and a Deployment, the normal Kubernetes Deployment, for the actual workload. And we define how Package Operator can make sense of those objects with a declarative probe. Here we say: select everything that is a Deployment, and check that it is available and that updatedReplicas equals status.replicas. That gives us not only the signal that the Deployment is available, but also that it's fully updated. Spoiler: there are a few versions here.
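Picking up the package idea from a moment ago: deploying a built package comes down to one small custom resource pointing at the image. The file layout and the image reference below are illustrative, not taken from the talk:

```yaml
# Package contents before building (the layout inside the folder is up to you):
#
#   manifest.yaml    - package metadata (name, phases, probes, ...)
#   README.md        - optional readme
#   icon.png         - optional icon
#   deployment.yaml  - your resources, optionally using Go templating
#
# After building and pushing the package image, deploying it is one CR:
apiVersion: package-operator.run/v1alpha1
kind: ClusterPackage
metadata:
  name: my-app
spec:
  image: quay.io/example/my-app-package:v1.0.0  # hypothetical image reference
```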
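And the nginx ObjectDeployment on screen might look roughly like this. It mirrors a regular Deployment, with a selector and a template, and the probe is the one just described. The exact field names are our best reading of the upstream API, so double-check them against the docs:

```yaml
apiVersion: package-operator.run/v1alpha1
kind: ObjectDeployment
metadata:
  name: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      phases:
        - name: deploy
          objects:
            - object:
                apiVersion: v1
                kind: ConfigMap
                metadata:
                  name: nginx-config
                data:
                  banner: "hello from v1"   # example payload
            - object:
                apiVersion: apps/v1
                kind: Deployment
                # ...the normal nginx Deployment for the actual workload...
      availabilityProbes:
        - selector:
            kind:
              group: apps
              kind: Deployment
          probes:
            # "is it available?"
            - condition:
                type: Available
                status: "True"
            # "is it fully updated?"
            - fieldsEqual:
                fieldA: .status.updatedReplicas
                fieldB: .status.replicas
```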
I'm making it extra interesting. Check the plan, and create it. It's doing stuff; it's immediately available, because somebody already preloaded all the images. So we see: okay, there is the Deployment, there is the ConfigMap, and there are Pods. Super nice.

What sets this apart from a lot of other solutions is what happens when we watch. Because of those probes, and everything being connected, if something crazy happens on that cluster, like somebody goes in and deletes all the Pods, we see that status reflected in our generic application deployment. And for us as SREs that's already huge, because without looking at any monitoring system, without looking at application-specific telemetry, we can already see that something specific is wrong. When we look at this Deployment: Kubernetes is very good at healing itself, so this is going back to work right away. But if you happen to stumble over this cluster because of an actual incident, having a single resource that tells you where stuff is wrong is already super useful.

So, let's now update. That's where things get interesting. What I did just now is demonstrate multiple advanced concepts at the same time. This is the same ObjectDeployment; I just patched it with different data, so we have a new release, but we also added a new, strange readiness condition. Here we see that a new probe was added, targeting ConfigMaps, and whoever wrote it thought it sensible to say: for this to succeed, an annotation has to be equal to a data key. Very convenient: it essentially says, wait for somebody to do something. And if we look at what the ObjectDeployment is telling us now: you see, it's still available, that's nice, we'll get back to that in a second, and it's progressing, but something is stopping v2 from progressing at the moment. The deploy phase is failing right now, because the ConfigMap probe is failing, because, well, that annotation does not exist on that object. And this could be any dependency missing from a deployment.

Something else that is funny here: I renamed the Deployment to v2 on purpose. Making the jump to a previous talk here at the conference about rollouts: this is a progressive rollout, a canary deployment strategy, your A/B deployment. Right now we have two versions of this application running side by side, until the new version passes all its rollout checks. They run side by side just fine; no customer would see right now that v2 is blocked by something. And at this moment, somebody satisfies that probe. The probe says this annotation needs to be equal to this data key; save; the v1 Pods are terminating, the v1 Deployment goes away, and only v2 remains, because the new revision has now passed all its probes and the old one can go away, since the new stuff will now succeed.
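A minimal sketch of what that "wait for a human" probe could look like, as an extra entry in the availabilityProbes list. Whether fieldsEqual can address annotations exactly this way is an assumption on our side, and the annotation and key names are made up:

```yaml
# Additional probe added in the v2 release:
- selector:
    kind:
      group: ""          # core API group, i.e. v1 ConfigMaps
      kind: ConfigMap
  probes:
    # Block the phase until a human (or another system) sets the
    # annotation to the same value as the data key:
    - fieldsEqual:
        fieldA: .metadata.annotations.release-approved
        fieldB: .data.release
```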
Now, you may say: okay, this is maybe a little bit too complicated just for upgrading stuff. But here is where it becomes super handy: spoiler, I'm updating it again. You know, mistakes happen all the time, and if a mistake gets multiplied across thousands of clusters, stuff gets expensive. What happened here is a typo in the deployment pipeline somewhere: an image was referenced that doesn't actually exist, so you can't roll it out. That's what we see here, an image pull back-off; this will never work. But you see that the v2 Deployment is still here, in operation. And if we check the status information, we see: the latest revision is unavailable, because the deploy phase is failing, because the Deployment's status condition is not Available. This is super useful for us, because imagine again you're an SRE being woken up at 3 a.m. because some cluster is failing: you can check this and you see immediately what's going on.

And now, this is where the demo ends: I updated it again. Now v4 is okay and running; v2 is gone, and v3 as well, because v3 never worked in the first place, so we could get rid of it right away. v2 was working until v4 took over, and now everything is fine again. We also wrote a small helper that can give us a rollout history on that cluster. So we see: the first thing we deployed a few minutes ago worked; the second version worked, eventually, after we patched it; the third one, well, it never worked, so we mark it in our history as never successful; and the fourth one worked in the end. All this with just a single tool, without needing to set up anything else. This is not connected to monitoring solutions, and it does not need data to be shipped off the cluster. This is what we are already using in production in some limited cases, and we want to build on it more. Yeah, so, that's what we did. Time for Q&A.

(Q&A) So the question was whether the operator is certified by Red Hat. It's not in OperatorHub or officially supported by us, because it's something we're only using internally. But the project is open source, and we're doing our best to keep the open source bits installable and nicely documented. So if you want to check it out, you can, and if there is enough interest in it, I'm sure we will offer it in some capacity. It's also in the Sched talk description. Okay, anything else? To repeat: package-operator.run.