Yes, hello everyone. Today we will talk about the journey we had when we tried to integrate GitOps into delivery pipelines, or into workflow- or process-based tooling. My name is Thomas Schuetz, I'm a principal engineer at Dynatrace and a maintainer of the Keptn project, and I'm here with Brad McCoy. So, as we will talk a bit about Keptn today, I first want to describe what Keptn is. Keptn is a tool which should help you keep your deployments and your applications stable. It does this by putting data, or service level objectives, at the center of your process; it helps you integrate different tools and orchestrates everything in your process. Okay, so what are the goals we want to cover when we use Keptn? First, we want to do some kind of release validation. For instance, I deploy something and afterwards want to find out whether everything works as I expected. For that I can use my observability solution, be it Prometheus, Dynatrace, or whatever else. The second reason to use Keptn is that you get a standard across your observability tooling: you define your quality criteria once and can use them with whatever observability solution you like. The third thing is that you can use it for vendor-neutral tool integrations. You define your workflow once, and because everything is based on CloudEvents, you only need a consumer for each tool to execute it. Last but not least, we can take auto-remediation actions. Based on our observability solution, we can define events we listen on, and if something fails, we can take actions based on that. Okay, so enough about Keptn. We said that we want to talk about our GitOps journey, and the first thing we had to deal with when bringing GitOps to Keptn was that Keptn was a very imperative tool.
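The quality-gate idea described here, comparing measured SLI values against SLO criteria regardless of which observability backend produced them, can be sketched in a few lines. This is a hypothetical simplification, not Keptn's actual implementation; the `"<= x"` / `">= x"` criterion strings and the metric names are illustrative.

```python
def evaluate(slis: dict, slos: dict) -> dict:
    """Compare measured SLI values against simple '<= x' / '>= x' criteria.

    slis: metric name -> measured value (from any observability backend)
    slos: metric name -> pass criterion, e.g. "<= 500"
    """
    results = {}
    for name, criterion in slos.items():
        op, threshold = criterion.split()
        value = slis[name]
        if op == "<=":
            results[name] = value <= float(threshold)
        elif op == ">=":
            results[name] = value >= float(threshold)
        else:
            raise ValueError(f"unsupported operator: {op}")
    # The gate passes only if every individual criterion passed.
    results["pass"] = all(results.values())
    return results
```

Because the criteria are defined separately from the measurements, the same SLO file works no matter whether the values came from Prometheus, Dynatrace, or anything else.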
There was no Git-based configuration, and you had to do everything via commands. Therefore we thought it would be a good idea to create some kind of declarative language for Keptn deployments and build some operators around it. I don't want to go too deep into this graph, but it should show how hard it was to implement GitOps in an imperative application. We found out a few things. At first I thought this would be GitOps; after spending some time on GitOps and dealing with it, I can say: no, this isn't GitOps. It only covered the configuration of Keptn, not the configuration of our deployments. Furthermore, we had to deal with our own artifacts: we had our own Keptn repository where we expected our artifacts, so we had to manage them ourselves. Third, we had no real desired state for deployments. We had some kind of deployment configuration; we copied something, we modified the configuration, but in fact this was not a real desired state. And last but not least, and this was the hardest part: we didn't use Argo or Flux. We implemented our own GitOps controller, and this was hard for everyone who wanted to use it. I think there is a slide missing. So, Brad will tell you what the real GitOps implementation looked like. Here you can see that we first started implementing Argo CD with Keptn. We used Argo Rollouts and Flagger as well, but one thing we wanted was a solution where you could bring either Argo or Flux. In this scenario, the problem we were seeing is that when you are just using GitOps and promoting through environments, it is hard to do real testing. Specifically, we used Prometheus for metrics, and when we've done a deployment, we want to verify that everything is okay and then use that as a quality gate, which I'll show you soon, to progress to the next environment.
So you can see here that the GitOps controller will see the change and then send a CloudEvent to Keptn. You can use the Argo CD notifications controller, or you can use Flux with a provider and an alert, and then we've got a small integration that turns that into a CloudEvent and starts the Keptn sequence. You can see here that we run tests, evaluate, and then create a PR to progress to the next environment; the PR can either be auto-approved, or you can ask for approval, for example if you want to go to production. This is roughly what we started with before we tried to integrate Keptn. Everything is in the main branch. You could change this, it doesn't have to be these environments, but it is very simply copying the settings file, which has all the application configuration, essentially the environment variables, plus the version of the container. So when it promotes through environments, it simply copies from folder to folder. You can see here that we have a folder for each environment, and it will simply copy over the settings and the version. For today we're going to show a quick demo with an app we like to use called podtato-head. This app comes from the CNCF app delivery group; you can change its arms and legs, and we like it for demos. The thing we wanted to add to this was testing, such as load testing. You can see here that when using Keptn, we can have a quality gate between environments: you define your SLOs and SLIs as the acceptance criteria to progress to the next environment. For the load testing, we run a k6 load test, then we query the metrics from Prometheus, and if that passes, we do the Git promotion, essentially copying the settings file from the load-testing folder into the integration-testing folder, and then Argo or Flux picks that up.
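The small integration mentioned above, turning a GitOps-tool notification into a CloudEvent, can be sketched as a pure transformation. This is an illustrative sketch only: the notification field names and the `sh.keptn.event.dev.delivery.triggered` event type are assumptions, not the exact payloads the tools emit.

```python
import uuid
from datetime import datetime, timezone

def to_cloud_event(notification: dict) -> dict:
    """Wrap a GitOps-tool notification into a CloudEvents 1.0 structured
    envelope that could trigger a Keptn sequence. Field names in the
    incoming notification dict are hypothetical."""
    return {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": notification.get("source", "gitops-bridge"),
        # Assumed event type following Keptn's stage.sequence.triggered scheme.
        "type": "sh.keptn.event.dev.delivery.triggered",
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        "data": {
            "service": notification.get("app"),
            "image": notification.get("image"),
        },
    }
```

In the real setup this envelope would be POSTed to Keptn's API; here it is just returned so the shape is visible.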
It would now see the change, and the journey would start again: it would get the change, trigger Keptn with a CloudEvent, and that workflow would run again. So this is what Keptn looks like. You can see here that this is where you define your services. In this demo we're using around three services. We're using the Prometheus service; you can swap that out for Dynatrace or Datadog, and there are quite a few other integrations as well. The thing I like about it is the interoperability. Keptn is not really opinionated: you just declare, "okay, I want you to go do an evaluation", for example, and then the service subscribes to that event. Keptn will send out the event saying, okay, we're on this stage, we're going to trigger this event, and the services that are subscribed to it will then go and action it. You have the event types triggered, started, finished, and failed as well. So this is what it's going to look like. I'm just going to pretend that we did a Git change; I've actually done one just before this. We now come here and say: I'm going to start the load-testing stage. So if someone has finished a feature and merged it back into main, we then kick off the workflow. This triggers the sequence, and that will now roll through the environments. You can see this automation step: it's going to go right from the feature into production. It will go through load testing, integration testing, QA, staging, and then the last step has manual approval: it will make a PR, we just click approve and merge, and that rolls up to prod as well. So this is during the load test now. It will actually use k6 to generate a lot of traffic, because one thing we found is that when we query Prometheus, normally it's for 4xx or 5xx errors to see if our application is working.
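The triggered/started/finished pattern just described can be sketched as a tiny publish/subscribe loop. This is a hypothetical simplification of Keptn's event flow, with the failed case omitted; the event-type string follows Keptn's naming convention but is used here only as an example.

```python
subscribers = {}

def subscribe(event_type, handler):
    """Register a service's handler for a '.triggered' event type."""
    subscribers.setdefault(event_type, []).append(handler)

def emit(event_type, payload):
    """Notify every subscriber, logging the started/finished envelope
    that surrounds each handler invocation."""
    log = []
    for handler in subscribers.get(event_type, []):
        log.append(event_type.replace(".triggered", ".started"))
        handler(payload)
        log.append(event_type.replace(".triggered", ".finished"))
    return log
```

The point of the pattern is that the orchestrator never calls a tool directly; it only announces that a stage's evaluation was triggered, and whichever service subscribed (Prometheus, Dynatrace, ...) picks it up.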
We found that when there's no traffic, it's hard to get a proper outcome on whether it's actually working. So we added more API testing; we've just integrated Tracetest as well, which can use OpenTelemetry traces and derive evaluations from them. You can see here that the load stage worked: the k6 load testing was good, and we've seen that the traffic went through. Then it sent an event to run the evaluation, which the Prometheus service was subscribed to; it did the evaluation and said: yes, we're happy with these results, they met the requirements, and then it does the Git promotion. For this Git promotion it's just auto-approved, and it goes to the next environment. So you can see that we went to integration, did some trace testing, evaluated that, and promoted again. What it's doing at the moment, through each environment, is it just keeps creating a PR and then automatically merging it. The GitOps tool, every time it notices the change for an environment, will pick it up, deploy it, and run the tests again. We're in the last step, staging, now, and once it finishes, it will create a PR to merge into production. We're also cataloging the status, because if we have, say, 30 QA environments, we want to know which versions are in which environments. So we're finished, and you can see the PR link here: it says "promote to stage production", and it makes the PR for you. We want to do things like reporting as well, so we can report back on all the testing we did, and then a developer or DevOps engineer can come in and just say: okay, this looks good to me, let's put it into prod. So you can see the files changed. In this case it was very simple: I wanted to change Podtato Head's hat.
So at the moment he has this hat, and that change made it all the way to a PR. I can merge that PR, and now it's sitting in the main branch, in the production settings file. In this case we're using Argo, which will then pick up that change and deploy it, and we'll see his hat change soon, once that syncs. It might take a minute or two to come up; I can maybe just refresh it here. Yeah, so that's syncing now, and you can see that my change to update his hat has progressed through each environment, we've tested that everything's okay, and it's made its way into prod. Some challenges we have at the moment: how do we deal with roll... sorry, do you have a question? Yes, yes, I'll show you now, that's a good question. So this is using Kustomize; you can do some of these things with Helm as well. This is all on the main branch, so we're not working on branches; some people do environments on branches, but it's not really scalable, and some folks say it's bad practice to have an environment per branch. So everything sits in main, and it simply copies from one folder to the next. As an example, if we go back and look at the old PRs, we can see one that is closed: you can see it was auto-approved, so it actually merged itself. So it's keeping Git as the source of truth and trying to stay true to the GitOps principles. Yeah, that's roughly how we've done it at the moment, and Thomas will now talk about our findings from this; things like rollbacks are very difficult, and he'll talk a little bit about where Keptn is heading in the future as well. How are we going with time? So, yeah, one of the major challenges we had with this is that rollbacks are obviously very hard. If something's broken, there can be so many different scenarios; and if you keep rolling back and that fails too, when is it enough, or do you just keep going around in a circle?
We found it to be quite a ping-pong game, where Keptn is telling Argo things. I think you saw in the picture of the whole Argo process that we did a lot of ping-pong: we took a look in the repository to see if something changed; if something changed, we notified Keptn; then Keptn did its part, notified Argo, and so on. So this was a kind of ping-pong game. It worked pretty well, but it didn't feel really right. Yeah. Especially when you have things like Argo Rollouts and Flagger, which already have some of that capability; we just wanted to go the extra step so you can bring your own tools for whatever you want to use. Sorry, you have a question? Yeah, feel free to just ask as we go. I have a question regarding scalability with GitOps: how many deployments can Argo, or this tool, support? Per hour, can it do 1,000 deployments, 10,000 deployments? Is that something you've tested? I guess we haven't tried it at that scale, have we? This has been pretty small scale, but I think we would run into some challenges, because if you have a really big development team and you just keep feeding features in, it's going to kick off these workflows as well. One thing with the Git promotion service is that it gets quite complex when you start having to edit PRs; when you have more than one PR open, it can get quite complex. But in general, is the question about GitOps itself, or about GitOps and Keptn the way we're doing it? With GitOps itself, I was with the Flux project the other day, and they showed a Terraform controller handling something like 1,000 deployments at once, so it very much can scale; you can just give the GitOps tool more memory and it generally goes pretty well. Yeah, I've heard of quite a few large deployments. I can take the other two points.
Two other things we also found out. We need some kind of context information to know which sequence a service is in at the moment. What you saw in the picture I showed you before, in the Keptn UI, was that there are tasks which are connected to each other, and this is propagated via context information. If you think about GitOps, it might be pretty hard to get this through a pull request or through a Git repo. So, especially for the whole ping-pong game we did here, for the change detection, kick-off and so on, we had to find a way to propagate our context information through the GitOps tool. This was pretty nice with Argo, because there was some information we could pass along; with other GitOps tools this was not as easy, and it was also a problem we faced in our GSoC project afterwards. Yeah, because if it's an opinionated payload, it's hard to get the things that we need. With the Argo notifications controller, for the application sets we can put some labels on, and it will pass the labels through so we get them as well; for Flux, we did a small integration that takes the payload into our web service, transforms it into a CloudEvent, and sends it to Keptn. So we have ten minutes left. Thomas, would you like to talk about the... Yes. As you might have noticed, we faced a lot of challenges with this workflow-based approach and with pipelines, so over the last few months we put our heads together and thought about how such things could look in a cloud-native and GitOps way: delivering without pipelines, without workflows, simply relying on GitOps and all the things we love. One of the results is that we created the Keptn lifecycle controller, a small, typical Kubernetes operator which helps you achieve exactly the things which Keptn did before, or still does.
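The label-based context propagation described above can be sketched as lifting labels from the notification payload into CloudEvents extension attributes. Both the label names (`keptn-context`, `keptn-triggered-id`) and the extension attribute names are illustrative assumptions, not the tools' actual keys.

```python
def with_context(event: dict, notification: dict) -> dict:
    """Copy Keptn-context labels from a GitOps notification into
    CloudEvents extension attributes, so the next sequence step can
    resume the right sequence. Label names are hypothetical."""
    labels = notification.get("labels", {})
    extensions = {
        "shkeptncontext": labels.get("keptn-context", ""),
        "triggeredid": labels.get("keptn-triggered-id", ""),
    }
    return {**event, **extensions}
```

Without something like this, every notification looks like a fresh start, which is exactly the problem the talk describes with opinionated payloads.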
And everything starts with your manifest. One thing we took care of was that you don't have to change much in your manifest, so you can use the Keptn lifecycle controller with your default Deployments, StatefulSets and so on; you don't have to change your Kubernetes primitives. The only thing relevant for us is that you have to add some annotations, but I think that's acceptable in this case. You can use Argo for that, you can use Flux for that, you can use your pipeline tool and so on; everything which applies something to the Kubernetes API can be used. What our lifecycle controller does is put observability around all the things you are doing. So, for instance, you see when your deployment started, how long it took, and when it ended. And one thing we took care of, because this was also a problem with the last approach: we are doing this application-aware. For instance, you can run pre-deployment tasks and pre-deployment evaluations before the first service of an application gets deployed; you can check if there is enough CPU left to deploy your application, and only after these tasks and checks have been executed do the workloads themselves get deployed. The second thing we are doing is standardized task execution. You simply annotate your deployment and say which task you want to use. Behind this there is a kind of task definition; this can be extended as you like, and whoever wants to write an integration is able to, and you can exchange it however you want. At the moment this project is around two or three months old. We are only supporting functions, but this could be containers or serverless in the future, and it could also be external control planes, such as Keptn as it exists today, where you send your CloudEvents to. And a typical workflow looks like this.
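The annotation-driven opt-in just described can be sketched as a simple check over a workload manifest. The `keptn.sh/*` annotation keys follow the lifecycle controller's documented scheme, but treat the exact set here as illustrative rather than authoritative.

```python
# Annotation keys assumed from the Keptn lifecycle controller's scheme.
REQUIRED = ["keptn.sh/app", "keptn.sh/workload", "keptn.sh/version"]

def lifecycle_enabled(manifest: dict) -> bool:
    """True if a Deployment/StatefulSet manifest (as a dict) carries the
    annotations the lifecycle controller would key off."""
    annotations = manifest.get("metadata", {}).get("annotations", {})
    return all(key in annotations for key in REQUIRED)
```

The design point is that the workload stays a plain Kubernetes primitive: nothing but metadata changes, so any tool that applies manifests to the API, Argo, Flux, or a pipeline, works unchanged.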
So at first we apply our manifest to our Kubernetes control plane; let's say we are using Flux. Afterwards, if an app is detected, that is, if the manifest is annotated with an app annotation, we wait for an app custom resource and deal with our pre-deployment tasks and evaluations. After these have finished, we can do things at the workload level: we can check something before the pod gets scheduled, evaluate whether it matches our requirements, and once everything passes, we simply release the pod. The same applies post-deployment: after we detect that the pod has been deployed, we can run some checks, some tasks, and evaluate whatever we want; and after we find that all of the workloads in our application are deployed, we can also run application-level post-deployment tasks and evaluations. So if you want to find out more about the Keptn lifecycle controller, I'll be at the Keptn booth over the next three days, and I'm really happy to talk with all of you about this. I think it's a pretty cool tool. To wrap up the session: we think that GitOps and Keptn fit perfectly together in terms of use cases. The initial approach was not very GitOps-friendly, but I think the lifecycle controller approach will make everything better. What we also found out is that adopting GitOps in Keptn, or in typical workflow tools, can get really hard, because at some point you're simply pushing something to a Git repository and have to act on it, and you always have to figure out when the tool should continue, how to promote to the next stage, and how the handover to the tool works. There are a few more Keptn talks today; you can also come to the community day for free, and you can see the schedule here, that's on level three. Okay... there it is. Yeah, I guess we'll take questions now, because we've only got a few minutes.
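The flow just walked through is an ordered sequence of phases, where each gate must pass before the next phase runs (pod scheduling is only released after the workload pre-checks succeed). A minimal sketch, with phase names paraphrased from the talk rather than taken from the controller's real CRDs:

```python
# Phase order paraphrased from the described flow; names are illustrative.
PHASES = [
    "app-pre-deployment-tasks",
    "app-pre-deployment-evaluations",
    "workload-pre-deployment-checks",
    "workload-deployment",
    "workload-post-deployment-checks",
    "app-post-deployment-tasks",
    "app-post-deployment-evaluations",
]

def run(checks: dict) -> list:
    """Run phases in order; a phase with no registered check passes by
    default, and the first failing check stops the whole flow."""
    completed = []
    for phase in PHASES:
        if not checks.get(phase, lambda: True)():
            break
        completed.append(phase)
    return completed
```

For example, a failing CPU check registered under `"workload-pre-deployment-checks"` would leave only the two app-level pre-deployment phases completed, so the workload is never deployed.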
So does this tool also support things like blue-green deployments? No; we decided not to compete with things like Flagger or Argo Rollouts, because we think they do this better than we could, but we can extend them. Everything we're doing in terms of pre-deployment tasks and evaluations, and tasks in general, can be used from the Keptn lifecycle controller, but if you want to do progressive delivery, blue-green deployments, canary releases and so on, I think the better choice is to use the tools that are already there. Because you can still use an Argo Rollouts CRD that does that for you, so you can do blue-green and canary with that. Yes, and you can also use our tasks and evaluations with Argo Rollouts or Flagger, so that's no problem; and you get the benefit that you are more or less tool-agnostic, so you can switch between the GitOps tools you like very easily, and we hope to bring some standardization into the tooling landscape. Any other questions? With this new lifecycle operator approach, does it feed the health information back to the GitOps tool, like Argo CD? When it shows the health and status of your app, does that include what Keptn knows about or is evaluating? I'm sorry, I didn't hear that too well. The new lifecycle operator approach: if you're using it with a CD tool like Argo CD, Argo CD normally gives you back the health and status of your application and how all the components are doing; do the things that Keptn is evaluating and testing reflect back into the health status you see in Argo CD? At the moment you see how your deployment behaves, so you see that all of your pods came up and that everything is ready. One thing I found out in the last week is that it would be nice to get back some status information from the application itself, and this is something we will be working on. Thank you. Thank you.
We'll be at the Keptn booth, so if you want to come chat to us, we'll be happy to talk.