Good morning everyone, and congrats that you made it through a whole week and are here on Friday, in person, for this talk. So hi, I'm Giovanni Liva. I am a TC member of the Keptn project, here with Brad. Unfortunately Anna didn't make it to KubeCon, but luckily we have Brad here with us to accompany me in this talk.

Hi everyone, my name is Brad. I'm from New Zealand, living in Australia at the moment. I first started contributing to Keptn about two years ago, so that was with the V1; we'll talk about the differences between the V1 and the Lifecycle Toolkit soon. I originally wanted Keptn because I was using Argo and I wanted to extend its capability with event-driven CI/CD, so I started adopting it. I had changes that I wanted to make, and then I realized how easy it was to actually, you know, do a PR for the project. That's what sort of kept me going, and then I worked my way up to a maintainer of the project as well, and now I'm on the technical steering committee also.

Has anyone here used Keptn? Any hands? Obviously Thomas as well — yeah, he doesn't count.

Okay, so it could be a little bit confusing, but we have two versions. The first version was built on CloudEvents. What we realized with that was that it wasn't really properly cloud native, so over the years, wanting to pay back a little bit of tech debt, we decided to make the Lifecycle Toolkit: the cloud-native world was, you know, going crazy, and we wanted to be more Kubernetes-native in how we talk to it. So that's why we started the Lifecycle Toolkit.

So maybe I can speak a bit about the new features of this Lifecycle Toolkit and how it differs from V1. V1 we created four years ago, and the main problem back in the day in the cloud-native space was: oh look, a new tool that I want to integrate in my pipeline and use. So we were mainly an orchestrator around tools that you could use to solve your specific use case. But we noticed that nowadays everyone has moved to Kubernetes and wants to be more cloud native, and we were not there yet to support people deploying their applications to a Kubernetes cluster. So what we did six months ago is rebrand the project and start again, and now what we try to solve is making sure that you can deploy your application to a Kubernetes cluster in the best way possible.

High-level speaking, here you can see that eventually you have some manifests that will hit your Kubernetes cluster. This can be from a kubectl apply, from an Argo CD sync, a Flux sync, or Helm; somehow, manifests will reach your Kubernetes cluster. From there, we can intercept all of these manifests and try to create an application concept around the manifests that you deploy, because right now you have singular workloads that should be bundled together in order to create an application. We can read all the manifests, detect how your application should look, and run some pre- and post-checks of your deployment. I will go more into the details, with a demo later, to explain more. Oh — I forgot I already have the demo open. Damn it, damn it. How can I keep this? So let me do some cleanup.

So, how does it look? You have several manifests, and the only thing that you need to do to enable the Lifecycle Toolkit, after you install it in your Kubernetes cluster, is to enable it through an annotation on your namespace. This way — yes, thank you — let me start the demo so we can see.
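For reference, a minimal sketch of that namespace annotation, based on what the Lifecycle Toolkit documents; the namespace name is illustrative:

```yaml
# Annotating a namespace opts it in to the Lifecycle Toolkit;
# the namespace name here is purely illustrative.
apiVersion: v1
kind: Namespace
metadata:
  name: demo-app
  annotations:
    keptn.sh/lifecycle-toolkit: "enabled"
```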
Okay, so the first thing — is it good enough now? — is enabling your namespace. Damn it, I'm good at this. So the only thing that is required for you to work with this is to provide the standardized labels or annotations of Kubernetes. You should provide app.kubernetes.io/name, part-of, and version. With part-of, we can watch all the manifests that have been applied to your cluster and group them by part-of, so we can detect how your application should look. This is enough to provide you observability about what's happening inside your cluster, because we can watch everything, monitor it, and expose traces and metrics via OpenTelemetry.

And if you want to do something more fancy, like I was saying before — to run something before or after the deployment — you can also add some extra annotations on your manifests, at the Deployment or any workload level, to run something before, with a pre-deployment task, or after, with a post-deployment task, and you specify what type of deployment task you want to run. In this case, before my application starts to be deployed, I want to check whether a service called frontend is there or not.

How does this task look? You can see in the manifest that we have a CRD here, because we try to be more GitOps-friendly, so you can deploy this configuration next to your application. The KeptnTaskDefinition defines what the task should do. In this case it's a simple Deno function — so JavaScript, basically — and you can either have it inline here, just for demo purposes, or you can also download the artifact directly from a registry. And you can see here we do nothing else than check an external URL and see whether we can reach it or not.

An example for this could be, for instance, that you have an application with a frontend, a backend, and a database — and databases in Kubernetes are hard, so you decide to have an external service that runs the database for you. But if you have some network policies that prevent any outbound traffic from your cluster, and you try to deploy the application, your backend service will continuously crash-loop because it cannot reach the database. So what you may want to do is run a task before the deployment starts that checks whether you can reach this database; if everything is fine, then the deployment can progress.

And what do I mean by "the deployment stops"? If we look at an example — I have it here — you can see that the four services of this demo are still in a pending state. Why is this? Because we hook into the scheduler to prevent any pod from being bound to a node if the pre-checks are not satisfied. This way we can block any workload from being scheduled to a node; we block any container from running. So if something is wrong, like in this database example, we will not allow the deployment, so we will not create any noise or errors in your cluster that your monitoring solution will scream at you about, telling you to fix it.
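As a rough sketch of the manifests just described — assuming the v1alpha3 API group, since field names have shifted between releases; the workload names, image, and the URL being checked are all illustrative:

```yaml
# A workload annotated for the Lifecycle Toolkit: the recommended
# Kubernetes labels identify the workload and group it into an app,
# and the keptn.sh annotation names a pre-deployment task to run.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: backend
  template:
    metadata:
      labels:
        app.kubernetes.io/name: backend
        app.kubernetes.io/part-of: demo-app
        app.kubernetes.io/version: "0.1.0"
      annotations:
        keptn.sh/pre-deployment-tasks: check-frontend
    spec:
      containers:
        - name: backend
          image: example.org/backend:0.1.0  # illustrative image
---
# The task itself: an inline Deno (JavaScript) function that throws,
# and therefore blocks the deployment, if the URL is unreachable.
apiVersion: lifecycle.keptn.sh/v1alpha3
kind: KeptnTaskDefinition
metadata:
  name: check-frontend
spec:
  function:
    inline:
      code: |
        const resp = await fetch("http://frontend.demo-app.svc.cluster.local:8080");
        if (!resp.ok) {
          throw new Error(`frontend not reachable: ${resp.status}`);
        }
```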
With all of this, we also monitor everything in your application, because now we sit in front of the Kube API, so we can see everything that is sent there. We can monitor everything that is happening in your deployment, so we can provide you with DORA metrics and traces about what's happening behind the scenes when you try to deploy an application. For instance, in this demo you can see that we managed to run nine deployments: three are successful, six are failing, and the time between the different deployments is three minutes and 77 seconds. You can also see the different versions and how long each took to deploy.

And you can see here — this is the example that I ran — we can also, therefore, provide you with a trace that shows why this deployment failed. This is all OpenTelemetry, so you just need to plug in the collector URL of the OpenTelemetry collector, and we can push all this telemetry data to the collector; then you can send it to Grafana, Dynatrace, Datadog — you name it, the platform of choice that you want.

And the cool thing is, whenever you have a more complex deployment, such as this one — this is the full-fledged example — you can then trace your application deployment across stages. Maybe you start in dev, then you promote it to production, but then in production you find there was an error. You can drill back down, looking at the traces, into why you didn't catch in dev the issue you found in production, and prevent that error from happening. So you can have a full trace across different stages and figure out what went wrong.

Another thing that we do is evaluations. Since Keptn V1, the main usage of it was quality gates, which means: I want to test the quality of my application. The Lifecycle Toolkit also allows you to run something after the deployment went through because all the pre-checks were okay — so the deployment started, the scheduler allocated nodes to the pods — and then I want to run something such as a load test, and then verify whether the response time of my application is within a threshold that I consider acceptable. For that we have here the KeptnEvaluationDefinition, which you can ship next to your application, and here you can see a separation of concerns between what metric I want to look at for my application and what target I want for my metric. In this case I want that, when I deploy, the available CPUs should be greater than one. This is a stupid demo example, but you can have a fancier one for your real use case.

Here, the separation of concerns is very important, because one thing is what target I want for my metric, and the other thing is where I get this data from. You can see that the metric is coming from a different CRD, KeptnMetric, where we can say that this data is fetched from Prometheus and the query looks like this. If tomorrow you switch from Prometheus to Dynatrace, Datadog, or New Relic, you simply change the provider without changing the rest of the code, so you have an abstraction between where the data is fetched from and what you need to do with this data.
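A sketch of that separation of concerns, again assuming the v1alpha3 API groups; the Prometheus address and the query are placeholders:

```yaml
# Where the data comes from: swap this provider for Dynatrace,
# Datadog, etc. without touching the metric or evaluation below.
apiVersion: metrics.keptn.sh/v1alpha3
kind: KeptnMetricsProvider
metadata:
  name: my-prometheus
  namespace: demo-app
spec:
  type: prometheus
  targetServer: "http://prometheus.monitoring.svc.cluster.local:9090"
---
# What to measure: a named metric backed by a provider query,
# refreshed every fetchIntervalSeconds.
apiVersion: metrics.keptn.sh/v1alpha3
kind: KeptnMetric
metadata:
  name: available-cpus
  namespace: demo-app
spec:
  provider:
    name: my-prometheus
  query: "sum(kube_node_status_allocatable{resource='cpu'})"
  fetchIntervalSeconds: 10
---
# What target the metric must satisfy after the deployment.
apiVersion: lifecycle.keptn.sh/v1alpha3
kind: KeptnEvaluationDefinition
metadata:
  name: cpu-evaluation
  namespace: demo-app
spec:
  objectives:
    - keptnMetricRef:
        name: available-cpus
        namespace: demo-app
      evaluationTarget: ">1"
```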
Also, here you can see that we have a parameter called fetchIntervalSeconds, because what we do is try to bring observability directly into your cluster. So far, you have your monitoring solution that monitors your application and your cluster and sends this telemetry data outside, somewhere else, and then you always need to go through a different API to read metrics that you might want to have next to your cluster — to use for scaling, for instance. So what we did is create a Keptn metrics server, and what it does is simply, every ten seconds in this example, go to Prometheus, query for that value, read the result, and translate it into the Kubernetes-native metrics language. So we extend the Kubernetes metrics server, and we expose all the metrics that you want natively in the metrics server of your cluster. This way you can configure the HPA (Horizontal Pod Autoscaler), KEDA, or Argo Rollouts to do progressive rollouts, or Flagger with Flux, to directly use Kubernetes-native metrics, so you don't need to query an external tool for this.
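As a sketch of consuming such a metric natively — following the HPA Object-metric pattern and reusing the hypothetical names from the sketch above:

```yaml
# An HPA scaling a workload on the KeptnMetric exposed through the
# custom metrics API, instead of querying Prometheus directly.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend-hpa
  namespace: demo-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Object
      object:
        metric:
          name: available-cpus
        describedObject:
          apiVersion: metrics.keptn.sh/v1alpha3
          kind: KeptnMetric
          name: available-cpus
        target:
          type: Value
          value: "2"  # illustrative scaling target
```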
Yeah, I think that covers a bit of the demo part. So far, this is a bit about how the metrics server works: you can see here that we have a metrics adapter that any tool can hook into. And most importantly, we have several observability platforms that we support — Dynatrace of course, Prometheus, and Datadog — but you see there are three dots, because we are looking for contributors. So please: we have a ContribFest later today at 2 p.m., so join if you want to add your observability tool to the list. It's straightforward, and we are there to support you.

Something you want to say, Brad, about this? Yeah, so we have working group meetings on EU and US time zones as well, so it matches the time zones pretty much around the world if anyone wants to come. It's a really good community to get involved with; we have good first issues all the way to, you know, harder tasks to work on. And you can find us — are we going to use the CNCF Slack or the Keptn one? Both are fine, both are fine. Yeah, okay, so if you go to keptn.sh you can see the Slack channel, and then we're also in the CNCF Slack if you don't want too many Slacks.

I guess, because most people are new, we could start some questions and get some flow going. Yeah, does anyone have any questions so far? I mean, we did go over the manifests a little bit fast, so if you want us to slow down and explain them, we're happy to do that as well.

[Audience] In terms of the providers, I'm just curious — you showed, I think in the first part of the presentation, how you keep the pods pending, and I was wondering: how exactly is it implemented? How do you actually tell the scheduler?

Yeah, I didn't go too much into the technical details, but we have a booth if you want to chat afterwards — after the talk, feel free to reach out to the booth. How we do that: we simply create a scheduler plugin. The way Kubernetes works is that whenever it tries to allocate a node to a pod, it first calls the plugin — which is us — and we say: wait, we need to wait for the pre-checks to finish. And then, when they're finished, we can say: sorry, the scheduling should not go on, because there were some problems.

Oh, yes — it's the plain Kubernetes scheduler mechanism, so we can work with any scheduler; if you have your own custom scheduler, we still work with that. But since this last release of Kubernetes, scheduling gates are now beta, and we plan to support this API to block any pod from being scheduled, so we won't need the scheduler plugin anymore.

You're welcome. There are quite a few changes to the scheduler in the next two releases as well — some really good ones. If you look at the enhancement proposals of Kubernetes, you can see all the changes that they're doing; it's going to make this very cool.
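For context on the scheduling-gates API mentioned here (beta as of Kubernetes 1.27), a minimal sketch — the gate name is purely illustrative, not necessarily the one the toolkit will use:

```yaml
# A pod created with a scheduling gate stays unscheduled (status
# SchedulingGated); once a controller removes the gate, the
# scheduler is allowed to bind the pod to a node.
apiVersion: v1
kind: Pod
metadata:
  name: gated-pod
spec:
  schedulingGates:
    - name: example.com/pre-deployment-checks  # illustrative gate name
  containers:
    - name: app
      image: nginx:1.25
```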
Any other questions? Yep, you can shout, and I can repeat it. — It's okay, thank you. [Audience] What use cases would you use this for? — Thanks. Yeah, so as use cases, we have observability of your application — or better said, of the deployment of your application — to see what's happening behind the scenes when you send some manifests.

Then, preventing bad things from occurring. If you have automation that automatically deploys your application to your cluster — let's say every time you merge a PR to main, an automation kicks in and creates manifests, and Argo picks up the changes and tries to deploy to your cluster — what if someone merges a PR while you have a maintenance window for your cluster? Does it make sense to deploy the new application when you know that half of the nodes are unhealthy? It will most likely crash-loop, create a lot of noise, and then create a ticket for some engineer to look into what the problem is. A better way is to run something that can prevent the deployment with a simple check, such as: look into PagerDuty — is there any open issue, am I in a maintenance window, yes or no — and then you can decide to progress with the deployment or stop it.

The other one is to support this quality-gates part, where you can also run things after the deployment, so you can check the quality of your application. Readiness and liveness probes just say whether a workload is fine, not whether the application is healthy. With this, since we know when all the workloads of your application are completed, you can run some tests to check the quality of your application, and only if everything is fine do you maybe promote it automatically to the next stage. So these are mainly the use cases that we are trying to solve.

[Audience] Hi, thank you. I would like to ask how it looks from the client side — for example, when something is deployed and people don't do it themselves, what can be seen? Logs, for example — how do people understand that they need to go to Grafana, or look at the trace, and so on, when something happened and a deployment failed?

Maybe it's demo time — we will show it, but thank you. Okay, so if I get your question correctly, it is: where can you see if something is failing? I think maybe if you show the roadmap, the roadmap will have a lot of the answers to this question as well. Do you have the roadmap up there? Yes, it's also here if you want to look at it. It's public; it's a GitHub project on our organization.

And so, really, your question was: when it's failing, what are you going to do afterwards — like, are you going to allow self-remediation? — Okay, I see: after you apply your manifests, if something goes wrong, how do the developers know to go to the Grafana website and see what's happening? That's a very tough question, because most of the time you have automation that triggers everything, because you don't want to allow anyone to access your production cluster. You want only a single, you know, automation that can read and send the manifests there, to prevent, you know, bad things — typos, human errors. And everything is event-based, so you cannot really apply some manifests and then wait for the result and get an error code.

A better solution for that is to observe the event stream of Kubernetes, because every time something happens, Kubernetes emits some events, and we also do that. You can see it in the trace: whenever you start a deployment, a new trace is created. So let me do another example: you can watch for that trace to see what's happening, so you know that this is your workload. You simply have to wait; you see the trace started, you can then look at the trace and wait until the different steps are completed, and this trace will be populated with all the information that you need. Or you can watch the Kubernetes event stream: we also emit a Kubernetes event whenever something is happening or failing. So in the example before, where the database cannot be reached because of a network policy, we create an error event saying: you cannot satisfy this task. Yes, you're welcome. Any further questions from the audience?

So maybe we can go over to the roadmap. As you can see, we've started to work on — thanks for the question before — the scheduling gates. We are looking into using them, but we will not support them until they are stable; we are preparing for that, though, and they will be stable two releases from now.

Then, what we are also trying to work on is automatic application discovery. If you forget one of those recommended labels, we might still be able to detect your application automatically, or we try to create an approximation of your application, just to make it easier for you to use our tool.

Then, on context propagation for GitOps: this is a cool thing. When you promote through GitOps, you lose track of the previous PR that created the things in dev; then, when everything is completed, one creates a PR for the staging environment and eventually for the production environment, but these PRs are not linked together. We want to solve this issue by propagating a trace context across stages — simplify the propagation of the trace ID through different contexts and stages, so you can trace everything back to dev.

Then, since we rebranded the project six months ago, we know our documentation is not the best, so we are working hard on making it better. But we also plan multiple things, such as introducing the concept of an instance, because now you have maybe the production environment, but you don't run a single instance of your production environment: you have one in Frankfurt (eu-central), one in North America, one in Singapore. So we want to have the concept of an instance for you as well. And then, propagating context information to your running tasks — hooking your own tool into a Keptn task and propagating further context information — so you have more information about what's happening in your deployment inside your task, and you can make a smarter decision about what you should do.
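As an aside on that application concept, a sketch of the application-level resource the part-of labels (or the automatic discovery just mentioned) would feed into — assuming the v1alpha3 KeptnApp CRD, with illustrative names and versions:

```yaml
# The application-level grouping of workloads and their versions,
# which the toolkit can also derive from the part-of labels.
apiVersion: lifecycle.keptn.sh/v1alpha3
kind: KeptnApp
metadata:
  name: demo-app
  namespace: demo-app
spec:
  version: "0.1.0"
  workloads:
    - name: frontend
      version: "0.1.0"
    - name: backend
      version: "0.1.0"
```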
And yeah, more support for different types of runtimes to run your tasks: right now we support Deno, but we plan to also support Python and, most importantly, containers, so you can bundle everything into an image — a simple container — and then we can, you know, simply run your container, so you don't need to script your way through in TypeScript.

But we have a lot of things in the backlog, and if you have any opinion, any idea, any suggestion, please reach out and comment, because we are really looking for the community to support our decisions — for instance, how we should do the GitOps propagation of the trace context, because every organization has its own way of doing promotion, and we would like to have feedback from the community in order to create the best solution.

Yeah — because promotion is a bit of a controversial topic. There is no one-size-fits-all; every organization does it differently. Some go through opening a Jira ticket; some want to have everything automated. We don't have an opinion at the moment, but it would be interesting if we could have a chat about how you do your promotion. Yeah, I've worked a lot on promotion in V1 as well, so I'd be interested, and I know Thomas would be as well. It's quite a complex thing, especially if you try to put rollback in as well. Yeah, we've spent many weeks of pain on that topic.

Okay. Well, thank you everyone for coming, and yeah, if you have any questions or want to reach out, please do — and, you know, with your use cases on alerting, we can help you with that as well; we're here to help. Thank you. Thank you.