[Introductory remarks are unintelligible in the recording: the speaker introduces themselves and the Keptn project, its relationship to Argo and Flux, and its place within the CD Foundation and CNCF, before framing the three problems the talk addresses.]

The first problem: we'd all love to have a single tool to go to for our metrics. That's not the case. In an enterprise you'll have Prometheus and Dynatrace, and another Prometheus instance, and Splunk, and so on and so forth. So that's problem number one: how do we get a unified view and treat that data in a unified manner? The second is: when we make a change, how do we actually measure it? Tools like Argo are very, very good at doing a deployment, but how do we actually measure whether it worked and whether it was successful? And the third is taming the integration complexity. The talks I've heard over the last two days have all been about maintaining a point-to-point integration with this tool, then building it again for another tool, and so on and so forth.

So, challenge number one: the observability tool sprawl. Can I just get a nod or a shake, a yes or a no, if that's an issue for you? Yes, yes, lots of nods, okay. The solution we have come up with, of course, is the Keptn metrics server. Think of the metrics server conceptually as a cache that lives on your cluster. It sits as an operator on your cluster, and you define the endpoints where your metrics live. They could be Prometheus, Dynatrace, Datadog, Splunk, or wherever. The metrics server goes out, retrieves those metrics from that backend, and pulls them into the metrics server, as I say, like a cache. Once they're in there, you can treat them all the same. You just say, get me a Keptn metric; you don't really care where the metric actually came from. That also opens up other capabilities, like HPA and KEDA, so you can drive those tools with Keptn metrics. And this is what it looks like: you run kubectl get keptnmetrics, and you can see we're pulling one metric from the Dynatrace provider and one from Prometheus. Again, you don't care where those metrics came from. How do you actually define that? What does it look like in YAML? You won't be surprised to hear there is a custom resource behind this. The top one there is the Prometheus one.
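(A minimal sketch of roughly what that pair of resources looks like, assuming the metrics.keptn.sh/v1alpha3 API; the provider name, namespace, and query below are illustrative, and field names can differ between Keptn releases.)

```yaml
# Provider: where the metrics live; you create one of these per backend (Prometheus, Dynatrace, ...).
apiVersion: metrics.keptn.sh/v1alpha3
kind: KeptnMetricsProvider
metadata:
  name: my-prometheus                 # hypothetical provider name
  namespace: my-app                   # hypothetical application namespace
spec:
  type: prometheus
  targetServer: "http://prometheus-k8s.monitoring.svc.cluster.local:9090"
---
# Metric: what the metrics server fetches, caches, and exposes uniformly.
apiVersion: metrics.keptn.sh/v1alpha3
kind: KeptnMetric
metadata:
  name: available-cpus                # hypothetical metric name
  namespace: my-app
spec:
  provider:
    name: my-prometheus               # which provider definition to query
  query: 'sum(kube_node_status_allocatable{resource="cpu"})'   # backend-specific query
  fetchIntervalSeconds: 10            # how often the cached value is refreshed
```

Once those are applied, kubectl get keptnmetrics -n my-app lists the cached values alongside their provider, which is essentially the view shown on the slide.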
You specify the provider, and then you give it your query. Obviously the query is tool-specific: the Dynatrace query will be different from the Prometheus query. You can also see that you can define the fetch interval in seconds. Here they're both 10, but you could say, fetch this particular metric from Prometheus every 60 seconds instead. So there's that flexibility as well.

And that's just an example of how it can be used with HPA. Here you're saying: I've got an HPA, here's my metric, my availability SLO, I'm pulling it from a resource of kind KeptnMetric, and here is my target value. If the target isn't met, HPA will scale accordingly. Again, remember that most of what I've talked about is the plumbing. As a user you're just looking at the right-hand YAML saying, get me a metric. You shouldn't care where it came from. That's the abstraction the Keptn metrics server provides.

Challenge number two: I made a change, and, well, great, now what? Argo or Flux has applied the change, but so what? What do we do with it? Let's see whether it was good or bad. The lifecycle toolkit gives you full observability of everything that happens on your cluster at the deployment level, because it emits OpenTelemetry metrics and OpenTelemetry traces. As Argo does a sync, we build an OpenTelemetry trace of what that looks like, and I'll show you that in a moment. And then of course there are dashboarding solutions like Grafana. You can build dashboards for your DORA metrics, because if Keptn is wrapping that deployment process, we know how many deployments you've done, we know how many failed, and we know how long each deployment took, since we've got the OpenTelemetry trace. At the bottom there is the actual OpenTelemetry trace, and I'll zoom in on that a little later.

The third one is this idea of point-to-point integrations and trying to solve that problem. You've got test tools, you've got security tools, you've got your SLO validations: am I allowed to deploy, is the infrastructure healthy enough? And all of these you have to maintain. Keptn is trying to solve that with the concept of Keptn tasks. This is an example of a Keptn task. At the moment Keptn only supports JavaScript, and it runs via the Deno runtime engine. You write these tasks, so they can do anything you want. An evaluation looks like that: again a CRD, where you give it an objective, a Keptn metric as I described earlier, like available CPUs, and you give it an evaluation target. Then you link that into your deployment via annotations, and it either passes or it fails. If it fails, of course, we're not going to do the deployment.

So how do you get those top two YAMLs into your actual process? It's very simple: it's just a case of labelling up your deployment manifest. What you're seeing here are annotations that say: pre-deployment, we're going to do these evaluations. These are things like checking that our third-party systems are healthy, our dependencies are healthy, our infrastructure is healthy, whatever conditions must exist before you're allowed to deploy that app. Maybe it's a maintenance window. Then we trigger the pre-deployment tasks, which is whatever you've written; let's say we're going to notify someone on Slack that the deployment is about to happen. And then we do the exact same two steps post-deployment.
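(Roughly what that labelling looks like on the deployment manifest itself, as a sketch: the workload, application, task, and evaluation names are hypothetical, and the label and annotation keys follow the keptn.sh and app.kubernetes.io conventions described here, which can vary slightly by toolkit release.)

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cart-service                                  # hypothetical workload
  namespace: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: cart-service
  template:
    metadata:
      labels:
        app.kubernetes.io/part-of: simple-app          # the logical application
        app.kubernetes.io/name: cart-service           # the workload
        app.kubernetes.io/version: "1.0.2"              # bump this to trigger a new cycle
      annotations:
        keptn.sh/pre-deployment-evaluations: evaluate-dependencies   # must pass before scheduling
        keptn.sh/pre-deployment-tasks: notify-slack                   # e.g. a heads-up notification
        keptn.sh/post-deployment-evaluations: evaluate-dependencies
        keptn.sh/post-deployment-tasks: notify-slack
    spec:
      containers:
        - name: cart-service
          image: registry.example.com/cart-service:1.0.2   # hypothetical image
```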
So after the pod is scheduled you have the ability to run your own tasks, like notifying that the deployment was successful, or rolling back, or whatever the case might be.

All right, enough slides, let's see this. This is the Grafana dashboard that comes out of the box when you install Keptn and the observability piece. The lifecycle toolkit comes as a Helm chart, and then of course you need something like Jaeger or Dynatrace to actually visualise those traces, and a dashboarding solution like Grafana. But the actual metrics, the DORA metrics, are emitted by the lifecycle toolkit, so there's nothing to configure there. Down at the bottom, if I zoom in a little, we also see a list of traces. As you can probably guess, this trace is a deployment: it is an OpenTelemetry trace of a deployment. If I zoom in, you'll see the shape of things: there are two sections here. There is your application, and then as we go down the tree you've got your workloads. Think of your workloads as your deployment YAMLs, and the application is a concept in Keptn, a CRD called KeptnApp. What that CRD does is let you say: this set of deployments is logically to be considered one application. All of those checks I was talking about happen at both the workload level and the application level. So not only can you test whether your microservices are individually healthy, but at the end of all of that: is my application as a whole healthy? It's no good saying, yes, my cart service is okay (those are really unit tests at the workload level); you then go to the application level and ask, can my user do the thing they need to do? That's what you're seeing in this tree view. We go into the pre-deploy tasks; this long one here is actually the deployment, so this is Argo applying the deployment; and then we do the same thing at the end with the post tasks.

So how does this actually work? We start by installing the toolkit. It's important to say it is a toolkit: if you like the sound of the OpenTelemetry traces and metrics, you can have just that bit; if you just want the metrics consolidation, you can have just that bit. You can pick and choose which parts of this you want. But ultimately you have to annotate your namespace; that tells Keptn, okay, start taking notice of this namespace. This is the application namespace. Then we have a completely standard deployment manifest, nothing special at all, apart from three labels we've put in (I'm not sure why this one is commented out, but it works with app as well): name, version, and part-of. The part-of is the application level, and the name is the workload. Those three labels are what drive the toolkit; they're really the only thing you need, because that's the information we need to build the picture. What is our workload? That's the name. What application does it belong to? That's the part-of. And what version is it? That's the version. And that's basically it. You run with Flux or Argo or whatever tool you currently use, and the lifecycle toolkit just hooks into that deployment and does the rest for you.

How are we doing for time? Okay. This is a sample of a Keptn task definition. As you can see, the code is inline, so you can write anything you like. As long as these tasks exit with zero, Keptn says that was successful, fantastic.
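(A sketch of such a task definition with an inline script, assuming the lifecycle.keptn.sh/v1alpha3 layout with function.inline; the task name and webhook URL are placeholders, and whether outbound network calls are allowed depends on how the task runner is configured.)

```yaml
apiVersion: lifecycle.keptn.sh/v1alpha3
kind: KeptnTaskDefinition
metadata:
  name: notify-slack                    # referenced from the deployment annotations
  namespace: my-app
spec:
  function:
    inline:
      code: |
        // Runs in the Deno runtime; exit code 0 means success, anything else fails the task.
        const hookUrl = "https://hooks.example.com/notify";   // placeholder webhook URL
        const res = await fetch(hookUrl, {
          method: "POST",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify({ text: "Keptn: deployment step starting" }),
        });
        if (!res.ok) {
          console.error(`Notification failed with status ${res.status}`);
          Deno.exit(1);                                        // non-zero exit signals failure
        }
```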
On we go with our day. If they exit with anything other than zero, that indicates an error. Now, if your pre-deployment tasks fail, say you're trying to do a deployment, you go out and check your third parties, they're unhealthy for whatever reason, and you've made that a pre-deployment check, then your workload will never pass. It will never be deployed; it will stay in a pending state. That is by design, because what you're saying there is: my third parties have to be healthy, otherwise I know my application is going to be unhealthy. For post-deployment tasks, of course, the pod is already scheduled, so if a post task fails, the worst that happens is you don't get your notification in Slack, for example.

So what about the Keptn evaluations? Let me show you one other thing I skipped over: how do we get the pre- and post-deployment evaluations to actually trigger and work in the lifecycle? Again, they're just annotations. Here we're saying keptn.sh pre-deployment evaluation: evaluate my dependencies. Pre-deployment tasks: notify someone via Slack or a webhook or whatever the case might be that we're about to start the deployment, and then do the same thing afterwards. This is what a Keptn evaluation looks like, again completely GitOps-compatible. We've called it evaluate-dependencies; remember back here, this is how it ties in. That's the name of the object, so that's how they tie together. And we have a Keptn metric reference, as you saw with the metrics server (available CPUs), and we have an evaluation target. That's partly why we built the Keptn metrics server: we didn't want to pollute this with the idea of providers, because in a large enterprise different teams are probably going to be setting this up. One team is looking at the metrics and the thresholds, another team is looking at the providers. We didn't want you to have to know that this metric comes from Prometheus and that one from Dynatrace and so on and so forth; you can treat them transparently.
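(A sketch of the evaluation definition being described, tying a cached KeptnMetric to a pass/fail target; the names match the earlier examples and the exact apiVersion depends on the toolkit release.)

```yaml
apiVersion: lifecycle.keptn.sh/v1alpha3
kind: KeptnEvaluationDefinition
metadata:
  name: evaluate-dependencies           # the name referenced from the deployment annotations
  namespace: my-app
spec:
  objectives:
    - keptnMetricRef:
        name: available-cpus            # the KeptnMetric cached by the metrics server
        namespace: my-app
      evaluationTarget: ">4"            # pass only if the current value is greater than 4
```

Note there is no mention of Prometheus or Dynatrace here: the evaluation only references the metric, which is the separation of concerns just described.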
I really want to open the floor to questions, because I know the lifecycle toolkit is new to a lot of you. Actually, that's a good question: how many of you are new to the toolkit? How many had only just heard about this now? Good, hopefully there'll be questions. So, in summary, the three big capabilities: a unified way to access observability data, which is Keptn metrics, so no matter what tools your metrics come from, that's the Keptn metrics server. As you saw, all of this generates DORA metrics and OpenTelemetry traces quote-unquote out of the box; that's the lifecycle toolkit, which is what wraps your deployment. It's important to mention that it works at the cluster level, so Keptn is completely agnostic to Flux or Argo or kubectl apply or however you decide to do that deployment. Keptn doesn't really care, because it's actually a pod-scheduling extension: it kicks in when the pod actually begins to be scheduled. And of course I talked about the idea of custom tasks and the SLO evaluations, and all of this, as you've seen, happens natively in Kubernetes with CRDs. Questions?

So the question is: how does the lifecycle controller find out the time-to-recovery metric, given the version would change there? Is it the whole application that it looks at, or just part of it? And what's considered an incident, and when is it remediated? I'm very glad you asked that. It brings up two areas where we need assistance, advice, and opinions from the community: exactly that, time to recovery, but also promotions between environments, because what I've just described is deploying a set of YAML files. That is an area under active discussion, so if that's of interest, please do join the community, get involved, and have your opinion heard.

Next question: when the application is first initialised for deployment, metrics are not yet available for it, so how do we set up the lifecycle for, say, a healthy recovery or a healthy deployment? Let me show you the definition of a Keptn application. This step is recommended but technically optional. What I'm doing here is saying: I've got a logical application called simple-app, it's in this namespace (sorry, I don't know if that's still working), the application has a version, 1.0.2, and it has a set of workloads, each of which has its own version. It's recommended to do this, since with GitOps you obviously want to be declarative, but if you don't, my belief is that the lifecycle toolkit will look at the annotations on your deployment manifests and build this for you. So that's how you define an application, and you can also declare your pre- and post-deployment evaluations and tasks at the application level in this YAML file. Your question was more around the health of an application: the toolkit hooks into the scheduler, so the first time the toolkit really cares is when the pod is about to be scheduled, and we basically say: no, wait a minute, we know you're trying to schedule some pods, but let's go and do some tests and tasks and evaluations first. If they pass, go ahead, Kubernetes, and schedule the pods. Does that answer your question, or not really?

That answers it, but to be more specific: application pre-deployment checks can check for compute resources, say whether memory is available or not. What about scenarios where we want to set up, say, an ingress route and the controller is down? Does it check the availability of the route? Can we have that capability? The short answer is yes. Have we got time? Yeah, we've got time. Okay, in that scenario you would mix the two concepts. Remember you've got your evaluations, which pull from the Keptn metrics server. So the first question is: are there metrics available in the metrics server that we could pull and judge? If there are, perfect. If not, you fall back to the tasks, and those tasks are just JavaScript, so the world's your oyster: you can do whatever you like, and as long as the task exits with zero you're signalling to Keptn that it succeeded. A task is how you would reach out to, say, a CDN that didn't expose metrics and ask, are you still available, before doing your deployment: you do a fetch and make sure their endpoint is up. All good.
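(For reference, the application definition shown in the demo a moment ago looks roughly like this, as a sketch: the workload names and versions are illustrative, and the apiVersion again depends on the toolkit release.)

```yaml
apiVersion: lifecycle.keptn.sh/v1alpha3
kind: KeptnApp
metadata:
  name: simple-app                      # the logical application ("part-of" on the workloads)
  namespace: my-app
spec:
  version: "1.0.2"                      # the application-level version
  workloads:
    - name: cart-service                # hypothetical workloads, each with its own version
      version: "1.0.2"
    - name: checkout-service
      version: "0.4.1"
```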
That mic's closer. I got the mic first. Oh yes, you did. Is there a distributed way to run Keptn, sort of at a management-cluster level, or is there an architectural reference for that? Not that I'm aware of. The only deployment models I've seen at the moment are single-cluster, because you deploy the toolkit per cluster. I'd be very interested in working on a multi-cluster, multi-region setup, and even having a Grafana dashboard to visualise that; that would be fantastic, I'd love to have it. So the short answer is no, not yet, but let's connect and build it.

Would the pre- and post-deployment evaluations also trigger for upgrades, if I update the deployment definition with a new image, for example? Yes; well, yes and no. It works on the version, so you have to bump the version in your deployment, which you'd probably be doing anyway in reality. So yes, as soon as that version is bumped, it triggers, because what I haven't shown is that behind the scenes, as soon as we start this process, we set up a load of custom resources on the cluster, so you can say kubectl get keptnapps and it'll show you the current version and the previous versions. So yes, it does. So I have to bump the version of the KeptnApp? It's not enough to change just the deployment spec? No, just the workload.

Any more for any more? So, I got the nod or the shake at the beginning. Now that I've said all of that, does this sound viable, or as a project are we fundamentally missing something here? Are we missing a problem that this solution could solve? Fantastic, everyone's tired, there have been a lot of talks; I should have got you all to stand up and do some sort of poll. Which two, or which one isn't it? The first two? It might just be personal preference, but tasks just don't feel like the right fit to me, though I haven't tried them yet. No, no, I think, being here and hearing about projects like Ortelius, we could have tasks that push and notify into their repository to say: you've got a new version; after a deployment, here's the image and the chart, go and stick that into your registry. Or even just notifications. I was talking to the Tekton folks today; maybe we trigger a Tekton task. The possibilities are endless, really. Like filing a bug, essentially: when a test fails you could automate the creation of a work item or something like that.
I do want to say one thing that really caught my attention while working with Keptn over the last couple of months, which is the Tracetest integration; seeing that adds a very visual layer. I guess my question is, with the lifecycle controller and the metrics server, do the Tracetest runs essentially show up in the same metrics capacity, or do you see them in the OpenTelemetry and DORA metrics as well? So, how are we doing for time? Oh, good. We've built Tracetest integration, and I'm in contact with the Tracetest folks, a fantastic team; if you haven't seen Tracetest, definitely check it out. But we built it for Keptn version 1, which is a completely different thing to this. We haven't yet embarked on the journey of building it for the lifecycle toolkit, but I think it will be a task, and potentially a task that then pushes metrics into the Keptn metrics server, and then you can run evaluations on it. So yes, I think; we haven't designed that solution yet, to be honest, but it makes sense to have all of this in one place. Because it's not just Tracetest: Datadog, Dynatrace, we'll want to pull traces from those backends and evaluate traces as well.

I guess that is the other problem with Tracetest: it is not multi-cluster, or run at a management level; it really has to run next to the applications. It's kind of the same thing here: unless you're forwarding your telemetry data through an OTel collector into some other centralised observability plane, you're not really visualising or seeing it from a hub-and-spoke model. Exactly; please do sign up to the Slack and raise these questions and thoughts, this is exactly why the community is so powerful. So thank you very much, everyone.