All right, everyone, welcome back from lunch. We're going to kick off this talk, and this one features me — the MC is actually giving a talk. So this talk is: why is it taking so long? I'll introduce myself in case you don't know me or haven't met me. My name is Christian Hernandez. I'm the head of community over at Akuity, an Argo CD maintainer, an OpenGitOps maintainer, and a wine drinker — so I'm happy here in Paris. Andi, go ahead. I know you like red wine, because of last night. Red wine, yes. Red wine it is. So folks, if you want to get him drunk, red wine is the way to do it. My name is Andi Grabner. I wear my Keptn t-shirt with pride, as do some of the other Keptn maintainers who are here. I work for Dynatrace, and I've been in observability for the last 16 years, at least at Dynatrace; before that I did performance engineering at a previous company. Observability, performance, figuring out why things go wrong — that's important to both of us, and I think that's what we want to talk about today. By the way, these are QR codes; you can scan them. In the morning the staff wasn't letting people take pictures, but now you are allowed to — especially of the QR codes, which get you additional resources. So please scan them, take pictures. And with that — I can drive from here. Oh yeah, that's right. But first we want to do a quick poll. While we were working with each other, we found out that, interestingly, both of us speak the exact same three languages. So we thought it would be fun to do a choose-your-own-adventure on which language this presentation should be in. Neither of us speaks French. Would you be interested in hearing it in Spanish? Oh, there's a few people here. German? Very good — German, or Austrian; Andi is Austrian, after all. Yeah, okay, there we go.
Wow, definitely a Keptn crowd then, right? Or a Dynatrace crowd. Or English? Okay, I guess we're going to do it in English. I just need to quickly double-check something. For folks who came in late, you may have seen me struggle earlier — I just want to make sure my demo is coming up at some point, at least. Still running. Folks, please get off the WiFi. Even though I have a wired connection, I still don't trust the whole thing. It's still setting everything up. In case things break, everything we show you later is also on GitHub; you can run the same Codespace. Just keep your fingers crossed that nobody's messing with my system. It's still starting up, but I have faith and confidence that it will work later. More faith than I do. Okay, so I'd like to start by talking about Argo CD in general. We are at ArgoCon — and even so, this morning I saw a lot of people saying this is their first ArgoCon, so a little bit of groundwork always helps. Argo CD works great, right? I like to abstract Argo CD a little by saying it's great at handling state. People use it as a CD tool, as a deployment tool; a lot of people use it as a GitOps tool, an infrastructure-as-code kind of tool. But thinking abstractly, Argo CD takes care of state, and in that regard it adheres to the GitOps principles: managing your source of truth and your state. You have your source of truth — whether it's Git, a Helm chart, or in the future OCI, if you saw my talk about OCI — and Argo CD takes that source of truth, makes it into your running state, and reconciles the two. So my point is: yes, it's a continuous delivery tool, but I like to think of it as a state management tool more than anything else. And it has a great developer experience.
I think most of you here have used it, or have seen screenshots of it. It has a feature-rich UI and tooling for developers, because that's how it was built: with the developer and end user in mind, for onboarding applications onto Kubernetes. In a lot of ways you can see Argo CD as one of the very early representations of a Kubernetes UI, because it was built to abstract away Kubernetes. A lot of you heard in the morning presentations that Argo CD ranks number three in the CNCF and number four in the Linux Foundation, which to me is nuts — how quickly it blew up. But Argo CD actually also ranks number three among observability tools, which is very, very interesting. And when you think about it, when you're looking at the Argo CD UI, looking at the status of things, that is a kind of observability, right? If you don't believe me, there's a link there — I tried to get a Bitly going, but ran out of time. Now, we would all love the happy path. We would all love everything to be stateless, everything to be on Kubernetes, but that really isn't the truth. Observability goes beyond GitOps deployments to Kubernetes — it goes beyond just that. Your view of the world goes beyond Argo's view of the world; Argo's view of the world is very, very narrow. A lot of the time Argo works really great... until it doesn't, and you get that little broken heart, those degraded resources. That's because a lot of the time people run into other factors you have to take into consideration when doing deployments, when deploying your application, because your application can be made up of different parts outside the purview of Argo CD.
When people think about whether an Argo CD sync is good or bad, they think in terms of failures — our minds go straight to failures. But other things can happen that are detrimental to your application deployments. Syncs get slower over time: for example, there may be a noisy neighbor eating up memory on the system you're deploying to that you don't know about. Failures without changes: someone makes an out-of-band change, on some other system, that modifies your Kubernetes cluster. Me, I'm a GitOps zealot, a GitOps purist — but not everybody sees the world that way, and people don't use Argo CD only for GitOps. So out-of-band changes can happen on the system, and you get failures while saying: "I didn't change anything — the source of truth is still the source of truth." And other things too: the system becomes unusable, or maybe DNS goes down — we all love to blame DNS. Really, the point I'm getting to is that failures, slowness, and issues can come from things beyond just Argo CD application health. As I said before, Argo CD has a very narrow view of what it's deploying. It's very good at looking at what it deploys and what it handles, but there are other things outside Argo CD that you need to worry about. So, bringing this all together: better together, right? Keptn and Argo CD, better together. I've been working with Andi for a while, and I've been learning more about Keptn as I go along. It's great at tying into your Kubernetes cluster: Keptn gives you trace observability across your entire deployment, so it can bubble things up via OpenTelemetry and you can build things like DORA metrics.
You can build things like Grafana dashboards. I'm probably missing some things — I'm staring at the Keptn and Dynatrace people staring at me, hoping I get this right, as I'm still learning on this journey. Keptn has a better view of your entire system, and it does this natively within Kubernetes. And it mends broken hearts — there was a broken heart, and now there isn't. What's really cool is that once Keptn is installed, it's as easy as annotating the namespace. When you annotate that namespace, you get traces of everything that goes on in a deployment — in an application. And here I need to distinguish between a Keptn application and an Argo CD application, because in tech we love to overload terms (same with "projects" — everyone calls everything the same thing). A Keptn application is made up of coupled components, and that can cover a lot of things: not just Argo CD applications themselves, but the components inside an Argo CD application and components outside an Argo CD application. You get a trace of all that information, which Keptn exposes via OpenTelemetry. And then you can build cool DORA metrics like these: you can see how many deployments succeeded, how many active deployments you have, how many failed, and how long between deployments. Keptn — and I think Andi will go a little deeper on this in a bit — really becomes almost a generic interface for telemetry in your Kubernetes cluster, because you can hook it up to different metrics providers. Now you have a convenient interface into all the telemetry data you have about your deployments. Ah — I didn't realize this slide was animated. I should have.
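For reference, the "annotate the namespace" step really is that small. A minimal sketch, assuming the annotation key documented by the Keptn Lifecycle Toolkit and a placeholder namespace name:

```yaml
# Hypothetical namespace manifest. Once the Keptn operator is
# installed, any namespace carrying this annotation is observed.
apiVersion: v1
kind: Namespace
metadata:
  name: sample-apps                        # placeholder name
  annotations:
    keptn.sh/lifecycle-toolkit: "enabled"  # opt this namespace in
```

On an existing namespace, the equivalent would be something like `kubectl annotate ns sample-apps keptn.sh/lifecycle-toolkit=enabled` — check the docs for the exact key your Keptn version expects.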
Really, how Keptn works is that it extends the pod scheduler — it works with the pod scheduler. So no matter what happens on the system — whether Argo CD does something, whether a CI system does something, whether I go in and do a kubectl apply — Keptn knows about it, because it actually works with the pod scheduler, with that controller, so you can have an end-to-end view of your deployments. Whether it's me launching a pod or Argo CD rolling out a deployment, you get to see the entire history — the slide before, where you saw the traces. You can see how long something took and its history, and you can bubble all of that up to Grafana. And maybe one more thing: folks, if you're deploying a single pod, that's easy. If you're deploying an app that's more complex — like here: frontend, backend, storage — what Keptn does, as Christian explained, is hook into the pod scheduler and automatically create spans. Who's familiar with OpenTelemetry, traces, and spans? We create a span for every deployment that happens. But we also look at the metadata on all of your workloads. If you annotate your workloads correctly, so that Keptn knows these three workloads actually belong together as one logical app, then we create not only spans for each workload, and not only metrics about how often a workload was deployed and how long it took — we also automatically understand the logical application around these workloads. For that, we stitch all of these spans together into one end-to-end trace showing how long it took to deploy one, two, three, five, ten, a hundred workloads that belong together as an app: when the deployment started for each individual workload, where it failed, where it went wrong. And I think we heard a lot about that earlier today.
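To make the "annotate your workloads correctly" part concrete, here's a sketch using the recommended Kubernetes labels that Keptn reads from the pod template (all names and the version value are placeholders):

```yaml
# Sketch: labels Keptn uses to group workloads into one logical app.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: podtato-head-frontend
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: podtato-head-frontend
  template:
    metadata:
      labels:
        app.kubernetes.io/name: podtato-head-frontend  # this workload
        app.kubernetes.io/part-of: podtato-head        # the logical app
        app.kubernetes.io/version: "0.3.0"             # deployed version
    spec:
      containers:
        - name: frontend
          image: ghcr.io/example/podtato-head-frontend:0.3.0  # placeholder
```

Workloads sharing the same `part-of` value get stitched into one end-to-end deployment trace; the `version` label is what drives the per-version deployment spans and metrics.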
As you said, Argo typically works well until it doesn't, and then you need observability to figure out what's wrong. That's really what we're doing with Keptn: you just install it as an operator, annotate your namespace, and that's it. And now... now we cross our fingers. For the folks who came in late: I prepared the demo earlier on GitHub, so if nothing works today, then at least please try it yourself, because the GitHub repository is out there — it's just Codespaces that I'm using. Let me just quickly check: kubectl get pods. So you see what I've done — everything is running now, that's good. Is it big enough, or shall I zoom in a little? A little bit? Okay. So what this does is install Argo, and Argo then installs Keptn, Grafana, OpenTelemetry, and Jaeger. Let me just open up my Argo CD. And first lesson learned: admin/admin. Yes, it's the default. It's not secure — and it's also really annoying, because you always get these popups. Yeah, that's why you get the warning: this is not okay. So this is my Argo CD. It has deployed Grafana, Jaeger, my Prometheus stack, and also Keptn. Now, what I want to do here — this is also all explained in the demo — is quickly deploy my sample apps. I've prepared a kubectl apply for the simple-node app: kubectl apply -f simple-node. Here we go. And then the other one: kubectl apply for the so-called podtato-head. Here we go. I want to show you now what's happening behind the scenes. First of all, Argo obviously picks it up and is going to deploy podtato-head — but Keptn is enabled on that namespace. What does that mean?
If I look at the definition of my app — sample-apps, podtato-head, under the deployment — and let me zoom in again, because it's small even for me; that's the problem with getting old. The only thing you really need to do after you've installed Keptn on your cluster is annotate the namespace. From that point on, Keptn will monitor and observe that namespace. It looks at Deployments, ReplicaSets, StatefulSets — I'm sure I'm forgetting a couple of other resource types — and then automatically checks: is there a version label? Is there a part-of label? With part-of, Keptn knows that this podtato-head entry belongs to a logical app, podtato-head. And then there's a second deployment here: the head of podtato-head. podtato-head is one of the sample apps out there in the CNCF space. With these annotations, Keptn really knows that all of these different workloads belong to one logical app, and I get an end-to-end trace of everything Kubernetes does to deploy all the workloads that belong together. Now, additionally, what we've just introduced is a so-called Keptn app context object, which I've also created here. It lets me add extra metadata that will later be added to the OpenTelemetry trace. We talked about this yesterday: there is more happening before and after Argo does its job. There are GitHub Actions doing builds — there's CI. If you create traces in your CI, you can add that trace ID to the Keptn app context object, check it in, and deploy it with your app, and then Keptn will create a trace that's linked with the trace from your CI. The second thing: we talked about multi-stage deployments. If you have dev, staging, and production, you can just pass the OpenTelemetry trace ID from one stage to the other during the promotion phase, and then Keptn will create traces that are all linked together.
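As a rough sketch of what that context object can look like — assuming the `KeptnAppContext` CRD shape from recent Keptn Lifecycle Toolkit releases; the API version, values, and the trace-context string below are all placeholders, so check the docs for your version:

```yaml
# Hypothetical KeptnAppContext: extra metadata and trace links that
# get attached to the generated deployment trace.
apiVersion: lifecycle.keptn.sh/v1
kind: KeptnAppContext
metadata:
  name: podtato-head          # must match the logical app (part-of) name
  namespace: sample-apps      # placeholder namespace
spec:
  metadata:                   # free-form key/values added as trace attributes
    commit-id: "abc1234"      # placeholder: who/what produced this version
    author: "andi"
  spanLinks:                  # link to an upstream trace, e.g. the CI run
    - "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"  # made-up W3C traceparent
```

The same `spanLinks` mechanism is what the multi-stage idea relies on: each stage's promotion passes its trace context along, and the next stage's trace links back to it.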
So you can really, finally, get one end-to-end trace from the first code commit across all of the different stages, including promotions. Yeah, and that's what we mean about the more obscure things outside of Argo CD: Argo CD doesn't understand the relationship between environments, and there are things like building the artifacts that you have to take into account for your deployment — not just the very tail end of it, but everything leading up to it and everything after it as well. Cool, and now the magic moment from earlier, for those people who were here earlier and witnessed this. First of all: admin/admin is still not a good idea. I know — but I'm the only one who can access this Grafana and Argo, so I'm pretty safe. The magic moment, drum roll — if it doesn't work, I'll quickly switch over to my screenshots. Okay, here we go: "no data" — but there is actually data, because what we have is a trace. So here, ladies and gentlemen, is an end-to-end distributed trace that Keptn generated for me when I deployed podtato-head, showing that it took 20.41 seconds to deploy the whole thing. It gives me an overview of every single workload that belongs to that app. I also see those attributes — remember, earlier I had my commit ID and my author. This is all the stuff, if I go back, that is in here. That means you can actually enrich the traces with additional context information that makes life easier later on for troubleshooting: who actually committed this? Who owns this piece? So we automatically get an end-to-end trace. Now, I want to do one more thing. I want to change my app, because now I'm really feeling good with the demo gods. I'm going to deploy a different version here, 301. And then for my second piece here, I'm saying this is 313. And I am just committing my changes. Live, too. You're doing it live. I'm live, yeah, I am live.
This is not fake. Let's see if this works. I'm committing the changes — I know what I'm doing. So the idea is that with every change made in my Git repository, Argo obviously does its job: it should figure it out, it's auto-sync, and I'm one of the brave people who use auto-sync. So it will sync now, and in the background Keptn will start tracing all of the deployment changes. And in case something breaks, I'll also get information about any problems that happened: the image couldn't be pulled, or anything else. Typically, when we have more time, I run multiple deployments, including some that fail — this just takes a little time. But I hope you believe me that if you do this multiple times and run through the different scenarios, you really get not only traces, you also get metrics. Keptn exposes not only OpenTelemetry traces but also Prometheus metrics about deployment time, deployment frequency, and time between deployments. So you get at least a basic set of your DORA metrics, and the only thing you need to do is annotate your namespace so that Keptn observes it. And you want to make sure to follow the Kubernetes best practices on labels and annotations: put the version label on there, put the part-of label on there, and that's it. And if a deployment fails, you obviously also get traces that show the failures. And the live demo — look at this, it worked: I got a second trace now, there we go, the second trace is here. Three-oh-one — ah, that's awesome, nice. Good, how are we doing on time? About three minutes. Three minutes — that means it's time to bring it home. There is more that Keptn can do. We not only create traces and metrics, we also have the capability to run what we call evaluations and tasks before and after a deployment.
We talked about this during lunch: can we maybe do some policy checks before we actually allow things to deploy? Keptn can run a task pre-deployment and ask: do we have a healthy environment? Are all the policies in place? Are the images checked — whatever needs to be done. So we can do pre-deployment checks and pre-deployment evaluations, and if those fail, we don't allow Kubernetes to deploy the containers. Post-deployment, same thing. We can pull in any type of metric from any type of observability tool — that's also very important. We can pull in Prometheus data, data from Dynatrace, from Datadog, from New Relic, from any observability platform, and then specify SLIs and SLOs to validate that a deployment is not only running but also healthy. We'll share the slides covering the things I just mentioned, some of these additional capabilities. On the left, the analysis definition: this is really where you specify your objective — which metrics should be analyzed against which objectives — and then Keptn gives you a health indicator. In the middle, the Keptn context, which I mentioned — as I said, really cool for cross-stage tracing. And Keptn metrics in general can be used not only for SLO validations: we can pull these metrics into Kubernetes from external observability tools, and you can use them for KEDA, for the HPA, or for Argo Rollouts. Some additional resources — QR codes, the GitHub demo on the left: just fork the repo, and then pray to the demo gods that when you click on "create codespace" it works as beautifully as it did on my end. And there's the free Argo CD training: Akuity offers free Argo CD training — sign up and get hands-on with Argo CD if that's one of the skills you want to elevate. Free training on the Akuity Academy; we're always adding things to it, so feel free to check it out.
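To make the pre-deployment idea concrete, here's a hedged sketch of how such a gate can be wired up, assuming the `KeptnTaskDefinition` CRD shape from recent Keptn Lifecycle Toolkit releases (the API version, names, URL, and script body are all placeholders — not a definitive implementation):

```yaml
# Hypothetical task: a small Deno/JS function Keptn runs as a job
# before the workload's pods are allowed to start.
apiVersion: lifecycle.keptn.sh/v1
kind: KeptnTaskDefinition
metadata:
  name: check-environment      # placeholder name
  namespace: sample-apps       # placeholder namespace
spec:
  deno:
    inline:
      code: |
        // Placeholder check: block the deployment if a required
        // internal endpoint is unreachable. A non-zero exit fails
        // the task, and Keptn holds back the pods.
        const resp = await fetch("http://example.internal/healthz");
        if (!resp.ok) {
          Deno.exit(1);
        }
```

The task is then referenced from the workload (or the whole app) via an annotation along the lines of `keptn.sh/pre-deployment-tasks: check-environment`; post-deployment tasks and the SLO-style evaluations use analogous annotations and an analysis/evaluation definition with objectives.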
Cool, and then the Keptn website — very easy to find as well. And one last thing: because we've been seeing that people are really asking for more observability in Argo, and Keptn already fulfills certain use cases, there's issue 355 on GitHub, which Arlo opened yesterday, about learning additional use cases. In the morning we heard a lot about Argo Workflows. If Argo Workflows fail, I also want some telemetry, maybe some traces — so I think that could be an interesting additional use case. But we would like to hear from you: are there additional use cases where, in your day-to-day life with GitOps, with CD, with Argo, you need an additional level of observability? Is this something where Keptn could help and give you exactly the data you need for better troubleshooting, better reporting, better understanding of what's going on? So, issue 355: please comment with your use cases, or just give us your thoughts in general. And with this — thanks, and we'll open up for questions. Yep, thank you. There are mics up there, I believe, if you have a question. Yeah, I know it's a long walk, sir. Hi, I work at iFood in Brazil, and we're just adopting Argo Rollouts for blue-green and everything. One of our issues is really tracing everything — like who did what in the promotion phase. Can Keptn help us with this? Can we do all this tracing with Rollouts too? Because we're not using Argo CD for deployment; we're just using Rollouts for that blue-green and canary part. In general, Keptn is agnostic to however you deploy. Because we're hooking into the pod scheduler, whether you do a kubectl apply, or deploy with Flux or with Argo, it doesn't matter — Keptn will be able to see it.
And there's a nice demo the team also did — a great one, called the cross-stage promotion demo, on the Keptn website. It's fully documented, so check it out to see how you can really trace from one stage to another, including promotions. Thank you very much for your talk. All right, I think that's it. Yeah, thank you. All right, thank you everyone.