Welcome, everyone, to this presentation about Keptn. I'm Giovanni Liva, and I've been a Keptn maintainer for several years. Next to me there should have been another face: Anna, who unfortunately got sick, but she's with us in our hearts. Today I'd like to talk about why you should use Keptn and why Keptn is great in your day-to-day work. The TL;DR for that question is: to make sure your Kubernetes deployments simply work.

But isn't deploying on Kubernetes easy? I just apply some YAML manifests and Kubernetes takes care of everything: it spins up the resources, the pods and the services, and everything should be fine, right? Or, if I want to be more production-ready, I can set up a GitOps workflow where Argo continuously syncs my manifests from my state repository into my cluster, Argo keeps track of the state changes, and everything is done. So deploying on Kubernetes should be easy, right? Right?

Well, if you use Argo, I think you've seen this little broken heart a lot. The reasons for an application being degraded are many, and if you've used Kubernetes for a while, you know there are countless ways to fail. Some are quite common, like this one: the OCI registry doesn't contain the image, so the container cannot be pulled. Kubernetes doesn't know what to do; the pod just sits there, stuck, so you need to discover this yourself and try to fix it. Other problems are much harder to debug, because they require knowledge about how the different manifests relate to each other. For instance, if I mount a ConfigMap as a volume and the ConfigMap doesn't exist, the pod is stuck: Kubernetes doesn't know what to do because the ConfigMap isn't there. To troubleshoot this kind of error, you need to understand how the manifests relate to each other, know which ConfigMap is mounted in which pod, check whether it exists, and find out why it was never created. Other errors require much deeper knowledge about Kubernetes. Scheduling errors, for instance, require you to know the inner details of how Kubernetes works. Say I don't have any more IPs in my cluster: my node has no available IPs left, so the pod is simply stuck. To understand why, you need to understand the internals of Kubernetes and how the networking layer works, in order to figure out that maybe you need to increase your subnet size.

And even if you manage a successful deployment, it doesn't mean everything will still work tomorrow. Maybe I have an application that connects to an external database; today I deploy the new version and everything is fine, but tomorrow another team from another department, on the other side of my organization, applies a set of manifests with some ConfigMap or NetworkPolicies that break the connection to the database. Now my application doesn't know what to do; it crash-loops, and it takes a lot of work to understand what went wrong.

But besides the common troubleshooting problems with Kubernetes deployments, there are also issues when it comes to processes. After you make a deployment, you also need to understand whether the new version of the application you deployed is healthy. Earlier we saw an app marked as degraded by Argo, and that's one definition of application health.
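To make the ConfigMap case concrete, here is a minimal sketch of a Deployment that mounts a ConfigMap as a volume; all names and the image are placeholders I made up for illustration. If `app-config` was never created, the pod stays stuck in ContainerCreating, and only the events reveal why.

```yaml
# Sketch: a Deployment that mounts a ConfigMap as a volume.
# If the ConfigMap "app-config" does not exist in the namespace, the pod
# never starts and sits in ContainerCreating with a "configmap not found" event.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  replicas: 1
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
        - name: frontend
          image: ghcr.io/example/frontend:0.2.0   # placeholder image
          volumeMounts:
            - name: config
              mountPath: /etc/app
      volumes:
        - name: config
          configMap:
            name: app-config   # must already exist, otherwise the pod is stuck
```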
But I'd argue that having the application just running doesn't mean much for every kind of organization. Every application has its own definition of application health, and we cannot rely on pod health alone to capture that concept. Also, once I know my application is healthy, I'd like to automate the promotion of the artifacts that make up my application across different environments, because I don't want to manually apply changes or create PRs across the different Argo state repositories to promote across stages. I want to be fast and get things into production, so my customers can use the new cool things we built. And if you've ever tried to troubleshoot one of the errors I showed you before, you know there is a lot of noise in Kubernetes: a ton of events and logs you need to read through. So one goal is to improve the signal-to-noise ratio, so you know where the problems are and can focus only on the ones that really matter.

How can we address all of this? Well, we're at KubeCon, so observability is an easy guess, and this is a Keptn talk, so Keptn is the second part. If you combine observability and Keptn, you can solve all of the issues I just presented. The first part is fairly straightforward: with observability, you know the root cause of a failure. So let's focus more on the process type of issues, like defining application health rather than just relying on pod health.

To set the stage: pod health is a concept provided by Kubernetes, and it usually relies on two special probes. On the right-hand side you can see the liveness probe, which is currently failing. The liveness probe is nothing more than Kubernetes asking: can I try to restart the pod? Have you tried turning it off and on again? Nothing more than that. Whenever there is an issue with the container, Kubernetes just tries to restart it. If the application instead says "I'm alive", Kubernetes will also check the readiness probe, and if it gets an HTTP 200, it routes traffic to the pod. If it's not a 200, it simply doesn't.

Sounds good, so can I just use probes to define my application health? Well, that's exactly what Argo does. But what if Argo says my application is healthy, yet I just deployed my new version and now the response time for pressing the payment button in my cart is five seconds to load the next page instead of 300 milliseconds? Do you consider that application healthy? I'd say no, because slow is the new down.

So let's set the stage for a small demo. My awesome application, version 0.3.0, consists of three microservices: a frontend, a backend, and a storage layer. Now I want to deploy the new version, my awesome application 0.3.1, which just bumps the version of the frontend, and I want to make sure this deployment works well, there are no issues, and my application is healthy, meaning my response time has a good value. The first question is: how can Keptn understand that my application is made up of these three microservices? The nice thing is that we sit in front of your Kubernetes cluster: no matter which tool you use, you apply some manifests to the cluster, and all of them pass by Keptn, so we can watch everything that happens inside your cluster.
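To make the liveness/readiness distinction concrete, here is a minimal sketch of the two probes on a container; the paths, ports, and timings are illustrative values, not something Keptn prescribes. Note that a checkout page taking five seconds would still pass both probes, which is exactly the gap being pointed at here.

```yaml
# Sketch of pod health via probes (paths and timings are placeholders).
apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  containers:
    - name: frontend
      image: ghcr.io/example/frontend:0.2.0
      ports:
        - containerPort: 8080
      livenessProbe:            # fails -> the kubelet restarts the container
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 10
      readinessProbe:           # fails -> the pod is removed from Service endpoints
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5
```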
And this helps us provide observability out of the box, through Prometheus metrics and OpenTelemetry traces, but also build a logical model of how the different services and pods belong together in your cluster, how the different business applications are distributed inside it. We can create this application concept, so we know that those three microservices belong to one specific business application. We do that through the recommended Kubernetes labels. Kubernetes recommends several labels, but we only need three of them, and one of those is optional. app.kubernetes.io/name is the one we use to identify and name your microservice. app.kubernetes.io/version is optional: it's nice if you provide it, but if you use something like latest, we pick the version from your container. And app.kubernetes.io/part-of is the one we use to group different components of your cluster into one logical business app. If you provide these three labels, Keptn can do all the observability around your deployment and bundle everything up into a business app. So whenever I deploy from 0.3.0 to 0.3.1, I know the diff: I know that only the frontend service is bumping from 0.2.0 to 0.2.1, so I can test the quality of just this new service, decide whether it's acceptable or not, and prevent slow applications.

And how can we check whether the response time stays within the pattern we expect? Well, we follow the SRE book. The SRE book defines three concepts you can use: SLI, Service Level Indicator; SLO, Service Level Objective; and SLA, Service Level Agreement. An indicator is nothing more than a metric, a signal, something you can measure from your service. An objective is a goal you set on a specific metric: for instance, the metric can be response time, as in this example, and the goal is that it should be less than 300 milliseconds. A cumulative set of objectives can then support an SLA, which is an agreement that usually states how many nines of uptime my service has; thanks to multiple objectives, I can tell whether I can guarantee that uptime or not.

Keptn sits right in the middle: it provides you a way to define SLOs and SLIs, and it does that by abstracting any observability platform away from your Kubernetes cluster. Keptn sits between the cluster and any observability platform of your choice, fetches the metrics from those observability tools, and translates them into the Kubernetes-native metrics API, so every other tool that understands Kubernetes APIs can work with them. I no longer need point-to-point integrations: I don't need my new tool to be integrated with Prometheus, Dynatrace, Datadog, you name it. We do this abstraction in such a way that any tool can integrate with us, including native Kubernetes tooling like the HPA, the Horizontal Pod Autoscaler. So the SLIs come from an observability platform that you install and maintain, and Keptn has some dedicated CRDs where you define the SLOs. Keptn continuously fetches the metrics from your observability provider, translates them into Kubernetes-native metrics, and checks the results against the SLOs you set. So it's not just that you validate "my application is good" at the first deployment: we keep monitoring these values, because a week after you deploy, maybe a memory leak makes the response time degrade. We continuously monitor that, and you can hook your tooling in to react to it.
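As a concrete sketch: the three labels on a workload, plus a hedged example of how an SLI and SLO could be expressed with Keptn's metrics and evaluation CRDs. The exact API versions and field names vary between Keptn releases, and the provider name, Prometheus query, and 300 ms target are illustrative values I made up, so treat this as the shape of the configuration rather than copy-paste config.

```yaml
# The three recommended labels Keptn relies on (all values are placeholders).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: frontend
  template:
    metadata:
      labels:
        app.kubernetes.io/name: frontend                  # microservice name
        app.kubernetes.io/version: "0.2.1"                # optional; image tag is the fallback
        app.kubernetes.io/part-of: my-awesome-application # groups services into one app
    spec:
      containers:
        - name: frontend
          image: ghcr.io/example/frontend:0.2.1
---
# Hedged sketch of an SLI: a metric fetched from an observability provider.
# API versions and fields may differ in your Keptn release; check the docs.
apiVersion: metrics.keptn.sh/v1beta1
kind: KeptnMetric
metadata:
  name: frontend-response-time
spec:
  provider:
    name: my-prometheus            # a KeptnMetricsProvider you have defined
  query: "histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket{service='frontend'}[5m])) by (le))"
  fetchIntervalSeconds: 30
---
# Hedged sketch of an SLO: an evaluation objective against that metric.
apiVersion: lifecycle.keptn.sh/v1beta1
kind: KeptnEvaluationDefinition
metadata:
  name: frontend-slo
spec:
  objectives:
    - keptnMetricRef:
        name: frontend-response-time
      evaluationTarget: "<0.3"     # target: below 300 ms, expressed in seconds here
```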
So you can do day-two operations on top of it. Now that I know my application is healthy and doing great after deploying a new version, how can I promote it across environments? If you look at a typical GitOps setup, on the left-hand side we have a repository; it can be any kind of repository. You install Argo or Flux or any other GitOps tool of your choice, which watches that state repository, and whenever there is a change, the tool syncs those changes into your Kubernetes cluster. Since you just installed Keptn, you now know whether your application is healthy, so why not let Keptn trigger an action that bumps the version of the artifact in the next environment, in the repository that controls the state of the next environment?

For that, Keptn knows when an application has been deployed. Since we watch everything that happens in your cluster, we know when all the deployments have finished and you have run an evaluation confirming that everything is healthy. So we let you hook in a container that triggers the promotion. We simply provide a hook, and inside this hook we pass all the contextual information: what the deployment was, which services were deployed, which evaluations were run and what their results were; all of this is provided as environment variables to the container you supply. Why don't we take an opinionated view on the container that runs the promotion? Because at KubeCon over the past years we have spoken with a lot of practitioners, and we found that everyone is special. Everyone has their own use case, because every company is different: there are legal processes involved in getting something into production, the structure of every organization is different, so we cannot come up with one size that fits all. Therefore we just let users do what they know best, promoting their own artifacts, and we provide them the context to make informed decisions. Through this hook, Keptn lets you promote artifacts across the different stages.

Now that your application is automatically being deployed from dev to production, suppose there is an issue and you need to find out what went wrong. You have a lot of logs to fetch, read, and search for problems, and a lot of Kubernetes events being generated, and you need to figure out how to process all of them to improve the signal-to-noise ratio and be more effective at addressing problems. For that, Keptn provides a lot of metrics out of the box, in particular a subset of the DORA metrics. So you know how many successful deployments you have, when there are failures, which versions are failing, and the time between deployments; you can also identify teams inside your organization that are very good at getting the new cool things into production quickly, learn from them, watch their patterns, and apply them to other parts of the organization that are more junior with Kubernetes, to support them and make them as effective as the rest of the company. Besides that, Keptn also provides you with OpenTelemetry traces that you can ingest into any platform of your choice; in this example it's Jaeger, showing everything that happened during a deployment of the new version of the application.
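As a rough sketch of what such a hook could look like: a KeptnTaskDefinition that runs a container of your own, attached as a post-deployment task. The CRD fields, the annotation, and the context variables Keptn injects depend on the Keptn version, and the promoter image and script here are entirely hypothetical; this only shows the shape of the idea, not the definitive API.

```yaml
# Hedged sketch of a promotion hook: run your own container after a successful
# deployment. Exact fields and injected context variables depend on your Keptn
# version; the image and script are hypothetical placeholders.
apiVersion: lifecycle.keptn.sh/v1beta1
kind: KeptnTaskDefinition
metadata:
  name: promote-to-staging
spec:
  container:
    name: promote
    image: ghcr.io/example/promoter:latest        # your own tool: open a PR, bump a tag, ...
    command: ["/bin/sh", "-c", "/promote-to-staging.sh"]
# Attach it so it runs after the deployment, e.g. via an annotation on the
# workload's pod template:
#   keptn.sh/post-deployment-tasks: promote-to-staging
```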
And if you zoom in when there is an error, I know the Jaeger UI is not the best to show this, but you can see in the middle of the text that there is error=true. So this trace failed, and the span events tell us why: my response time is now 836 milliseconds, which doesn't meet my criterion of being below 300 milliseconds. Through traces you can quickly identify when there is an issue and also which microservice has it. And if you recall the earlier example about my awesome application being promoted across stages: we can also carry this trace ID across the different environments during the promotion phase, so you can link the trace parent across stages. Whenever something goes wrong in production, you can drill back down across stages, following the different spans, to understand the process and figure out why what reached production was not good. So you can learn from your errors and improve the situation for the future.

So now that I've hopefully made the case for Keptn, how can you use it? The best way is simply to install our Helm charts; you find all the installation instructions on our website at keptn.sh. As I said before, we monitor everything inside your cluster, which might require higher permissions than many of you are comfortable with, so we don't do that out of the box: you need to opt in. You enable the namespaces Keptn should watch, and that's as simple as adding the annotation keptn.sh/lifecycle-toolkit: enabled; if you add it to a namespace, we will watch everything inside just that namespace. The cool thing is that we are cooking a stable 2.0 release, which contains all the goodies I just talked about, and we currently have a release candidate for it. So please try it out, be vocal, tell us what went well and what went wrong, so we can polish all the rough edges and make a really solid stable 2.0 soonish. We are on the CNCF Slack, and we are also here at the project pavilion.

Before I conclude, because I'm running out of time: there is also some movement in the Keptn community. Argo loves Keptn; there was a nice presentation yesterday about this topic, where we are exploring how we can better integrate with Argo, because Argo and Keptn work very well together. Argo does the synchronization, and Keptn provides the observability and the tooling to promote across stages, so why don't we work together more? We'd like to hear from the community: how do you use Argo? How do you see Keptn being integrated with Argo? Please share your feedback and the way you envision this collaboration under issue 355 in the Keptn community repo. And if you want to discuss Keptn further, I'm at the pavilion booth tomorrow morning and also Friday morning. Thanks a lot, and see you at the booth.
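For reference, the opt-in mentioned in the installation part above would look roughly like this on a namespace; the namespace name is a placeholder.

```yaml
# Opt a namespace in to Keptn's observation (namespace name is a placeholder).
apiVersion: v1
kind: Namespace
metadata:
  name: my-awesome-application
  annotations:
    keptn.sh/lifecycle-toolkit: "enabled"
```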