Well, thank you so much for being here. A quick show of hands: who has heard about Keptn before? Who is using Keptn right now? Who wants to use Keptn because they're intrigued by it? Awesome. Who has visited us at the booth or talked with us? Awesome. Does anyone have no idea what Keptn is or does? Cool. Thanks for coming. I think you're all in the right spot, because we'll give you a little overview of where we have been, where we are, and where we are going. You see four people on the slide. Brad was supposed to present as well. Unfortunately he had to leave a little earlier to fly back to New Zealand, so he couldn't make it, but he's very active in the community, so we put his Twitter handle there so you can follow up, or just join the Slack channel and talk to him. Who are we? I'm Ana Margarita Medina, a Staff Developer Advocate at Lightstep. My main focus has been reliability work for the last six years. I'm a self-taught developer, turned SRE, turned DevRel, and that will all come full circle when we talk a little bit more. And who are you? I'm Andy, Andy Grabner. I call myself a DevOps activist at Dynatrace. DevOps activist means I try to help people use observability data, because I work for an observability company, just like Ana, which is also a great sign: we are two kind-of competitors, but we're all working together to help the community get better at their delivery practices and their SRE practices. That's what I do on a day-to-day basis, telling people what we can do with these tools, whether it's observability platforms or open source. And that leaves us with you, Thomas. Yes, my name is Thomas Schütz. I'm a principal engineer at Dynatrace, and I think I'm the only one on the stage who contributes code to Keptn, so I'm a Keptn maintainer.
Yes, and I was a platform and systems engineer for about the last two decades, and I'm trying to get all of my experience into Keptn to make the world a bit of a better place. Awesome. So maybe you want to kick it off a little bit on why we are actually here and where we are. Yeah, so as I mentioned, I come from the world of reliability. I spent a year and a half at Uber doing SRE work, specifically chaos engineering, and helping Uber move from bare metal to cloud infrastructure. I learned a lot there, and then spent four years at Gremlin doing chaos engineering. During that time I got a chance to meet Andy, and Andy was like, hey, SRE, amazing hot topic. The Google SRE book had come out in 2016/2017. Folks were starting to learn more about it; our systems were getting a lot more complex, and it takes a lot to keep them up and running, as a lot of you probably know. How is it that we can do better? What are the things we can automate? Where is the industry going? And that takes us to this: when we think about reliability, we look at our SRE teams. Our SRE teams end up taking on so many duties, and SRE is also going to look very different in every single organization, depending on the scale of your applications and how many users and customers you have. But we can see that they're in charge of managing data centers and cloud infrastructure, building tooling so that deployments go out properly, making sure there is an incident response system, that every single system or service actually has engineers on call, that they're practicing their on-call skills, all the way to asking: how can I embrace modern technologies like chaos engineering and observability to make sure that the failures we're having today don't affect us next week, next quarter, or next Black Friday? And we are also learning from companies that have been doing this for a while.
So we get to use more metrics that tie into reliability, like service level indicators, agreements, and objectives. And with that, we get to think about how we can make it better. Keptn takes that into account. Let's look at all those duties that I just showed you: how do we automate them? How do we standardize them? How can we make sure that we're still thinking about experimentation in terms of leveraging a lot of the other tooling internally, or other CNCF projects? With Keptn, what we're trying to do is bring reliability closer to every single developer. How can they know that they can actually deploy an application because their infrastructure is ready? Or how can they know, as they're deploying their services, that it's not just that they deployed it and kubectl says the pods are up and running, but that users are actually able to have a good experience? That ties into some of the work that Thomas is going to talk about next, too. What have we been learning from folks that are using Keptn? We've been learning a lot, and what I really like about my role, as I mentioned in the beginning, is that as an advocate I really try to help people adopt this type of technology. We want to provide tools to ensure that your systems stay reliable and resilient. And as you mentioned, SLOs are a big topic, everything that Google has been talking about in their SRE handbook and in their workbook. Has anybody read the SRE book? Is anybody familiar with the SRE practices? Yeah. So what I think is interesting with Keptn, what we try to solve, is making sure that your systems are reliable before you deploy into an environment, kind of shifting left. We want to actually use SLOs before you deploy into an environment, because you don't want to deploy into an environment that is not ready.
And after you deploy, just as you said, we want to make sure that it's not just that the pods are up and running, but that everything is good. So at the core of Keptn we work with some of the practices from the Site Reliability Engineering book, and Tars made a really nice comment. Tars is one of the performance engineers at Facebook; he invited us to present at his performance summit two years ago. He said Keptn feels like a reference implementation of Google's Site Reliability Engineering book and Site Reliability Workbook. I think that's a great testimonial from an expert out there. What else have we learned over the last three years? Keptn was born in 2019, three years ago, actually in January, so it will soon be four years. And there are a couple of use cases that we've seen people adopt Keptn for. The most common one is automated release validation. This is an example from Raiffeisen Software, the software organization that builds all of the online banking software in Austria and also for other parts of Europe. They are integrating Keptn into their Jenkins pipeline. After a release gets deployed, Keptn automatically validates their SLIs and SLOs by pulling the data from the observability platform. So automated release validation is really big. The next one is end-to-end delivery. This is an example from SAP. They initially used Spinnaker for their canary rollouts, and they switched to Keptn, letting Keptn basically do all of the continuous delivery rollout and make the decisions, again based on SLOs, on when it is a good time to roll out the next batch of canaries. And as you can see here, they are saving 80%, where the 80% actually meant two things: speed in terms of deployment, but also reduction of the configuration that is necessary. The last use case that we also see adopted is auto-remediation. This is from P&G, Procter & Gamble.
They were in the initial phases and were using Keptn to react to a problem in production and then have Keptn automatically execute remediation actions. If you want to read more about these three use cases, we have YouTube videos on the Keptn YouTube channel for Raiffeisen, SAP, and P&G. So you can see it's widely adopted. Now, the other thing, though, is that we got a lot of questions over the last years. I see some of you in the room who came to us; some of you actually gave us really honest feedback. Because if you look at Keptn the first time, you say: so that means you're a new deployment tool, you're competing with Argo, you're competing with Spinnaker, is that what you do? Then people say: so you're a monitoring tool, because you're looking at SLOs, is that what you do? Or are you an auto-remediation tool? Are you an SLO tool, because you're doing SLOs? Are you a testing tool, because you also execute tests? Or are you also a workflow engine, because you can do all these things? So we got a lot of questions, because I think we really went off into many different areas and built something that can be used for all of these things. But remember, we started with all of this in 2019, and the world in 2019 looked a little bit different. In 2022, at least most of the people we talk to are using Argo CD or Flux for deployment; they're much better suited for that. They have Argo Rollouts for canary deployments; you don't need Keptn for that. There are great observability platforms: we have OpenTelemetry and Prometheus, we have Lightstep and Dynatrace. These are all great observability tools, so we don't need to think of Keptn as an observability tool. So there were a lot of questions, and that also meant we had to focus. And before I go over to what we focused on: really, what we tried to do with Keptn from the beginning is to put SLOs at the center. You can see it here.
We really tried, and this is still what we aim for: putting the SLO, the core principle of site reliability engineering, at the center, getting the data from any of your observability platforms, whatever it is, and then connecting to the tools that actually do a certain job: your Argo can deploy, your k6 can test, your Lightstep does the observability. So that was the aim. Now, before I hand it over to Thomas: what we learned, also in conversations with you in Valencia earlier this year, but especially this week, is that 90% of the users that are adopting Keptn are doing it because they really love the SLO validation. And what they're really asking for, the majority says: hey, I would like the SLOs to be more easily integrated as pre- and post-deployment checks, because I don't want to deploy into an environment that is currently broken, and I don't want to break an environment with my deployment. The second thing we learned: we have built a lot of integrations, and even though we standardized on a protocol, on a format, we had to build a lot of Keptn services. Most of them have been built but are not well maintained, because most people really tend to go back to the core use case, which is: I use my other tools for deployment; I don't need Keptn to do my deployment. So please give me a better integration into the existing way I do things. I want to use Argo, I want to use Flux for deployment, but then I want Keptn to seamlessly do the SLO validation without me having to maintain yet another integration. I think this is what we learned, and Thomas, this is what actually brought us to where we're sailing next. Yes, thank you, Andy. In the time we have developed Keptn, we developed a lot of possibilities to integrate other tools into Keptn. So we had around three integration options.
One was dedicated Keptn services, which was a bit heavyweight, because everyone who wanted to integrate their tool had to create a Keptn service to achieve this. The second one was our own runner, called the job executor, which made everything a bit easier, but was also not the perfect approach. And last but not least, we had our webhooks, where you could trigger external tools in a very easy way. But we also found some problems. While I was working on Keptn for the last, I think, one and a half to two years, I always tried to get GitOps into Keptn, and I had a very hard time with this. Integrating GitOps into pipelines can be really challenging, and yes, in Keptn we found nice approaches, but none of them was really perfect. The second thing we faced, and this is what Andy already told you: we had a hard time maintaining our integrations. We built a lot of integrations (you can integrate almost everything with the job executor), but none of them was really nicely maintained. I think of the integrations we had, five to six remained, right? And last but not least, we were an event-based system: we had our control plane, and every tool that was used needed a subscription on our control plane, and the management of those subscriptions was also kind of challenging. So we decided to go down a new path. What we found out in the last three years, and I think every one of you agrees, is that Kubernetes is the leading platform for cloud native apps. A few thousand people are here at the conference; I think that's the proof. The second thing is that GitOps is the dominant approach to delivering these apps. So we needed to find a way to get continuous delivery onto the platform instead of having a whole lot of tools. Therefore, we try to standardize task definitions, evaluations, and application lifecycles using Keptn.
So we want to get all of these things into the cluster. We want to be vendor-neutral, and anyone should be able to integrate in a very easy way. And last but not least, and this is the thing I mentioned before, Keptn shifts your delivery process to the platform in a cloud native way and pipeline-less. With this, we created a new Keptn subproject, which is called the Keptn Lifecycle Controller. It ensures that your application deployment is stable and observable; this is also the reason why some observability companies are involved in the project. We wanted to do this with a minimum of configuration effort. If you take a look at the repository and at the configuration of the Lifecycle Controller, you will find that we tried to avoid creating new custom resource definitions as long as possible. Some things did not exist in Kubernetes, so we had to invent something, but in fact we are trying to utilize the resources available in Kubernetes as far as possible. Furthermore, GitOps for us is not a feature; it is a requirement. Everything we are building fully works with GitOps; it works with all of the principles of GitOps. And we are trying to make delivery pipeline-less. External tools and control planes should be easy to integrate. So we created custom resources for our integrations, where you can specify your tasks either inline, or fetch them from a web server, or whatever. I think this is done in a very intuitive way, and it's very easy for each of the integrators. And you get deep insights into your deployment process, as we will see in a moment.
So you see, at every point in time, where your deployment stands at the moment, how successful it was, how mature your process is, and where you can evolve your deployment process. And last but not least, you get cloud native and application-aware control over your deployments. Okay, so how do we achieve this? First, we are extending Kubernetes with application lifecycle awareness. If you think about Kubernetes, it mainly knows about Deployments, StatefulSets, DaemonSets, and so on, but there is, at least at the moment, no concept of an application in Kubernetes. You cannot bundle multiple workloads into one application. This was one area where we tried to extend Kubernetes. And consequently, for us it's not really relevant which deployment tool you are using. You can use Argo CD, you can use Flux, you can use whatever tools you like; in fact, all of them do nothing more than apply something to Kubernetes, and this is where our work begins. After you deploy something to Kubernetes, and when you have defined an application, we run application pre-deployment checks: tasks and evaluations. There you could check, for instance, whether you have enough infrastructure to get your application running. At the booth, I showed an example where I tested whether there are enough processors left to get the application running. Then, after all of these app-level pre-deployment tasks and evaluations have succeeded, we can do the same thing on the workload level. For us, a workload is something like a Deployment, a StatefulSet, and so on; this runs per workload. After that, so after all of the pre-deployment tasks have finished, we let Kubernetes do its work and schedule the pod. And after the pod has been deployed and is running, we can do the same thing with post-deployment tasks.
Also on a workload level, but on an application level too. Furthermore, our Lifecycle Controller lets you define these tasks with simple scripts. So everything you do here, such as checking for infrastructure readiness, validating error budgets, or, for post-deployment things, running tests and evaluating SLOs, can be done in a way which most of us understand: you can write simple TypeScript functions. In the future, this might also be running a container, or, if needed, you can also attach custom runtimes. And last but not least, you always get full lifecycle observability, and you can make deployments observable via OpenTelemetry in a vendor-neutral way. Therefore, the only thing needed to integrate your observability tool, as long as it's compatible with OpenTelemetry, is to configure the OpenTelemetry Collector properly; you don't have to make changes in the Lifecycle Controller or in your tooling. And you get some fancy dashboards out of the box, which I will come back to later. Thomas, one quick question, because I heard you give this demo several times at the booth today: people always ask how this technically works, for those who want to understand how we actually inject ourselves into the process. So technically, we look at your Deployment and StatefulSet objects for certain annotations that we defined, and we also listen for the Kubernetes recommended labels. We have a webhook running that checks for these labels or annotations, and if they are set, we inject our own scheduler extension. This scheduler extension listens on our custom resource definitions and takes care that your pre-deployment tasks have finished. When they are finished, our scheduler binds the pod, and it runs in Kubernetes. Very cool.
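To make the mechanism Thomas describes a bit more concrete, here is a rough sketch of a workload opting in. The annotation keys and the image name are illustrative assumptions based on the pattern described in the talk, not guaranteed to match the current Lifecycle Controller release; check the project docs for the exact names:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: podtato-head-entry
spec:
  replicas: 1
  selector:
    matchLabels:
      app: podtato-head-entry
  template:
    metadata:
      labels:
        app: podtato-head-entry
      annotations:
        # Hypothetical annotation keys: they tell the mutating webhook which
        # app/workload/version this pod belongs to, so the injected scheduler
        # extension can hold the pod until pre-deployment tasks succeed.
        keptn.sh/app: podtato-head
        keptn.sh/workload: podtato-head-entry
        keptn.sh/version: "0.1.0"
    spec:
      containers:
        - name: server
          image: ghcr.io/podtato-head/entry:0.1.0  # placeholder image
```

Any deployment tool (Argo CD, Flux, plain kubectl) can apply this manifest; the webhook and scheduler extension do their work regardless of who applied it.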
And what I loved about it, and the feedback we've received: whatever you use right now as a deployment tool, and some of you work in organizations that have multiple tools, if you try to enforce pre- and post-deployment checks in your tools, you potentially need to re-implement the same thing in Jenkins, in GitLab, and in other tools. With this, we are bringing all of these lifecycle checks into the core platform, and that's the beauty: you can standardize all of this within the platform. I think one of the other wins is that you get to standardize based on the severity of the services. So when we think about those large-scale systems, like the ten critical applications, those get to keep the exact same evaluations, so that you keep up a similar service level objective at the end of the day. And you can get more information about what's going on in your deployment process through the metrics and traces we are exposing via OpenTelemetry. For instance, we managed to get information about how much time passed between two deployments, on a workload and on an application level, and this is a good indicator for the maturity of your deployment process. Furthermore, we let you know how often a deployment failed or succeeded, which is also very nice when you want to find out how well your application is developed by the time it gets to production. On the other side, we also get a full trace over our deployment process. We know how long it takes until the pod runs; we know how long our pre-deployment checks run, our post-deployment checks, and so on. So we always know where we can improve. And the very nice thing is that we know this on the application level. When we deploy a whole application, and not only one workload, we know how long it took to deploy the whole application, and this is very important in some cases.
As I told you before, you can run your own tasks, and some of you might have wondered what this looks like. This is roughly the configuration. For that, we created two custom resource definitions. The first is a KeptnTaskDefinition, where you can simply define your task. You can add your code inline in this custom resource definition, but you can also keep it outside, on an HTTP server or in a ConfigMap. The second thing we created was a KeptnEvaluationDefinition, which checks some metrics against thresholds to find out if the application actually runs well. Just for completeness: this might change a bit in the future as the project evolves and as we learn in which direction it will go. So, as a summary for the Lifecycle Controller: when you use the Lifecycle Controller, you get application-aware deployment, and I think this is a really, really nice thing. You can make sure that the whole application runs as you expected, and you can also run tests on your application. Furthermore, you get vendor-neutral observability using the OpenTelemetry stack. The integration of external tools and your own code is very easy: you simply have to write a function, and you can easily integrate webhooks and whatever you like, in a very simple way. And last but not least, the whole thing is installable in a very, very easy way. The configuration lives in your Kubernetes manifests and in two to three additional custom resource definitions. The installation is one manifest at the moment, and I hope this will not change. You can get started in a few minutes; for me, the installation of the demo environment here took around one minute, and I was very slow at typing. Okay, with this, back to Andy to bring this home. Yeah, exactly.
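As a sketch of the two CRDs Thomas mentions here: the API group, version, and field names below are assumptions modeled on an early Lifecycle Controller release and, as he says himself, may well change as the project evolves:

```yaml
# A task definition: inline code run in a lightweight scripting runtime.
apiVersion: lifecycle.keptn.sh/v1alpha1
kind: KeptnTaskDefinition
metadata:
  name: check-infrastructure
spec:
  function:
    inline:
      code: |
        // Hypothetical pre-deployment check executed before pods are scheduled
        console.log("checking that enough CPU headroom is available...");
---
# An evaluation definition: a metric query checked against a threshold.
apiVersion: lifecycle.keptn.sh/v1alpha1
kind: KeptnEvaluationDefinition
metadata:
  name: cpu-headroom
spec:
  source: prometheus            # assumed provider name
  objectives:
    - name: available-cpus
      # PromQL query (assumed field names): free cores across the cluster
      query: "sum(machine_cpu_cores) - sum(rate(container_cpu_usage_seconds_total[5m]))"
      evaluationTarget: ">2"    # pass if more than two cores are free
```

The task code could equally live in a ConfigMap or behind an HTTP URL, as mentioned above; inline is just the most compact form for a demo.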
So first of all, just to recap: if you go to the Keptn website right now, you will see what Keptn 1.0 looks like, the first iteration that I mentioned in the beginning. We are currently targeting a 1.0 release in November, and then long-term support; you can find all the information there. What we built here, the Lifecycle Controller, is the future, the result of what we learned over the three years, really making it, as you said, pipeline-less and Kubernetes-native. We would not be here without a lot of help from our contributors. You can see here all the people that contributed, especially also the people that helped us through Google Summer of Code, our community rockstars, and maintainers like Brad. So please, if you see any of them, thank them. And if you happen to be in the room and recognize yourself: thank you as well. Yeah, the last thing before we open up for Q&A: if you want to get in touch with us, please join the conversation on the CNCF Slack. There is a Keptn channel, and now also the Keptn app lifecycle working group. You can also help us by starring us on GitHub, spreading the word, and giving us feedback; I think that's very important. I see people taking pictures; that's great. If I'm not mistaken, there's one more thing. Exactly: if you liked this session, please give us feedback, that's always very welcome. And we have a lot of swag here: there are t-shirts in medium, large, small, and extra small, we have stickers, we have some hats, and we have even more swag down at the Keptn booth in the pavilion. I think we're still there until 4:30ish or 5 o'clock. But now we have one minute for questions, maybe two or three. Yeah. Sorry, I know I said earlier that I would run around with the mic; I'll just repeat the question. You said you're already using Keptn 1.0: how do you migrate? I think if you're happy with Keptn 1.0 right now, there may not be a need to move.
However, right now you need to integrate Keptn 1.0 into your existing pipelines. So one option would be: instead of integrating and calling Keptn 1.0 from all of your pipelines, install the Keptn Lifecycle Controller, and from a pre- or post-deployment task you can trigger your existing Keptn sequences, for instance. There will be an option for that. Follow-up: so can we migrate what we already have, the SLO definitions, et cetera? You would not migrate; you would keep your Keptn 1.0, but what you replace at the bottom is basically the trigger, and that's the benefit of the Lifecycle Controller. Instead of calling your Keptn 1.0 sequence from whatever you use right now, say Jenkins, you let the sequence be called from the Lifecycle Controller. That means you're eliminating the need to integrate Jenkins with Keptn at all, because it will happen implicitly. And going further, as Thomas was saying, we just started with the SLO validation in the Keptn Lifecycle Controller; I'm pretty sure we'll get on par with what we have in Keptn 1.0. We are also working with the SLO community to figure out how we can adapt that standard as well. Next question: this might be basic, I'm unfamiliar with the product, but how do you define an app? Is it a single deployment, or, if your app is more virtual and has multiple deployments, can you work that into the definition of a Keptn app? So currently a Keptn app is a bundle of workloads, and this bundle has a version. Therefore you can define an application where you say: I have this application; for instance, our demo application was called PodTato Head, and we had five workloads in different versions. That is what an app means for us. We are not doing progressive delivery at the moment; we are only making sure that the application in this version, with the workloads in their respective versions, is available. But we are always open for discussion on this.
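To illustrate that answer, a Keptn app as a versioned bundle of workloads might look roughly like this. The API version and field names are assumptions in the style of an early Lifecycle Controller release, and the PodTato Head demo mentioned above has five workloads, more than shown here:

```yaml
apiVersion: lifecycle.keptn.sh/v1alpha1
kind: KeptnApp
metadata:
  name: podtato-head
spec:
  version: "0.1.1"              # version of the application as a whole
  workloads:                    # each entry maps to one Deployment/StatefulSet
    - name: podtato-head-entry
      version: "0.1.0"
    - name: podtato-head-left-arm
      version: "0.1.0"
```

Bumping either a workload version or the app version gives the controller a new unit to run pre- and post-deployment checks against, which is what makes the "whole application deployed and healthy" view possible.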
Next question: how does your OTel integration actually work? Are you maintaining a Prometheus server or something? Am I sending all telemetry from all services to it, or are you hooking into just the particular application? How are you doing that? We send all of the traces and metrics we collect from the Lifecycle Controller to an OpenTelemetry Collector, and that passes them on. So we are not instrumenting the applications themselves; there are many tools which can do that much better than we can in the Lifecycle Controller, at least. So we only collect the deployment-related data from the Lifecycle Controller. Exactly. We decided to export to the OpenTelemetry Collector, so we are not using the vendor exporters themselves, because we found that this would be easier for vendors to integrate. What we do is create our own traces, because we execute these pre-deployment and post-deployment checks; this becomes a trace for the app that gets deployed by, let's say, Argo or whatever tool you use. Whatever you decide there about how to collect the data, maybe feed it into Lightstep, maybe feed it into Dynatrace, there you have the ability to query it. But what we emit is really just an OpenTelemetry trace of the actions the Lifecycle Controller takes executing the pre-deployment tasks, and that is exported via the Collector. I think you are talking about the evaluations, right? How we can get data from an observability solution? Now I understand the question a bit better: it's more about getting the SLOs from the observability solution, and that is not based on OpenTelemetry at the moment. Sorry? It is PromQL-based, and in Keptn 1.0 we had a concept of SLI providers: we have providers for Prometheus, Datadog, Dynatrace, Sumo Logic, and Splunk.
Now, with the new Lifecycle Controller, as I think was shown earlier, we currently support PromQL queries, and we also want to get on par with what we had there: the concept where you can plug in different providers, where you specify in a CRD what type of query you want to execute against which target environment, and that query is then executed by a data source provider. I'm sure there will be some standard implementations for those. If we decided to go with OpenSLO, and they already have a provider, then that should work. At the moment, yes. I think for us, in the future, this will simply be an observability provider that we can query data from. So yes, we haven't written it yet, but it might be possible in the future. If you have requests such as this, or want to discuss these things in more detail, let's stay in touch and talk about it. And I would say, if there are more questions: we need to clear the room for the next session, because we are already out of time. Folks, there are stickers, there are t-shirts and hats, and we will be at the booth for the next hour at least.