Let's get into the big topic of open source, something that we actually didn't realize is so awesome. We are an open culture, and that is actually a process that, let's say, the Kubernetes ecosystem really brings. And we are live. Welcome, everyone, to another exciting episode of GitOps Guide to the Galaxy. It's a very special episode today. Very exciting, very happy to introduce our co-captain. We finally get a co-captain. So Hilary, wait, you are on this side of me. Hilary, I'll let you introduce yourself. Tell everyone who you are, what you do at Red Hat, background, all that fun stuff. I'm pretty sure we'll talk about your shelf as well. Oh yeah, my shelf, right? Everybody likes my shelf. So, hey guys, gals and undecided. My name is Hilary Lipsick. I am a principal site reliability engineer and global team lead for one of the managed services teams running on OpenShift Dedicated. So if you've heard of our offerings of Managed Kafka or Red Hat OpenShift API Management, those types of things, that is the team that I am lead of. Let me see, I've been at Red Hat for about, I don't know, a year and a half now, ish. And actually, prior to being an SRE here, I was a quality engineer for 10, 11 years. I like to say that I dedicate my life to writing software that impersonates people. Basically, that's what I do. I think it's a very apt description. More or less. Yeah, so that's really it. In my time outside of Red Hat, and it's still technically part of Red Hat, you can find me at the WIT network, where I run a community at that nonprofit for women in software engineering. And we're actually kicking off this month, we're kicking off our Coding Coach Coffee Hour, which is an effort designed to help particularly women, but anybody, do well on code tests, get past the code test, and just in general upskill as engineers. Yeah, I actually just dropped the link into the chat.
So for those of you who aren't familiar with the WIT network, it's all the cool things that Hilary just described, probably in more detail in the link than we could cover in the hour. So please, please check that out. It's really, really cool. And I'm very excited to have someone on who actually does things, right? A lot of us at Red Hat, a lot of the people we bring on, you know, solution architects, the people here in the BU, we work a lot on theory. So it was really, really exciting to bring an actual SRE onto the show to be part of the permanent hosting here. So we finally have a co-captain here at GitOps Guide to the Galaxy. You know, I've worked with Hilary in the past. She's more technical than I am, if that's even possible now, being in the BU; they say that's where all engineers go to die, I guess. So I'm very excited to have someone who does this day to day. So welcome, welcome, Hilary. If you guys have any questions, again, feel free to drop them in the chat. We'll be happy to talk about them here. So I'm excited for our guests today. Our guests today, you know, Hilary, we kind of decided, well, when should we bring you on? We actually made a decision a while ago about who the permanent co-host is. And Hilary was like, I want to jump on this show, because we're talking about SLOs and SLIs. And that's like your jam, right? That's what you love doing. And so, yeah, our guests today, they've been kind of hanging out in the green room. So we'll bring them on, if Stephanie, the producer, can bring them on. They're from Dynatrace, right? So, you know, talking about Keptn. I used to pronounce it "ket-ten," right?
Trying to pronounce it the German way, but, you know, Andreas told me it's actually pronounced "captain." Talking about Keptn. So Andreas, or Andy, I guess, and Oleg, I'll let you guys introduce yourselves. Who you are, what you're working on, what you do, all that cool stuff. Yeah. Oleg, should I go first? Do you want to go first? Go ahead. Okay. So, well, thanks first of all for having us on the show. Thanks, Hilary, for the great work that you do in your WIT network. And yeah, what do I do? I feel a little connected with you, Hilary, now knowing that you have worked in quality for about 10 years. I think my screen just went black and now it came back. Hopefully I'm still on. Yes, we can see you. Perfect. Because, Hilary, my background is also in software quality. I started my career in quality, because who needs software that doesn't work well? Quality is very important. And the fun thing is, I started my career as a software tester on a testing product. So I did performance testing on a performance testing tool. Back then it was called Silk Performer. When I started, the company was called Segue, then it got acquired by Borland and then became Micro Focus. And I think now it's part of, I don't know. But I was a performance tester on a testing tool. I did that for eight years. And then 14 years ago, almost to the day, I started with Dynatrace. And there, you know, now I call myself a DevRel or an evangelist or whatever you want to call me, and I try to help people understand what we can do with the data that observability gives us, especially as we use it to drive automation around delivery. So making decisions on whether the quality is right based on metrics, to push stuff from dev all the way into prod, and then using data to make better informed and automated decisions to bring systems back to a healthy state, to keep systems reliable. That's kind of what I do.
It's like, what haven't you done? No, it's just, I think it's a very small thing, but I've done this consistently for a long, long time. I don't know a lot of things that I should know, but I'm happy that I know what I do know. Awesome. Oleg, go ahead and give your quick introduction. So, happy to join the test guild. Actually, my background also started from quality assurance, but quality assurance in hardware and embedded systems. So the first five years of my career were rather about testing devices. Then I moved more into hardware development, mostly digital hardware. I have a PhD in hardware design, for what it's worth. But when you do quality assurance, you need automation, hence you need tools. So at some point I started using Hudson and Jenkins. And at some point, if you fix something in CI, you become a CI guy. So this is exactly what happened to me. So I started maintaining Hudson, then Jenkins. I started contributing to open source and somehow ended up on the Jenkins core team. So if you've used Jenkins in the last decade, you've probably seen my name in Jira or on GitHub. And starting from 2019, I started following Keptn, because it's a really nice project. It has been exciting to follow it. And this summer I started contributing a bit. And finally I joined Dynatrace to help with building a community around Keptn. So now my job is rather community management and whatever is related. I call myself one of the captains of this ship. And yeah, this is a part of my job. Awesome. You know what's really, really cool is that, talking about QA, we now have QA from front-end engineering all the way to embedded systems. So those of you watching, if you have questions about testing almost anything, I think this is the stream to ask them in. So yeah, speaking of Keptn.
So what's interesting is that when I first heard of Keptn and thought, I want to bring someone on to talk about Keptn, and then I jumped on a call with Andreas and Oleg, it was completely different from what I thought it was. Right? And so they kind of gave me a good education and good background on it. So I think maybe we should just start there. Let's talk about what Keptn is and start the exploring process there. So I'll hand it off to you, Andy and Oleg, to talk about, first of all, what is Keptn? Yeah. Is it okay if I quickly share my screen? Let's do it. Yeah. Okay. Let me figure this out. Screen number two. I'll move this out of the way because, you know, you know what I look like. Yes. I'm just using one or two slides. I'm very fortunate to be able to travel again. I mean, keeping fingers crossed that on Sunday I get on the plane to North Carolina, because I will be speaking at DevOpsDays Raleigh next week. And I want to just use maybe the intro slides, because there are three things we might talk about. What is Keptn? Why we built Keptn — I think that's also an important piece, because why yet another tool? Maybe you think there's already a solution out there. And then how Keptn works. I really just want to get the first question addressed: what is Keptn? And with a segue also into the whole topic of SLIs and SLOs. So you're the first ones to actually see this deck. Oh, we're getting a preview. Awesome. Exactly. But don't tell anybody. I'm not going to tell anybody. So what is Keptn? In a nutshell, Keptn enables you to do, on the one side, multi-stage delivery. Keptn itself is an event-driven orchestrator of what we call sequences of tasks.
The first use case is for, we call them, I think, the DevOps engineers who are responsible for pushing artifacts through a pipeline, multi-stage, where Keptn can orchestrate tasks like deploy, test, validate and then release into the next stage. But Keptn is not just yet another tool that can push stuff through a delivery pipeline, because the second use case is around orchestrating day-one and day-two operations. And the way I define it, this is everything to make sure that you can successfully release your canaries. So making sure that the canaries get scaled out if it's good, or scaled back. But also, when there's a problem in production, it can automate the orchestration of, let's say, fixing a problem. So it's assessing the situation, then triggering the right actions based on the problem that has been identified, and always validating whether the action brought the system back to a healthy state. Hopefully it did. It can go through multiple iterations, and eventually, if it solves it, good; if not, it can escalate. Now, the key magic piece here, and this is, I think, one of the interesting topics we want to talk about today, is that embedded and kind of baked in at the core of every sequence it orchestrates, Keptn makes data-driven decisions based on SLOs, service level objectives. And we'll go into more detail on how we treat SLOs and how we define them, because I think there are a lot of people who talk about SLOs and everybody has a slightly different understanding of them. So we'll show you how we do it. The most important thing is that the data obviously comes from your observability platform of choice. As for the first integrations we built: we are a CNCF project, so Prometheus was the first integration. Then we built a Dynatrace integration, because most of us contributors are from Dynatrace.
We already have a Datadog integration that one of our community members built, and there are more integrations out there that people have asked for. Now, on top of this are all of the tools that actually do the individual jobs, because Keptn itself doesn't deploy, it doesn't test, it doesn't roll back, it doesn't scale, it doesn't notify. It really orchestrates all of these tools, and it does it through an open event standard. What that means is we're really taking away the pain of having to write lengthy automation scripts that call the tool for deploy, then validate whether the deployment happened successfully, then call test, validate whether the test was successful, then reach out to your observability platform and figure out your SLO status, and then do the next thing. So we take care of the whole orchestration. We're using an open event standard that Oleg especially is driving through the CDF, the Continuous Delivery Foundation. So we standardize the way we communicate with these tools, and at the core, after every task, Keptn makes an informed, data-driven decision to figure out in which direction the process should move. So this is kind of Keptn in one slide. But I think this is important for understanding what the use cases are. And before I open it up, one key point, and I think this will help us drive the conversation around SLOs. What really happens is, every time Keptn has this evaluation stage, there's a core component in Keptn that we call the Lighthouse service. The Lighthouse service is driven by your SLOs. So you can define SLOs in YAML that you store in the config repo that Keptn holds for you. I mean, for every Keptn automation project there's a Git repo. So you put in your SLO, where you specify what the SLIs are, what the metrics are. You can see here: error rate, JVM memory, number of database calls, and what your objectives are.
But Keptn itself doesn't do anything with this yet, because the first thing it does, through the same eventing mechanism, is reach out to your observability tools and ask them to query the individual values. Because in the SLO, we have a separation of concerns here. The SLO just says error rate. How that error rate is captured depends, obviously, on the observability platform. So for Prometheus, you would specify an SLI YAML with your PromQL; for Dynatrace, it would be our Dynatrace metrics expression language. And so that means that the, we call it the SLI provider, reaches out to the observability platform, pulls in the data, pushes it back to the Lighthouse, and then the Lighthouse compares the values that come back against the criteria. We then evaluate every single metric, every single SLO. And on the very top, we calculate a total score between 0 and 100. And this is then the deciding factor, green, yellow or red, that decides how the automation process goes on. So this was kind of all of the slides I wanted to show, but hopefully this gave a little bit of an overview of what Keptn is and why these SLIs and SLOs are so core to Keptn. Yeah, I think, and anyone who, if you have a question, go ahead, feel free to drop it in here. I think out of the three of us, I'm the most ignorant one, because my background is mainly in operations, right? And so, you know, the importance of data-driven deployments, right, I think is what Keptn is trying to focus on. And actually, we just got a question here from Rockhound. So Rockhound, welcome. "A CI/CD wrapper of all other solutions?", question mark, with a test dependency at the end. Is that, is that what? I guess I did a crappy job of explaining what Keptn is, because if the first thing that comes to mind is that it's yet another CI/CD tool wrapper, then I really may have missed the point. Keptn is not another CI/CD tool.
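To make that separation of concerns concrete, here is a sketch of the two files involved. The file shapes follow Keptn's SLO/SLI spec, but the metric names, thresholds, and PromQL queries below are illustrative, not taken from the episode's demo.

```yaml
# slo.yaml -- declares WHAT to evaluate (tool-agnostic objectives).
spec_version: "1.0"
comparison:
  compare_with: "single_result"
  number_of_comparison_results: 1
objectives:
  - sli: error_rate
    pass:
      - criteria:
          - "<=1"       # pass if error rate is at most 1%
  - sli: response_time_p95
    pass:
      - criteria:
          - "<=+10%"    # relative: no more than 10% above the baseline
          - "<600"      # absolute: under 600 ms
total_score:
  pass: "90%"
  warning: "75%"
---
# sli.yaml -- declares HOW each SLI is fetched; this one assumes the
# Prometheus SLI provider, so the queries are PromQL.
spec_version: "1.0"
indicators:
  error_rate: "sum(rate(http_requests_total{status=~'5..'}[3m])) / sum(rate(http_requests_total[3m])) * 100"
  response_time_p95: "histogram_quantile(0.95, sum(rate(http_response_time_seconds_bucket[3m])) by (le)) * 1000"
```

Swapping the observability platform means swapping only the second file; the objectives stay untouched.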
What Keptn is, it is helping you orchestrate the tools along what we call a sequence, an automation sequence. And yes, obviously, we always bring the example of pushing artifacts through a delivery process; that's why, you know, I have deploy, test and validate here. But the other big use case is handling problems in production. So if, say, Prometheus raises an alert that something is wrong, then you can also orchestrate all the remediation tasks, where you can say: I first want to notify my SRE team that there's a situation going on. Then, in the same step, look up the list of sequences that should be orchestrated, kind of like your runbook, based on the problem that comes in, and then Keptn triggers your individual tools. And now the thing is, I really have a hard time with the word wrapper. Because what we are doing here, it's an event-driven system. That means there's no hard-coded tool integration that you build, where you say: in my particular project, I want to call Jenkins, or I want to call Helm to deploy, or I want to make a notification call to Slack. It's all through event subscriptions, and maybe that's actually another good point. I would like to think of it as, and this may be a really simplistic view, from my point of view, I really like to think about it as almost like an event bus, right? So there are certain events that happen, right? And you want certain things to happen based on those events. It just so happens it's sitting, you know, I guess, at the heart of your deployment. I think the powerful thing here is that there's essentially a rules engine, right? So it's not just that there is the messaging and the events going through, which is super important, but you also have a rules engine. So that is the power, in my opinion, here.
I think that's the power, and the other power is that it's simply loosely coupled. Like, what did we do with all the monoliths that we broke into smaller pieces? In the end, you know, we stitched them together through an event bus that also has a rules engine. You typically have, like, your business process management, where you say: I know exactly what needs to happen, but you can always replace individual pieces, because you just subscribe another tool to handle a certain piece of activity. And that's the same thing we do here, right? If you look at it, I have a multi-stage sequence here: it starts in staging and then goes to production. And what you see here, I call it delivery here, this is the delivery sequence, and I have multiple tasks. Monaco stands for monitoring as code. This is where I basically say: before I do anything, make sure that my monitoring system is configured. Now, in my case, maybe for my project, I use Datadog. So I have Datadog subscribed to the Monaco event, or I have another tool, let's say Dynatrace, that subscribes to it and then does the configuration, or it's Prometheus. Right? I mean, basically, through event subscription, you can say which tool should handle which type of task for which particular project and which particular stage. And the events that are sent here, you can see the little icon here, I'm just clicking on the deployment, these are the events that we're standardizing. So this is not just yet another proprietary, let's say, protocol that we're introducing here; we're really working with the community to standardize this. Because the challenge we have seen is that every tool out there, whether it's deployment, test, observability tools, or notification, everyone has a proprietary API. This is also why it is so hard to build and maintain integrations between all the different point tools. And we have introduced, or are introducing, here a standard way for these tools to communicate. And take away the pain.
Yeah, it's like, you know, for me, my world is Argo, right? So if you have Argo, and then you have information coming from Dynatrace, since you guys are on, right, like I have Dynatrace feeding those metrics to me, how do I make Argo speak Dynatrace? And instead of writing that integration, there's an event bus through which I can basically loosely couple them. You know, my deployment mechanism is, I guess, decoupled from the information, the metrics, that I'm receiving. You're totally right. So there is a standard technology called CloudEvents, which is exactly designed for delivering various kinds of events. So what you see here, Keptn and many other tools are actually built on top of it as producers or receivers of events. There are standards like CDEvents, which was referenced before by Andy, which actually extend this CloudEvents specification and specialize it for particular areas. But ultimately, it's a universal way of data interaction, which is extensible, and if tools know how to speak to each other, it unlocks a lot of possibilities for your systems. Because just as we build modern applications as microservices, the same goes for delivery and operations services: build them as microservices, loosely connected with each other through events. And this communication layer actually enables us to build that efficiently and to use many building blocks, like Keptn, to assemble the operations system you need for your applications. Yeah, I wonder. So, you're saying you can also use this information for rollback. Yeah, so, like, if, you know, I guess you're monitoring your system, and the information comes in that this latest deployment broke something, for whatever reason, and this will make Hilary cringe, it passed QA, for whatever reason, but something in production is breaking.
You know, you can have the same, you know, rule set, I guess. Hilary, I think the best way to say it is you have this rules engine that can then roll it back to a previous version. Exactly. So the way, maybe, and I also see some comments like: is this just another "if this, then do that" kind of tool for cloud ops? I guess I still need to do a better job of explaining that it's not just if-then-else. So maybe let's have a quick look behind the scenes so people understand what's behind every automation project. Everything starts with what we call a shipyard file. The shipyard is the definition of the process. What you see here is, we are grouping, or you can define, so-called automation sequences. Like, I have my delivery sequence for my staging environment, and here you specify your tasks: Monaco, monitoring as code, deployment, evaluation. You can have different tasks in a stage, like here is a delivery sequence that also does an additional test, or here is just a delivery with tests. And what we also have here is, in production, I have another delivery process where I say I want to trigger that delivery in case staging finished successfully, or staging delivery-with-test finished successfully. Now I can also go down here, coming to the rollback. What you're basically doing is, you are specifying sequences and you connect them together, so a little bit like if-then-else, of course, from a sequence perspective. However, the magic, as I see it, is that you don't see any tool. I don't care here, in the definition of the process, who is doing the rollback, because maybe whoever defines that process doesn't know what the target tool is in my environment. And this is where the event subscription comes in. So if I go back to my Keptn UI, we call this the Bridge, and if I go here on the left side to our settings, then we have what we call the uniform. The uniform is kind of like all of the tools that Keptn has to steer the ship.
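A minimal sketch of the kind of shipyard file being described, assuming Keptn's shipyard 0.2 schema; the stage and sequence names below are invented, not the ones from the demo. The point to notice is that no tool names appear anywhere, only tasks:

```yaml
# shipyard.yaml -- process definition only; which tool handles each
# task is decided separately, via event subscriptions in the uniform.
apiVersion: "spec.keptn.sh/0.2.0"
kind: "Shipyard"
metadata:
  name: "shipyard-demo"
spec:
  stages:
    - name: "staging"
      sequences:
        - name: "delivery"
          tasks:
            - name: "deployment"
            - name: "test"
            - name: "evaluation"   # the Lighthouse SLO check
            - name: "release"
        - name: "rollback"
          # runs only when this stage's delivery finishes with result=fail
          triggeredOn:
            - event: "staging.delivery.finished"
              selector:
                match:
                  result: "fail"
          tasks:
            - name: "rollback"
    - name: "production"
      sequences:
        - name: "delivery"
          # chained: starts when staging's delivery finished
          triggeredOn:
            - event: "staging.delivery.finished"
          tasks:
            - name: "deployment"
            - name: "evaluation"
```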
And this is exactly where all this event subscription comes in. So, for instance, I have my Helm service here. I could also have my Argo service, my JMeter service, my Jenkins service. And basically I say: hey, I am now subscribing Helm, in my case, to certain events for a certain project, stage and service. That means I can control, through the event subscription, to which tool this event should actually be forwarded and then handled. And this allows me, and I think this is the nice decoupling between process and tooling, to easily switch tools, as long as these tools obviously understand the event. And I can also subscribe multiple tools to the same event. Now, why would I want to do this? Well, because maybe I want to push notifications to a Slack channel every time a deployment failed or a rollback is triggered, but the rollback itself is obviously done by another tool. And so I can have multiple subscriptions to an event, and I can easily switch tools, because in the end, I just need to make sure that I have tools to fulfill the process. And hopefully this explains it a little better. And you tell me if it doesn't. Yeah, well, I think you're doing a great job. Also, Oleg has been feeding me links. So for those following in chat, I've been posting some of these links so you can find out more of the information, more about these definitions here. So Oleg, he was going to sign up for a Twitch account, but you know, I'm kind of covering for him. Maybe next time you'll come with your Twitch account. You just don't want to see my Twitch account. Yeah, yeah, so there you go. And maybe we'll rate it afterwards, we don't know. Yeah, I mentioned the uniform. Actually, it's a fun fact about Keptn that a lot of our special terminology comes from nautical terms. From shipyard to Keptn itself to uniform. This web UI you can see on the screen, it's called the Keptn Bridge.
And there are a lot of terms like that. So that is internal lore in the Keptn community. I love that. I love that. I was actually wondering if you were going to mention sending notifications to Slack channels and so forth, because ChatOps was going to be one of my questions. And for those not in the know, ChatOps is sort of the idea that when something happens, you get a chat message that you can interact with, via Slack. And the power is so much so that even from, like, my phone's Slack application, I can see that ChatOps message come in and do things like run a playbook based on the alert, and that can sometimes mean multiple options depending on, you know, various pieces of data. So that was going to be my question. Exactly. So that's basically, you know, I'm pretty sure you can actually find Keptn's, like, webhook. So. Perfect. Thank you so much. I mean, whatever favorite search engine you have, it's typically pretty easy to find all of the different integrations that we have and how we work. There's also a lot, like Jenkins is another example, obviously Oleg has been very active in the Jenkins community, so he knows it there. We know how we can subscribe Jenkins to Keptn and send the information over. But essentially, and I am not sure if maybe it was a little too fast, but in Keptn, in the integrations in the uniform, one of the things we can do is we have a webhook service where you can select for which task you want to call an external webhook. And every task has three phases. Triggered is when Keptn says: I need this task to be executed, who wants to do it? And then the tool that is actively doing it will send back when it's starting the job and when it's finished. But I can also subscribe a notification tool, so that if, let's say, the evaluation finished.
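The triggered/started/finished phases described here correspond to CloudEvent types on the bus, with suffixes `.triggered`, `.started`, and `.finished` on each task name. Below is a sketch, in YAML form, of the kind of finished event a deployment tool might send back; the field values are invented, and the exact payload shape can vary by Keptn version:

```yaml
# A "finished" CloudEvent for the deployment task (illustrative values).
specversion: "1.0"
type: "sh.keptn.event.deployment.finished"   # triggered -> started -> finished
source: "helm-service"                       # the tool that did the work
data:
  project: "delivery-demo"
  stage: "staging"
  service: "my-service"
  status: "succeeded"   # did the tool itself run cleanly?
  result: "pass"        # did the task achieve its goal?
```

Because every tool reports back in this same envelope, the orchestrator can advance the sequence without knowing anything tool-specific.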
Then I want to make a call to the Slack API. And obviously we have secrets here, right? We can configure secrets, like your token, which you will obviously put into a secret. Here you can define your custom payload, and then you can pick any of the data elements that are in these cloud events we've shown you earlier. That means you can obviously reference, right, the dashboard link, or you can reference what was used for the SLO evaluation. You can also reference the evaluation result, the Git commit, the indicator result, the total score, and then you just push this over to whatever tool you want. And so I want to touch on something else that you mentioned. I have so many things I want to touch on here, I think this is such a neat thing. Yeah, this is your favorite subject. Right. This is amazing. Yeah. I feel very much among kindred spirits right now, especially with all the shared QE background; that's amazing. But anyhow, there's something you mentioned, and I wanted to ask a little bit more about it. So, to my mind, as I'm evaluating this, it looks like this technology is really, I guess I'll use the term, nonlinear, right? You don't have to just have one set of things happening; you can actually start spreading out through a branch of things, which is neat. And I'm wondering if you have any examples of that, if you have any strategies for doing that in a maintainable way, because the problem with making things spread out too far is, of course, losing track of them. Yeah. So from a spreading-out perspective, what we can do, and I go back to my, actually, let me give you another example here. The demo rollout example, that's using Argo Rollouts here for canary deployments. What you see here, basically, is not a spread-out but kind of like a sequence of sequences.
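As a rough sketch of the webhook configuration being described: this assumes the Keptn webhook service's YAML format, with the Slack URL kept in a secret and the payload templating over fields of the incoming CloudEvent. The field names here are from memory of one Keptn release, and the Slack payload is invented, so treat the whole shape as an assumption and check the docs for your version:

```yaml
# webhook.yaml -- sketch only; schema and field names may differ by version.
apiVersion: webhookconfig.keptn.sh/v1alpha1
kind: WebhookConfig
metadata:
  name: webhook-configuration
spec:
  webhooks:
    - type: "sh.keptn.event.evaluation.finished"
      subscriptionID: "..."        # created when you subscribe in the Bridge
      envFrom:
        - name: "slack_hook"
          secretRef:
            name: "slack-secret"   # the token/URL lives in a secret, not Git
            key: "url"
      requests:
        - >-
          curl -X POST {{.env.slack_hook}}
          -d '{"text": "Evaluation for {{.data.service}} in {{.data.stage}}:
          {{.data.evaluation.result}} (score {{.data.evaluation.score}})"}'
```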
So I start in staging and I am delivering a new version, just a regular blue-green deployment, using Helm and Argo Rollouts to actually push it out. When everything has passed at the end, so tests have been executed, I think in this case I'm using Locust, and you can see here another thing, right, where the testing tool kicks off and runs the tests, and we do the evaluation. Then I'm going over, when everything is green, to production. And here what I've specified is a delivery sequence that is then followed by canary rollout phase one, phase two and phase three. And this is, if you remember earlier, and let me just show you the shipyard file behind the scenes, because behind every project there is a Git repository. In this case it's the delivery demo, and here's my shipyard file again. So the shipyard file, remember, that's the initial definition, so you can specify, as you mentioned earlier, the sequences, and then the way you connect them. And kind of that spreading-out is through the triggeredOn. The triggeredOn means: if one sequence is done, then you can trigger another sequence. It's not that you can then branch out into multiple areas. It's always one sequence, and when it completes, either with a successful status or a failed status, you can then say: are we done, or are we kicking off another sequence, depending on the finished result of the previous one. So this is how we do it. What you mean is completely branching off. I mean, it's also possible, but to the extent where you would say one sequence is done and you're kicking off 50 others, I think this is not where we've gone so far. And as you said it, I was also wondering. It's doable, right, because we would then show it here in our UI; we'd probably need to figure out how to better visualize it if it branches out completely.
But this is not the way we have seen it used so far. No, that makes sense. And I think that's fantastic. You can always count on me to figure out some way to break your tool, right? I will absolutely use it wrong. That's the SRE life, right? Always use your tool wrong. This is what it is. Yeah, I frequently say that SRE is a type of quality department, right? We're driving continuous improvement through the observation of the production realities. Although I usually, I think, say it better than that, but this is what we get today. Right, so I wanted to ask you, because it was kind of a quick thing, and if there are other questions, I'll cede to the actual audience, who we're here to entertain, instead of myself. But I wanted to go back to the SLO components and some of those definitions again. Like I said, it went by really quickly earlier, and I was like: oh, I want to spend more time there. Yeah. So what you see here is, in the end, the visualization that we have. So this means every time I run a sequence and I have an evaluation step in there, Keptn will reach out to the monitoring tool that has subscribed, with: give me your SLI values. The values will then be brought back, and then every value, every SLI, will be evaluated against the SLO, the objective, and then a total score will be calculated. So this is just the visualization of one of these. You can scroll down here, and you can actually see the individual values that came back. One of the cool things that I like is that we can not only compare or specify an objective based on an absolute value. So you can say, I don't know, host disk queue length should obviously be, you know, smaller than or equal to zero, because otherwise you have a problem. But you can also combine it. Like, here's a metric called service throughput per minute. You can say: I want the throughput to be smaller than 200.
But I also want to combine it with the condition that it should not increase by more than 10% over the baseline, and the baseline is then calculated based on previous results. So you can combine absolute objectives with what I call relative objectives; I'm not sure if there's a better English term for that, but this is what it is. Now, how can you define all of this? There are two ways. Remember, what is behind every project? Come on, it's a question for you: what's behind every automation project? Oh, I don't know, I've got a long list; I'm not sure what leading question you're trying to ask. A Git repository! So all of the actual configuration, because so far we only looked at the shipyard, all of the configuration really lives in the Git repository. That means I have my shipyard here, but then, depending on how many stages you specify, per stage we allow you to put in all of your configuration files: your Helm charts, if you want to use Helm for deployment; I had Locust earlier, so I would put in my Locust test script here; or my SLOs would also be here. So, for instance, this is one of the samples that are run. I have my service called TNT APL service; let me just quickly go back here. Right, in this project I'm currently in my service called TNT APL, so the way we're structured is project, stages, and then your services where you can execute automation. That's why in my Git repository, in the production stage, I have a subfolder called TNT APL SVC, and here are all the relevant config files if I want to run automation for this particular service in this stage in this project. So, for instance, here is my SLO file, and in the SLO file you have exactly the configuration that you've seen earlier in the UI: YAML where you specify an SLI with a unique name, SVCRTP95. You can also give it a display name so that it looks nicer in the UI instead of just the acronym.
So you can say service response time, 95th percentile. But here's the thing: you can specify pass and warning criteria. We also allow you to change the weight. By default, every SLI, and I have multiple here, contributes the same weight to the total score, but you can change it. You can say this one here should have a weight of five because it's five times more important. You can also specify that an individual SLI should be a key SLI, meaning if this one fails, the whole evaluation should fail. A good example would be: if the failure rate is higher than one percent, I don't care how fast it is; it should be considered a failed evaluation overall. So that means you can change the weight, you can mark one as key, and you can specify pass and warning criteria. Now the question is, what is behind SVCRTP95? Well, that is in a different file, because it depends. In my case I use Dynatrace; I could also use Prometheus, Datadog or New Relic, but in my case I use Dynatrace. Therefore, when my Dynatrace integration kicks in, it looks in the dynatrace folder, and in here I have my SLI YAML definition. And this now defines the actual query behind the scenes to fetch a particular piece of data. Okay, so again, I'm coming at this as just a sysadmin trying to understand. Where the payload is generated doesn't necessarily matter; you're just basically passing that payload from tool to tool to make informed decisions. Exactly, and not only that, that's what we're doing in general. The orchestration of Keptn is: we are triggering a task, and at the beginning Keptn doesn't care who is doing the task. We just know here's a task that should be carried out, and whichever tool is subscribed to do this task, here's all the information you need to know. On the one side there's information, and if I go, let's say, again to the deployment.
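As a concrete sketch of the slo.yaml and sli.yaml pair just described: the metric names, thresholds, and queries below are illustrative assumptions, not values taken from the demo, and should be checked against your monitoring integration's documentation.

```yaml
# slo.yaml (per stage, per service) -- illustrative values
spec_version: "1.0"
comparison:
  compare_with: "several_results"      # baseline computed from previous runs
  number_of_comparison_results: 3
  aggregate_function: "avg"
objectives:
  - sli: "response_time_p95"
    displayName: "Service response time 95th percentile"
    weight: 5                          # five times more important than the default
    key_sli: false
    pass:
      - criteria:
          - "<=+10%"                   # relative: at most 10% over the baseline
          - "<600"                     # absolute: under 600 ms
    warning:
      - criteria:
          - "<=800"
  - sli: "error_rate"
    key_sli: true                      # if this one fails, the whole evaluation fails
    pass:
      - criteria:
          - "<1"                       # failure rate below one percent
total_score:
  pass: "90%"
  warning: "75%"
---
# sli.yaml (in the dynatrace folder) -- maps SLI names to the actual queries
spec_version: "1.0"
indicators:
  response_time_p95: "metricSelector=builtin:service.response.time:percentile(95)"
  error_rate: "metricSelector=builtin:service.errors.total.rate"
```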
There's information in the deployment event that says: hey, I need a deployment triggered. So the task is deployment and I want to trigger it. This is part of the project delivery demo, for this particular service, for this particular stage. And by the way, I have some additional information for you: this is the artifact that should be deployed. So, whoever can do a deployment, this is all the information I give you. In my case, the Helm service is subscribed. So Helm says, hey, I want to do it, and reports that it's starting the task so Keptn can keep track of what's going on right now. Additionally, every service we integrate here also knows that behind every project there's a Git repository, and in that Git repository there may be additional config files for this particular tool. So I'm going to the delivery demo project now. If I go to staging, then again I have my sample services, like the TNT APL service, and now I see I have my helm subdirectory. So this is where all of my Helm charts live, and that means this particular tool can get its configuration files, for instance, from here. So it's a separation of process, tooling, and the configuration a certain tool needs, all fully version-controlled in a Git repository that is organized the way you run your automation. I love this so much. I like that you're using Gitea as well, that's pretty cool. Yeah, I just use Gitea for my demos because obviously you can specify any upstream Git. Yeah, a mirror. Yeah, and we call this Keptn-in-a-box; we have some tutorials where you just run an install. Right now, everything here actually runs, in my case, on my K3s. So Keptn itself is a container-based, event-driven solution and it runs on Kubernetes. So when you install Keptn, you initially install what we call the control plane.
So that's kind of the core of Keptn, the orchestration engine. For instance, what do we have here? We have the shipyard controller, which is responsible for controlling the workflow specified in your shipyard. We have certain out-of-the-box services like the lighthouse service; remember, the lighthouse is the one that evaluates all the results. There's the webhook service, the secret service; I mean, you see all of this stuff, right? And in my case everything is on one machine, so I've also installed some specific integrations: the Locust service to execute Locust tests, the Dynatrace service, the Helm service. We have a job executor where you can just let Keptn deploy a container as a Kubernetes Job to do a certain task and then wait until the job is done, similar to a GitHub Action. So this is all there. But the nice thing for me is, I don't need to care about how to call a particular tool and wait until it comes back with a result; we've automated all of that for you. The only thing you need to do is specify the process and subscribe your tools to those events, and that's it. So we got an interesting question here from rockhound: can I define with RBAC which steps are available for which repo commit user, for example, or how do I handle that? I think, on Keptn RBAC, you all probably know even more than I do about the status. Yes. So there is an enhancement proposal for Keptn, KEP 61; let me share my screen. Basically it addresses user identification and authorization. Keptn started as a developer tool where the basic approach was just to isolate it through the network, and inside, once you got access, you had full access, which in principle is how InfluxDB and many other projects work. Now, as Keptn grows, there is of course interest in having a user model inside, and the current state is that we have user identification, through OpenID Connect and so on.
You can identify users, but the permission model is not fully integrated yet. What you have now is RBAC for services. So, for example, you have a service which needs specific permissions, like the Dynatrace integration service or, let's say, the JMeter integration service for performance tests, and these services can run with special permissions. So on the infrastructure level you have some model and some isolation, but not on the user level. It's coming soon; I would expect a preview in three months or so, but then it will take some time before this whole permission model stabilizes and all the gaps are closed, because it's a long story. So maybe six to nine months and we will have user authorization support. Awesome. I actually put that in the chat, so for those of you following along, I've been throwing the links in there as they come. Sorry to cut you off, Hillary, but there was a question that came up and I didn't want to miss it. That's exactly what I was going to bring up, that question. Oh, okay. All right, gotcha. I've been doing this for longer, so I'm on autopilot. One more; I think there's one that asks: don't you have to write the integrations yourself, integration programming? Well, rockhound also said the great thing is that it's all an event-driven product, so it doesn't care; you can do anything as an event. Exactly. The only thing, obviously, and this is the reason we have, let's say, a Dynatrace service here or a Locust service: we are sending events, you can subscribe to events, you will receive those events, but certain tools might not have that capability natively, right? So we give you a couple of options.
The first option is: if the tool you want to integrate has an API that we can call via a webhook, then you can use our built-in webhook service, just as I showed earlier. You can subscribe to an event and say: hey, whenever a test is triggered in, let's say, my staging stage, I want to make a call to my testing tool at this API, and it reports back when it's done. So that's one option, for a tool that doesn't natively support subscribing to the CloudEvents that Keptn understands. The second option is the so-called job executor service, the one I mentioned earlier. I was going to ask about that. Is that like a Kubernetes Job? Exactly. Okay. Cool. Sorry, because I was going to say that would be my catch-all: if I can just create a Kubernetes Job, I can wire in anything. That's exactly it. That's the silver bullet of automation. Keptn itself is an orchestration engine, but if you need that kind of automation, you have an option for it. Yeah. And here, again through a config file you put in the Git repository (remember, behind every project there's a Git repository), you put in a job config YAML file and say: if this event comes in, then please run this particular image as a Job. And there are obviously more options here; you can pass in parameters and so on. So that's the second option: the first one is the webhook, the second is this. And the third option is, you still have the option to write and develop, if you want to, a custom, what we call, Keptn service. These Keptn services are basically just containers that you deploy, like I have here; the deployment is a pod. I actually have a Locust service here.
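The job config file from the second option might look roughly like this; the image name, script path, and event name are assumptions for illustration, not taken from the demo:

```yaml
# job executor config, stored in the project's Git repository
apiVersion: v2
actions:
  - name: "Run performance tests"
    events:
      # react when Keptn emits a test-triggered event
      - name: "sh.keptn.event.test.triggered"
    tasks:
      - name: "Run Locust as a Kubernetes Job"
        files:
          - locust/basic.py              # test script pulled from the repo
        image: "locustio/locust"         # container image run as the Job
        cmd: ["locust"]
        args: ["--headless", "-f", "/keptn/locust/basic.py"]
```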
So this is nothing more than a container with a sidecar container, the so-called distributor. That's a component we deliver, and the distributor actively registers itself with Keptn, saying: hey, I'm here, I can handle the test events. Then, every time a test event comes in, it just forwards it to the container sitting in the same pod and calls a particular endpoint. So we have a couple of options; we call them Keptn services. And I believe, if I use my Google-fu, yeah, if I can Google correctly, let's go to the docs. I've mentioned a couple of times that my real skill, my actual life skill, is my ability to navigate a search engine. Exactly. So, "write a Keptn service." Here we go. This is how you could build your own. We also have templates: a Go template and, I think, a Python template. And that's exactly it. Andy is probably already on it, but maybe I'm faster; I'm posting this into the chat. There we go. Yeah, you got it; you won. I do have a question from rockhound again. Thank you, rockhound. So, trying to remember the exact context at the time they asked it: are those events queued, with AMQP for example, or do you limit the maximum number of executors at the global level? Yeah, so, internally, and this might be too much technical detail, we're using NATS. The shipyard controller, remember, is the controlling instance that is always aware of which state my current sequence is in, and it sends out the appropriate events. So, for instance, when a new sequence is triggered, like a delivery sequence, it knows, based on the shipyard file, that the first thing it needs to do is send out the corresponding triggered event, in this case for the deployment task.
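Such a triggered event is a CloudEvent. A hypothetical, abbreviated sketch, with field values that are illustrative rather than taken from the demo, might look like:

```yaml
# sketch of a sh.keptn.event.deployment.triggered CloudEvent (abbreviated)
specversion: "1.0"
type: "sh.keptn.event.deployment.triggered"
source: "shipyard-controller"
contenttype: "application/json"
data:
  project: "delivery-demo"
  stage: "staging"
  service: "example-service"            # hypothetical service name
  configurationChange:
    values:
      # the artifact to deploy, passed along for whichever tool takes the task
      image: "registry.example.com/example-service:1.2.3"
```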
So it sends that out, and everybody that subscribes to it will receive that event. Then the shipyard controller waits until it hears back from at least one service that says: I can do it. It then also waits until all of the tools that said they are actively doing something have finished with a pass or fail result, and then the shipyard controller takes over again; the shipyard controller really manages everything. In terms of limitations, I don't think there are any limits on the maximum number of executors, at least not from an architectural perspective; as long as your cluster can handle it, it's fine. One important thing to mention is that Keptn itself has a control plane, but its execution planes can actually be located on remote Kubernetes clusters. So, for example, if you need to run heavy payloads, one of the basic examples is load testing as a quality gate, which was one of the first use cases for Keptn. With JMeter, and with other services like the k6 integration, you can basically offload this heavy execution to another Kubernetes cluster, and Keptn can manage it for you as an external execution plane. So it also increases the scalability of Keptn as a system. That actually anticipates one of my questions, which was about scalability. So, knowing that it can have a single control plane executing across multiple execution planes, the real question I have is about multi-tenancy in this model. For example, Red Hat is a huge org and we have a lot of, we'll call them cohabitating, engineering teams who might not necessarily want to see everybody's pieces but would want to use the same tool. So, yeah, I was going to ask about multi-tenancy and scaling to accommodate significant numbers of teams.
Okay. Keptn initially started more as a developer tool, where the approach was that you would have multiple Keptn instances for relatively small setups. But that has changed, and the control plane has evolved significantly. So now, for example, there is ongoing work to make the control plane infinitely scalable, plus features like high availability, zero-downtime upgrades, and so on. Basically, the control plane is being redesigned to support not only scalability and high availability but also nonstop operations, even while you update Keptn itself. This is ongoing work, landing incrementally. There were some changes in the recent 0.14 release, some regressions, which have been fixed by now. But yeah, this is one of the key highlights of this year's development: having a control plane that is fully scalable and fully cloud native as a single system. So you should be able to scale Keptn much further than was possible before. And then, to add to this: if you really have multiple different teams that all want to use Keptn but don't want to, or aren't allowed to, see each other's stuff, there are two answers right now. You would have multiple Keptn instances, and you can obviously run multiple Keptn instances on one cluster; I think there was a question about OCP, and yes, you can run multiple Keptn control planes for that type of separation. And role-based access, RBAC, is on the roadmap, so team A can log in and only see their stuff, and team B only theirs; but this is something we're still working on. Yeah. Also, there are some quality-of-life improvements. You have seen the uniform UI with a lot of configuration there, but obviously everything can be configured as code, and right now there is ongoing work to have official GitOps operators.
Well, we have prototypes available for preview, but you will be able to manage Keptn fully, not just as code but in a GitOps model, so that you can share configurations between teams across multiple instances, apply templates, and also roll out upgrades incrementally, as we mentioned, zero-downtime upgrades, just by having a number of repositories for your teams. Awesome. And what's funny is I can actually read that, so I'll put this in the chat. Yeah, that was one of the Keptn tutorials we recently did: how to GitOps with Keptn. What I always tell people is that what you're doing with RBAC there gives you ultimate flexibility, right? If you want to leverage RBAC and have the traditional workflow, or do something like GitOps where all the interactions happen in Git, then you really just focus on how your Git RBAC, your groups, your permissioning, or your pull requests work. It's totally flexible. So if you're going to do it the GitOps way, just use whatever organization you already have; if you're in GitHub, use that model. Or, if you want the more traditional sense, take a look at that PR there; I think it's a standing PR for Keptn to support RBAC. Checking here to see. Yeah, someone asked, and this is a good question, probably one of the last questions here, because we're almost at time and I unfortunately have another meeting to go to.
So, someone's asking about an OpenShift Operator integration; that would be an interesting thing, having Operator integration with OpenShift. Not sure if there's a PR or an issue for that already, but that would be really cool. So, what's the Google-fu on that? Yeah, that's exactly the GitOps operator; it falls into what we just mentioned earlier. The "how to GitOps with Keptn" tutorial is all Operator-based. Yeah. Cool, cool. All right, so we're just about at time, so again, oh, look at that, Andy already had the link handy for you. So again, everyone, thank you for joining. Andy, thank you for joining; Oleg, thank you for joining. It's been great. Hillary, congratulations on your first show, your first official stream here; again, excited to have you here. So, any last words for Andy or Oleg? Lastly, the second question, about Keptn at KubeCon: yes, we are going to KubeCon, so you can meet us there. We will have a project kiosk, and we will also have a happy hour on May 17 with a special agenda, so stay tuned for announcements and subscribe to the Keptn Twitter; you won't regret it. Also on May 17 there is a CD Events day, a conference all about events in CD, and Keptn will be actively participating there. So, see you there. Yeah, I'll put that, so I put the Twitter link there for the Keptn project. Cool, cool, awesome. So I'll see you at KubeCon; I'll be there as well. You say that like you're going without me; we're going together. Oh, my co-presenter. So Hillary will also be there; all four of us will be there, so catch us there. Awesome. So, Hillary, any last words? What do you think about your first stream? Love it, hate it? Great.
This is going to be hard to top in terms of favorite streams. I say that now; that could change. But yeah, this was a great topic and I really enjoyed it. And thank you guys so much for joining us. Yeah, thank you. It was the best stream so far for you because it's the first; you never forget your first, I guess. So, again, thank you everyone, thank you for joining, and we'll see you next time. I forget what's coming up; just look for Twitter, we'll have an agenda out. So again, everyone, thank you for joining, and like I always say: unless you commit it, it's only a rumor. So goodbye, everyone. Cheers, thank you for joining. Goodbye. Thanks.