Thank you. All right. Well, first of all, it's nice to see so many smiling faces. That already tells you I'm not a comedian, otherwise I would get paid for better jokes. My name is Andy Grabner, or Andreas Grabner, but please call me Andy, it's always easier. I always say that only my mom calls me Andreas, and only if I did something rude.

I'm originally from Austria, and I actually live in Austria, so I also want to say thanks for allowing me to come here. I know a lot of international speakers couldn't make it to KubeCon; I was one of the lucky ones who got an exception to travel here.

I've been in performance engineering for the last 20 years, and I want to share a couple of things that I've learned in my life as a performance engineer, especially around application performance and distributed performance. But this is not just about how you do performance testing or performance engineering; it's really about how we automate all of this, and this is where Keptn comes into play. Keptn is a CNCF sandbox project, hopefully soon reaching incubation status. I want to share what I've learned and how I hope I can make your life easier, in case you're interested in building better-performing systems and optimizing your systems. All the links, how you can reach me, and how you can follow up on Keptn are on the slide.

Now, one thing I've learned over my last 20 years is that when I do performance analysis, I always look at performance patterns, and I want to start with this before I go into the core part of this talk, which is how we can automate performance analysis based on SLOs, service level objectives.

Why are performance patterns important? Because this is what I do when I analyze application performance: I typically look at things like distributed traces to understand the architecture, especially in distributed systems. For the people in the room, distributed traces are hopefully something we are all aware of. Who is looking at distributed traces? Yes.

So this is a small distributed trace. It starts with the front-end load balancer, then some legacy systems, some microservices, some databases, some third parties, some load balancers, and so on. This is from a company in Germany called StepStone. I did a presentation with them a couple of years ago where they talked about how they moved to a container-based environment and how their architecture changed. The reason this is important is that you have to understand your architecture, you have to understand how components talk to each other, in order to identify very important patterns that tell you whether this architecture can actually scale or not.

Now, the number one pattern that I always find, and this is nothing new, but I have to repeat it because I see it every day, is the N+1 query pattern. What you see here is a single distributed trace from one of my friends who started his own company 15 years ago, building an e-commerce store for very small shops. Over the years it scaled and scaled, and now his e-commerce shop is very popular in Europe. Unfortunately, when you look at this distributed trace, it shows this very interesting pattern.
This is a trace for the search feature of his e-commerce platform. What his initial implementation did when you search was: get the number of product categories, iterate through all the product categories, and for each one query the database by looping through a list of IDs, therefore ending up, as you can see here, with 26,000 database calls, each executing a very simple query, just to figure out which products in the product catalog match a certain criterion. This was not a problem in the beginning, when he sold his software to very small shops, but it became a problem with organizations that had thousands of articles in the catalog.

The same pattern also shows up in architectures where one service calls another. This is again a distributed trace visualization, just showing you the cascading effect of one service calling another recursively, and this goes on and on and on. This is not a good sign, and it's visually very easy to understand, when you look at this, that there's obviously something, maybe not wrong, but something you definitely need to look into.

Now, this talk is not about more performance patterns that you may or may not already know. But I think performance problem patterns are very important, because if we understand patterns, we know how we can detect them visually, and if we can detect them visually, we can also figure out how to detect them by looking at key metrics: how many database statements are called, how many calls do I make to another service, what's the payload size? These are a couple of those problem patterns, along with the metrics to look into. If you want to know more, I gave several talks over the last couple of years specifically about all these patterns, with examples; the links are on the bottom. For the people in the room, of course, take a screenshot, but the links are also in the slides, and they will be shared.

My point is: understand your patterns, and then derive your metrics from them. Which gets me to the next step: once we know the metrics that detect bad patterns, we can automate the analysis. And that automation is brought to you today by Keptn, the CNCF sandbox project I've been working on for the last couple of years. There were many different reasons why we built Keptn, but in the end we wanted to save people from building a lot of automation that they then have to maintain. We wanted to provide a better alternative, because you should not build and maintain large scripts for automating things.

So, coming back to the use case of performance: I would love more people to automate performance analysis into the delivery pipeline. I'm not sure whether you're running performance tests yet or not, but what I see is that a lot of people are held back from including performance tests in every build. Why? Because many of the people executing tests have beautiful dashboards.
They're Grafana dashboards, or LoadRunner dashboards, or whatever tool they're using, and while these dashboards are pretty to look at, it's really hard to analyze them unless you're a real expert who knows within a second whether this is good or not. So this doesn't scale. Even if you already deploy automatically and test automatically, analyzing manually is hard.

What we bring to the table with Keptn is an automated way of taking the knowledge you have in your head, the knowledge you put on the dashboard, which metrics are important and which patterns you identify on those metrics, and automatically analyzing them for you, scoring them, and then giving you an easy score that says: this build is solid and sound, or we'd rather go back immediately and figure out why we make 10,000 database calls or why we are consuming so much memory.

Okay, so a closer look at how this works in Keptn. I talked about SLOs, service level objectives. The way this works in Keptn is that you define a list of SLIs, service level indicators. An SLI is nothing other than a metric. I have an example here with five metrics: response time, failure rate, and a couple more; at the bottom there's even the number of open security vulnerabilities. If you have a tool that provides you the number of open vulnerabilities, that's a great metric to have. So you specify what you normally look at.

Then in Keptn you specify your targets, your SLOs, where we allow you to define two levels, warning and pass, meaning you expect a certain metric to have a certain value. And we go a step further: you can not only specify a fixed threshold, say, what does it say here, response time should be faster than 100 milliseconds, then it's green, or faster than 250, then it's warning. If you look down to the test step "login" response time, we're actually combining criteria: you can say this particular function that we are testing should be faster than 150 milliseconds, but also should not get slower by more than 10 percent compared to the previous good builds. So we can also do regression detection on metrics automatically.

So you specify the metrics, you specify your SLOs, and you specify your overall target. When a build goes through your pipeline, Keptn automatically reaches out to your Prometheus, or your APM tool of choice, pulls in the metrics, compares them against your objectives, grades everything, and you get a score between 0 and 100; we normalize this to 100 points in total. You can specify weights if you want to, and you can also flag individual metrics as key metrics, or key SLIs.
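To make this concrete, here is a minimal sketch of what such an SLO file can look like in Keptn; the metric names and thresholds are illustrative, not the exact ones from my slide:

```yaml
# slo.yaml: minimal sketch of a Keptn SLO file (names and thresholds illustrative)
spec_version: "1.0"
comparison:
  compare_with: "single_result"       # compare against the last passing evaluation
  include_result_with_score: "pass"
  aggregate_function: "avg"
objectives:
  - sli: "response_time_p95"
    pass:
      - criteria:
          - "<100"                    # green if faster than 100 ms
    warning:
      - criteria:
          - "<=250"                   # warning up to 250 ms
    weight: 1
  - sli: "response_time_login"        # the combined absolute + relative criteria
    key_sli: true
    pass:
      - criteria:
          - "<=150"                   # faster than 150 ms...
          - "<=+10%"                  # ...and no more than 10% slower than the last good build
total_score:
  pass: "90%"                         # overall target to pass the quality gate
  warning: "75%"
```

Every objective contributes its weighted points to the total, and the total_score thresholds decide whether the evaluation passes, warns, or fails.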
So build number one comes along, you push it through your pipeline, and it looks good. Build number two comes along; the developer may have made some changes. What do we see? Things are getting slower, plus we all of a sudden make more calls to the backend service, maybe even a recursive call, and therefore we're getting penalized. We get fewer points, and therefore immediate feedback that this build should not go all the way through, because we would end up making twice as many calls to the backend service once it reaches production. I get immediate feedback.

Maybe I fix the thing, and build number three comes along. It looks like I fixed the performance problems, I brought the number of service calls back to one, but all of a sudden I've introduced a security vulnerability. This is also part of performance, right? Because if it's not secure, who cares about the best performance? So again we're getting penalized. And then, in the best case, build number four comes along and everything is green again.

This is the way Keptn works. This is the way we allow you to define which metrics, which indicators, are important and what you want to compare them against, but instead of you doing it manually, we do it fully automatically for you.

A little bit behind the scenes of Keptn: when you kick off the evaluation, we have a so-called lighthouse service, and the lighthouse service is driven by the SLO definition. You define your SLOs in a YAML file that you put next to your source code, in your source code repo. You can see it here, I know it's a little small from the back, but the objectives are error rate, JVM memory, and number of database calls, and you specify what you expect these metrics to be. But here's the important thing: this is what you expect; it might come from your architects, your performance engineers, your site reliability engineers. Where do the metrics themselves come from? Keptn is agnostic to the underlying observability platform. You can build a service that delivers the data, and we have integrations with, for instance, Prometheus, Dynatrace, and Neoload, among others. Whoever is responsible for observability and monitoring in your organization then defines how you actually query each particular piece of data: how do you get the error rate, how do you get the number of database calls? So it's separated: the objectives and how you get the data, a complete separation of concerns, because typically two different teams are responsible for those.

Okay, and when you trigger the evaluation, what happens? Keptn reaches out through an event to that tool, everything is event-driven, retrieves the data, and once all the data is retrieved, passes it back, compares it against your objectives, and then you get your scoring. This SLO validation is the core component of Keptn.

To give you one example: one of our early adopters, Christian, was responsible for maintaining GitLab pipelines, and his task was to include an SLO validation every time the GitLab pipeline deploys and tests. Now he just makes a call to Keptn, and Keptn does the whole evaluation for him, saving a lot of time, because otherwise he would have had to build all this logic into his GitLab pipeline himself. We provided it for him, it works out of the box, and in his case he's using Dynatrace as the data source.
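And here is roughly what the matching SLI file can look like for, say, the Prometheus integration; the queries are illustrative, and the $SERVICE and $DURATION_SECONDS placeholders are assumptions about what the provider fills in at evaluation time:

```yaml
# sli.yaml: sketch of an SLI file for a Prometheus SLI provider (queries illustrative)
spec_version: "1.0"
indicators:
  response_time_p95: "histogram_quantile(0.95, sum(rate(http_response_time_ms_bucket{handler='$SERVICE'}[$DURATION_SECONDS])) by (le))"
  error_rate: "sum(rate(http_requests_total{status=~'5..',handler='$SERVICE'}[$DURATION_SECONDS])) / sum(rate(http_requests_total{handler='$SERVICE'}[$DURATION_SECONDS]))"
```

Notice that this file says nothing about thresholds, and the SLO file says nothing about queries; that is exactly the separation of concerns between the two teams.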
So this is the core component that lets us analyze SLOs very easily and get this into existing pipelines. But the thing is, with Keptn we wanted to solve a much bigger problem. The core is this data-driven approach, where we always evaluate against SLOs and analyze how your system is currently doing, but what we really want to solve with Keptn is the following. A quick show of hands: who is using Jenkins in your organization? Okay. Who is using GitLab? Perfect. Who is using PowerShell scripts for deployment? Right.

These are all great tools, and they all do certain things pretty well, but what we have observed is that your classical pipeline has a hard-coded set of steps and hard-coded integrations with your tools. Sometimes you even have configuration in your pipeline; I've seen Jenkins pipelines, also internally in our organization, where the YAML files, the deployment definition files, are hard-coded somewhere in the Jenkins pipeline. This doesn't scale; this becomes very ugly very fast. We wanted to solve this problem, because we want to make life easier for people who need to automate more things into their pipelines.

So here's what we do with Keptn: we rip apart the hard-coded process and tooling. On the left side you specify, in what we call a shipyard file, your automation sequences, meaning which tasks you want to execute. On the right side you have your tooling, completely separated. All the configuration is maintained in a Git repository, and then we use eventing to connect the two things together.

To give you an example, because we are talking about performance engineering here: if you want to use Keptn for the complete performance engineering sequence, you want to deploy a new artifact, you want to test it, and you want to have it evaluated. So you specify deploy, test, and evaluate, without any tool context. In my example I can then say: Keptn, trigger the performance sequence in a certain stage, and I give it some additional metadata, like which image I want to test.

What Keptn then does is send a cloud event out to the world and ask: who is there in my environment that can deploy this particular artifact in staging with the blue-green deployment strategy? I may have Helm doing it, I may have an Argo pipeline, I may call GitLab, I may call a Jenkins pipeline. This is all event-driven, and the tools subscribe to the event to do the actual action. Once a tool is done, it sends an event back.

The next task is test. Keptn sends an event: I need somebody who can execute a performance test in staging. Today you may use JMeter. The nice thing about this is that if your organization decides next year to move to a different tool, you don't need to update any of your automation code. The only thing you need to do is turn off the subscription to the event for JMeter and let your new tool subscribe to the cloud events; they're all standardized. That's the beauty: switching tools becomes a matter of a subscription to the event bus.

And then the evaluation, that's the key thing; I explained how this works earlier. Keptn sends an event out to your observability tool: please give me the data specified by the performance engineers in the SLO and SLI files, give me back the data, and then give me a thumbs up or a thumbs down.
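Pulling this together, here is a sketch of what such a shipyard file can look like, using the Keptn 0.8-style syntax; the stage name and property values are illustrative:

```yaml
# shipyard.yaml: sketch of a tool-agnostic sequence definition
apiVersion: "spec.keptn.sh/0.2.2"
kind: "Shipyard"
metadata:
  name: "shipyard-performance"
spec:
  stages:
    - name: "staging"
      sequences:
        - name: "delivery"
          tasks:
            - name: "deployment"          # no tool named; whoever subscribes, deploys
              properties:
                deploymentstrategy: "blue_green_service"
            - name: "test"                # same for testing...
              properties:
                teststrategy: "performance"
            - name: "evaluation"          # ...and for the SLO evaluation
```

You would then kick this off with something like `keptn trigger delivery --project=... --service=... --image=...`, and Keptn turns each task into an event that the subscribed tools pick up.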
One thing I almost forgot, on the bottom right: a notification tool. You can have multiple tools subscribing to the same events. So for instance, if you want a Slack notification after the evaluation is done, you just have Slack subscribe to the event, and then you get one every time a test finishes.

Terminology-wise, we call the definition of the automation sequences on the left side a shipyard file. We call the tools that subscribe to Keptn events the uniform, because the captain on a ship needs a uniform; that's the tool belt. We keep all the config in the config repo that Keptn maintains. And then we have cloud events, and we're currently working with the CDF, the Continuous Delivery Foundation, in a special interest group to standardize these events, so that in the future, hopefully, every deployment and testing tool out there will just natively understand them, and then finally none of us needs to maintain any custom tool integration scripts anymore, where you call API one, try to figure out what the response is and how to parse it, and then they change the version of the tool and you need to change your integration. This should be a thing of the past.

So the point I want to make, and this is the nice thing about Keptn, especially when it comes to performance engineering: you pick your SLOs, your metrics, and the objectives you have. Think about how I started: understand your application, pick your SLOs. Then you hopefully have some tests, because without tests you don't generate any traffic. Keptn doesn't take away the pain of having to write tests, but we take away the pain of triggering and orchestrating the test execution. You can also use Keptn for the complete end-to-end automation, so it's not constrained to one thing. Keptn orchestrates all of this for you, and on the right side you see all the tools. We're not replacing any tool; we want you to use the tool of your choice for each particular task. There are so many great tools out there; I mentioned GitLab and Jenkins earlier, they're all great tools. If you have invested in them, don't throw them away, but don't build additional automation code. I always say: friends don't let friends build their own automation; friends first suggest their friends look at Keptn. I know the statement on the bottom is bold, but it says 90 percent less custom automation code, and that's what we've seen from people like Christian, whom I brought up earlier. It's event- and data-driven, it's based on open standards, and all the configuration is stored, maintained, and versioned in Git.
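To show what "based on open standards" means in practice, here is roughly what a Keptn cloud event looks like, the kind of payload a notification tool would receive; on the wire it is JSON, and the field values here are illustrative:

```yaml
# Sketch of a Keptn CloudEvent as a subscriber (e.g. a Slack notifier) would see it
specversion: "1.0"
id: "6de83495-4f83-481c-8dbe-fcceb2e0243b"
source: "lighthouse-service"
type: "sh.keptn.event.evaluation.finished"   # the topic tools subscribe to
datacontenttype: "application/json"
data:
  project: "ecommerce"
  stage: "staging"
  service: "search"
  evaluation:
    result: "fail"
    score: 50
```

Any number of tools can subscribe to this one event type without Keptn needing to know anything about them.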
Now, Keptn in the real world: how does Keptn look outside of my little environment here? I am from Austria, which is also a famous song in Austria, and this is why I'm very proud of this one: Austrian online banking is currently completely automatically validated with Keptn. Their software organization is called Raiffeisen Software, and there's a blog post on Medium linked on the bottom. Every time they push new builds of their services through their Jenkins pipelines, they have Keptn automatically validate that release based on their SLOs.

Another cool example, because we are talking to performance engineers here: I have a YouTube video linked on the top, it's a little small. Mike Kobush works for the NAIC, the National Association of Insurance Commissioners. He's a big LoadRunner expert and is now also using JMeter, and he says that when he runs these large-scale load tests against their applications, it typically takes him 30 to 60 minutes to analyze the results. With Keptn, it takes him one minute. That means every week he has a couple more hours to go fishing. That's what he told his CTO, and I have him on record in this video.

Another cool use case is from Vitality, a large insurance company. They have a central software organization that builds their services and then deploys them to different countries. So it's the same software, but with different feature flags that turn on country-specific things. What they're doing is testing every single configuration, every single tenant, and Keptn gives them feedback on whether each configuration is good. What you can see here is that the quality gates are all red, so probably they need to invest a little more in quality, but they're a really cool team building great software.

Also, this was really nice: I gave a presentation a couple of months ago at a performance conference, where Tarash, a performance engineer at Facebook, saw Keptn and made this really cool statement: Keptn feels like a reference implementation of Google site reliability engineering and the Site Reliability Workbook. That's a nice testimonial, especially for the development team that stands behind Keptn.

So, the next slide says let's wrap it up, and I want to wrap it up with the slides, but I know I'm a little fast, so I actually also want to show you a little bit of Keptn live, so you see it's not just screenshots. My advice for you, since this talk is about optimizing performance based on SLOs, and I want to recap how I started my presentation: understand the applications you're responsible for, whether you're a tester or a performance engineer. Look at distributed traces, understand the patterns; look at my presentations, all the links will be in the slides. Next, once you know the metrics, put them on a dashboard. Use your favorite observability tool and put them on dashboards, to see whether these metrics really tell you after a test that something is wrong. And if they do, you can convert the metrics behind the dashboard into an SLI and an SLO definition in Keptn, and then have Keptn automate that analysis completely for you. And then you want to bake it into your existing tool landscape and have Keptn automate all of this, every time you trigger a pipeline.

Also think about what I said earlier: Keptn is not just there for performance analysis; Keptn can orchestrate any type of process. I always get the question: so is Keptn another Argo? Is Keptn another Jenkins? What is Keptn? Is it just doing delivery?
No. Keptn brings you data-driven orchestration of automation sequences, whether that's anything you need to do in delivery or in operations. We have a lot of users also using us for auto-remediation, meaning when there's a problem in production, they trigger a Keptn sequence in which Keptn first notifies people, then maybe scales up or scales down, does whatever needs to happen in the context of the problem, then evaluates against the SLOs, and this is the critical part, and based on that decides: was the system brought back to a healthy state? If yes, great; if not, execute the next action. So be reminded of that.
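As a sketch of what such a remediation sequence can look like, here is a Keptn-style remediation file; the problem type and action values are illustrative:

```yaml
# remediation.yaml: sketch of a remediation runbook (values illustrative)
apiVersion: spec.keptn.sh/0.1.4
kind: Remediation
metadata:
  name: remediation-response-time
spec:
  remediations:
    - problemType: "Response time degradation"
      actionsOnOpen:
        - action: "scaling"            # first try: add a replica
          name: "Scale up"
          description: "Add one replica to the service"
          value: "1"
        - action: "featuretoggle"      # if SLOs still fail, flip a feature flag
          name: "Disable promotion feature"
          description: "Turn off the expensive feature"
          value: "off"
```

After each action, Keptn runs the same SLO evaluation you saw for delivery, and only moves to the next action in the list if the system is still unhealthy.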
So Keptn is a lot of things, and I only had a couple of minutes to talk about it today, with a focus on performance engineering. If you want to learn more, I gave a couple of presentations at different conferences, and you'll find a lot of material about Keptn. We obviously welcome every new community member, and there's a lot of cool stuff on the Keptn website: if you go to keptn.sh, you'll find all the information you need to get started. We also have a live Keptn installation, a demo installation you can click through, which I'm going to show you in a second. You can sign up for the Slack, you can download and install it, you can go to GitHub and star us; you can do whatever you like, as long as you say nice things about us.

So now I have, I think, a few minutes before Q&A. Ten minutes? I've got ten minutes, even. I want to make sure we have time for Q&A, but let me wow you with some demos. This is the Keptn public demo live environment, and it's one you can get to yourself: if you go to keptn.sh, you can click on "explore live demo", which brings you to a little tutorial that explains all the projects on here. We're working very closely with Litmus, which has been a great partner for us, and the tutorial explains the different scenarios and links to keptn.public.demo.keptn.sh, which is this page here. I want to focus on the Litmus example.

What I have here is a so-called Keptn project. Keptn projects are organized so that you have a project where you define what type of automation sequences you want to provide in the context of that project, and then you can add services to it, and those services can enjoy the automation. In this case I have a hello service, and you can see that build number 011 was deployed. What I see in the default view, because this is the most important thing for a service owner, is the quality of the service. You see the heat map I showed you earlier, and on top of the heat map, I'm not sure how easy it is to see, is the score. I only have two metrics here, two SLOs, being evaluated, but you can go back in history. If I scroll down a little, I see the individual values, and I also see which value each was compared against, in case you decide to compare with previous releases. So that's really neat. Most importantly, this answers the question that matters most: are we green, are we yellow, or are we red?

Behind the scenes, much more happened. There was a delivery sequence, and that delivery sequence included first a deployment, then a test, then the evaluation. The deployment in my case was done with Helm; the test used a combination of JMeter, to run some load, and Litmus. That's the nice thing about an event-driven system like this: you can say "I want a test to be executed", and then, depending on your project and your event subscriptions, you can have one or many tools do something. In this case I'm running load with JMeter, and at the same time Litmus kicks in with chaos engineering. Once these are done, Keptn triggers the evaluation.

All right, so how can you define this? This is a medium-complex environment; actually, let me show you here. Every project behind the scenes has a Git repository, as I mentioned, so I click on it; it's all publicly available. Initially, when you create a project, you start with a so-called shipyard file, and this is really your sequence definition. So I have a sequence called delivery that does a deployment, a test, and an evaluation, as you can see here. There's no hard-coded tool integration, because we don't reach out to any tool directly; Keptn sends events to trigger the right tools. Bless you.

And then, very importantly, Keptn also automatically keeps branches for every stage. You can group your automation sequences into so-called stages, and in those branches you put the actual files the individual tools need, because obviously Helm needs a Helm chart, JMeter needs a JMeter script, Litmus needs the chaos experiment, and for Prometheus you need to define the queries. But this is all organized in Git.

All right, coming back to my little example here. The last thing I want to show you: when I trigger this sequence, you can trigger it either through the Keptn CLI, the command line interface, or through the API, and for every step Keptn sends a so-called cloud event. This is the stuff we're currently standardizing. In this case Keptn sends "deployment triggered", and whatever tool has subscribed to that event can say: hi, it's me, I want to do it. Then it sends a started event, and once it's done, a finished event. These are all cloud events, with additional information about what the tool did, in this case, for instance, the URL it deployed my application to, and this information flows to the next set of tasks. So the next event payload Keptn sends out contains the initial configuration change plus all the information that came in from Helm, and then it says: well, in this case I need to trigger some tests. It's all event-subscription based; you have your subscriptions here on the so-called uniform page, so you can see which tools are subscribed for a particular project and a particular stage.
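To make that event flow concrete, here is a sketch of the lifecycle for one task and how data is carried forward; the event types follow the Keptn naming scheme, and the payload details are illustrative:

```yaml
# Sketch of the triggered/started/finished event chain for the deployment task
- type: "sh.keptn.event.deployment.triggered"    # sent by Keptn's shipyard controller
  source: "shipyard-controller"
- type: "sh.keptn.event.deployment.started"      # the subscribed deployer claims the task
  source: "helm-service"
- type: "sh.keptn.event.deployment.finished"     # result data travels with the event...
  source: "helm-service"
  data:
    deployment:
      deploymentURIsPublic:
        - "http://hello-service.staging.example.com"   # illustrative URL
- type: "sh.keptn.event.test.triggered"          # ...and is included in the next task's
  source: "shipyard-controller"                  # event, so the test tool knows what to hit
```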
All right, I just got the sign that it's time to wrap up, because hopefully we have questions. If you are interested in Keptn, we have a Keptn booth in the project pavilion, tonight, tomorrow after two, and on Friday. I also want to say a big thank you to Dynatrace. Dynatrace is my employer and the big sponsor of this project, and I want to say thanks for investing in open source and giving people a chance to work on it. If you're interested in that as well, we also have a booth downstairs; I think it's G11. But now I want to open up for questions, if there are questions in the audience, then over there.

So the question, to repeat it, was: what's the difference to Argo? Okay, specifically, how do we compare to Argo analysis, because it also pulls in some metrics. I don't want to say there's no overlap, but what we try to be is not a tool just for delivery. That's what I wanted to say earlier: with Keptn you can automate the sequences you need in delivery, but also in operations. And we always pull in data through an SLI and an SLO definition, which then allows the orchestrator to decide whether to go on or not. If you are happy with Argo and your Argo analysis, that's fine, right? If you can orchestrate your deployment, your tests, your evaluation with Argo, and notify, and that works for you, then stick with it. What I showed you today is a separation of concerns between the process definition, where you can define whatever process you like, and the tooling, which is kept separate; the magic behind the scenes is that they're connected through events. So you can change the tooling at any point in time without having to think about what that means for the process definition, because the process doesn't change just because you change the tooling.

Also think about my test scenario. I have a test step, because that's what I want to do. In one project that may run functional tests, in a second it may run load tests, and maybe next year your company says we need to add security tests to every process, every project. The only thing you have to do is subscribe your security tool to the test event, and it automatically kicks in. These are some of the considerations we've put into the architecture of Keptn, and it's not constrained to delivery, even though we talk a lot about delivery. For automated remediation in production, you define the sequence of what you want to do when a problem strikes: do you first want to notify people, then, I don't know, scale up, then validate, then do the next step? You can define what we call remediation runbooks, but they are orchestrated in the same way you saw for the delivery use case. But see it for yourself, right? I don't want to claim it solves all of your problems if you're already happy with what you have.

So the next question is: what's the best practice for microservice architectures? Do you rather test microservices individually, or do you deploy everything and test everything together? I think the best practice is: do both, if possible. That means if you have a microservice and you can run it in isolation, you can mock away the dependencies, or at least deploy an environment where all the dependencies are also there, and you can run tests against it.
Then that's the ideal state. Because remember my number one problem pattern from today, the N+1 query problem: if your microservice has always made one call to the database, and all of a sudden it makes five calls to the database, then you don't need to deploy the whole app in a big environment, because you already know something has changed that will have a huge impact later on. Most of these patterns can be found just by testing in isolation, as long as you have the data about the test calls that go into the service and the calls that go out. And this is why distributed tracing is so great, because it shows you exactly that.

Sorry, yes. So the question is: can you share tests across services? Yes. I showed you earlier that, for instance, the JMeter files were stored in Git. You can either store them on the main branch, then they're available for every service in every stage; you can store them at the branch level, then they're available for every service in that stage; or you can define them for an individual service per stage. So it's like an inheritance hierarchy.

The next question: does Keptn support New Relic, and Bitbucket? So, I would love it if Keptn supported New Relic; it doesn't right now. We do, however, have Git issues out there, ready, for New Relic, for Datadog, and I think also for Instana. It's an open source project, so if you have New Relic, Datadog, or any other data provider, join us and help us implement it. It's very simple: we have templates for building these SLI providers, like the ones we built for Prometheus and Dynatrace. But as far as I know, today we don't have one for New Relic; we just have an open feature request.

The follow-up question: what does implementing the subscriptions and getting a tool to integrate actually look like? Are there libraries for that? Yeah, let me show you quickly. The nice thing is that all of our new services start in the keptn-sandbox organization; that's where the Jenkins library and the other integrations live. For all of these integrations, if you want to get started, there's a Keptn service template, a Go template that you just clone. This here is what subscribes to the task, exactly, and the only thing you need to do is implement the event handler. And then all of this runs in Kubernetes? Exactly. You deploy that service in Kubernetes, let's say on your target cluster, and in the configuration of that service you say: I want to subscribe to this event. That service then becomes the thing that actually calls, let's say, the New Relic API, or whatever else you want to integrate.
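For a flavor of what that subscription configuration can look like: in the Keptn 0.8-style setup, the template's Kubernetes deployment runs a "distributor" sidecar next to your service, and an environment variable names the event type to subscribe to; the image version and topic here are illustrative:

```yaml
# Sketch: distributor sidecar container in the integration's Kubernetes deployment
- name: distributor
  image: keptn/distributor:0.8.0            # version illustrative
  env:
    - name: PUBSUB_TOPIC                    # the cloud event type to subscribe to
      value: "sh.keptn.event.get-sli.triggered"
    - name: PUBSUB_RECIPIENT                # forward matching events to the service container
      value: "127.0.0.1"
```

Changing what your integration reacts to is then just a matter of changing that topic, which is exactly the "switching tools is a subscription" idea from earlier.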
Now, we are currently working with the CDF, the Continuous Delivery Foundation, in a special interest group to completely standardize all of our cloud events, which means that once the standard is through, companies like New Relic and others can just build this natively into their products. That opens up a cool new opportunity, because all of a sudden we'd be standardized on events that everybody builds on and benefits from.

I think that was the sign for me. I will be around a little longer for the people in the room, and for those online, feel free to connect with me through LinkedIn, Twitter, or the Slack channel. Very happy to follow up with you. Thank you.