Perfect. All right, let's get started. I'm Matthias Loibl; I work at Polar Signals, where we do continuous profiling. With me virtually today is Nadine Vehling, who works at Grafana Labs as a UX designer, and we want to talk to you about SLO-based observability for all Kubernetes cluster components. I actually had some learnings while making the slides, but yeah, let's get into it.

So this is what we're trying to do: we have all sorts of components in a cluster, and we want to be able to measure, and kind of objectify, the availability targets of the individual components.

First we're going to talk about the fundamentals of SLOs. This is a beginner talk, so I want to quickly cover that before talking about Pyrra a bit more, and then how to create SLOs with it. Then, obviously, we'll look at cluster components, do a quick demo, and see what the future holds.

So what are SLOs? Imagine you go to the GitHub page of a project. Many requests are made, but let's focus on the initial request that loads the HTML, and say you want 99% of these requests to succeed within five minutes. That's a fairly typical SLO; "within one second" might actually be what you'd focus on, but let's say five minutes for this example. SLIs are the underlying measurements, oftentimes just metrics. We use these to measure the number of erroring and successful requests and compare those against the total number of requests our system gets.

That already brings us to availability: we take the number of successful requests, divide it by the total number of requests, and get something like "99.3% of requests are successful". That's the view from a request-based system; you also have batch systems where the operations aren't request-based, but there you would likewise measure the successful events against the total number of events.

The more interesting concept on top of that is the error budget. Let's dive into it in more detail, because I often find it slightly confusing when starting out. There's always the 100%, and we never get to 100%; if you want to try to achieve 100%, good luck, it's basically impossible over weeks and months. So let's say you have a 90% objective. The error budget is always the inverse: with a 90% objective you have a 10% error budget, with 95% you have 5%, with 99% you have 1%, and even above that, with 99.9% you get a 0.1% error budget, meaning only 0.1% of requests in, say, a month can fail.

The interesting part is that the objective and the error budget are static, but you also measure the actual availability. You take the availability, subtract the objective, and divide that by the error budget you want to work with. With a 95% objective: at 100% availability we currently have 5 percentage points left, and 5% out of a 5% error budget is 100% of the budget remaining. At 95% availability, subtract the 95% objective and we have 0% left, and 0 out of 5% is 0%. More interesting: at 98% availability, subtracting the objective leaves 3% out of the 5% error budget we work with, and 3 out of 5 is 60% of the error budget still available. And if we're actually below our objective, at 93% against the 95% objective, we're at minus 2 percentage points, which overall makes minus 40% of the error budget available. So the error budget can actually go negative, and with a 99% objective it can go all the way down to minus 9,900%. Keep that in mind.
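To make the arithmetic concrete, here is the same calculation written out; this is just a worked version of the examples above, nothing Pyrra-specific:

```latex
% Remaining error budget, as a fraction of the total budget:
\[
  \text{remaining budget} = \frac{\text{availability} - \text{objective}}{1 - \text{objective}}
\]
% With a 95% objective (a 5% error budget):
%   (1.00 - 0.95) / 0.05 =  1.00   -> 100% of the budget left
%   (0.98 - 0.95) / 0.05 =  0.60   ->  60% left
%   (0.93 - 0.95) / 0.05 = -0.40   -> -40%, the budget is overspent
```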
So now I hand over to Nadine, and she's going to talk a bit more about Pyrra.

Hi, I'm Nadine. I'm a UX designer at Grafana Labs, and I'm also a maintainer of the Pyrra project. So what is Pyrra? Pyrra is an open-source project that aims to make SLOs with Prometheus manageable, accessible, and easy to use for everyone. In early 2021 there were no projects in the Prometheus community that made it easy to get started with SLOs; the existing tools were mostly based on Jsonnet and YAML. So it was clear that something had to be done. Matthias and I are the maintainers of Pyrra, and we have many individual contributors. It has an open governance model, like Prometheus, to ensure maintainers from different companies can participate in and influence the project. So contributions are very welcome.

How did we start with Pyrra, and what was our approach? What research did we do? We started with greenfield research based on interviewing the community. We recruited participants via Twitter to engage with the community as early as possible. The participants were mostly software engineers and SREs, from small to big companies, beginners as well as advanced users. After some conversations about their experience with SLOs, we started testing prototypes to validate our assumptions. Through these interviews we found that participants had similar challenges with SLOs and Prometheus: even advanced users struggled to configure alerts, or might mix things up. There were questions and insights like "What is an error budget?", "My team deals with alert fatigue, but it's hard and risky to get started with SLO-based alerting", and "How do I configure multi-burn-rate alerts?"

Based on our research, we asked ourselves, for example: how might we reduce information overload, and how might we show relevant and actionable data? One UX principle, Hick's law, fits our use case perfectly. It says that more choices to consider increase mental load and decision time. So we aim to avoid overwhelming users and to provide focus by highlighting relevant data; we want to minimize choices and do the hard work for our users. As you can see on this detail page, Pyrra highlights service insights and provides very actionable content, showing the objective, the availability, and the remaining error budget at first glance. Additionally, Pyrra shows the RED graphs (requests, errors, and duration) to give insight into the current underlying state. Also great news here: multi-burn-rate alerts are configured for you, and the table shows you what is firing and how critical it is. Finally, Pyrra also has an overview page that shows all SLOs and their current state; it's possible to sort and filter these to get a better overview.

So now I hand over to Matthias, who will tell you more about how to define SLOs with Pyrra. Thank you, and have fun at KubeCon!

All right, perfect. Thank you, Nadine; and hello, she's watching from Berlin.

So how can you deploy Pyrra? You can deploy it on Kubernetes: there's a controller/operator that uses custom resource definitions to make high-level SLOs available, and it creates Prometheus rules for the Prometheus Operator to then load into your Prometheus. But we also have support for Docker, and for just running a binary, for example with systemd, where everything is file-based and Prometheus reads the rule files from a folder.

How does it look on Kubernetes? You have the Pyrra API binary or container, and then the Pyrra Kubernetes container as well. The latter takes whatever SLO objects you give it, reconciles them against the Kubernetes API, and creates the Prometheus rules; Prometheus then reads those rules into its configuration. You can either go to Pyrra directly to get the UI, which queries Prometheus in the background, or you can always escape the UI and jump straight into Prometheus from there, to make it even easier to modify the queries.

And that's pretty much the deployment, nothing too crazy. You run one container and configure the API endpoint of Prometheus and the backend, in our case the Kubernetes controller. Then the same for the Kubernetes backend: we say we want to run the Kubernetes component, and there's something called generic rules, which I can touch on a bit later. Nothing too crazy here either.
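As a rough sketch of those two containers on Kubernetes: note that the image tag, port, service names, and flag names below are assumptions based on Pyrra's example manifests from around this release, so check the repository's README for your version.

```yaml
# Sketch of the Pyrra UI/API Deployment; a second Deployment would run the
# `kubernetes` backend the same way. Names, ports, and flags are assumed.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pyrra-api
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: pyrra-api
  template:
    metadata:
      labels:
        app: pyrra-api
    spec:
      containers:
        - name: pyrra
          image: ghcr.io/pyrra-dev/pyrra:v0.4.4  # tag assumed
          args:
            - api
            # Prometheus endpoint the UI queries in the background:
            - --prometheus-url=http://prometheus-k8s.monitoring.svc:9090
            # The backend, here the Pyrra Kubernetes controller:
            - --api-url=http://pyrra-kubernetes.monitoring.svc:9444
```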
We support Prometheus, obviously, but it's also possible to use Thanos or Cortex, for example; they support the same query API endpoints. And, I was too fast, we should also be able to interface with Mimir or M3 or Timescale, because they all have the same APIs as Prometheus.

So how do you create an SLO? You define this ServiceLevelObjective custom resource. You give it the target (over a time window of 14 days, we want 99.9% to succeed), you give it a metric for the errors, to select the time series we count towards the error total, and one for the total amount of requests or events that happened. If you don't have separate metrics, you can use label matchers to match, say, all the HTTP requests with a 500 status code.

A bit more interesting, you can also say: just for this specific handler. I guess we all have APIs that are very unique and specific to what our users need, so you can say that just for this one handler you want that very specific SLO. And then we can also match all the API handlers starting with /api. What we actually want there is not to mash them all together and have one SLO for all of them at once, but to track them individually. So we use grouping by handler, which is like the sum by that you can write in PromQL as well. You get all the different SLOs for the individual API endpoints by writing one configuration, and they are tracked individually.
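Here's a sketch of what such a grouped SLO could look like as a Pyrra custom resource. The metric name and handler labels are made up for illustration, and the exact spec fields may differ slightly between Pyrra versions:

```yaml
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: api-handler-errors
  namespace: monitoring
spec:
  target: "99.9"   # percent of requests that should succeed
  window: 2w       # rolling two-week window
  indicator:
    ratio:
      # Track each handler individually, like a PromQL `sum by (handler)`.
      grouping:
        - handler
      errors:
        # hypothetical application metric, matching only 5xx responses
        metric: http_requests_total{handler=~"/api.*", code=~"5.."}
      total:
        metric: http_requests_total{handler=~"/api.*"}
```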
Underneath, you just get recording rules. Nothing Pyrra does on the alerting and recording-rule side is special; it's all taken care of by Prometheus, which, you know, is reliable and robust, and that was kind of the point. The same goes for alerting: you get the multi-burn-rate alerts, four alerts per SLO, to alert you at different severities. Two critical ones alert you right away when you need to be woken up, and two can just go into a ticket that you look at on Monday or the next day.

So, we're here for Kubernetes, so let's finally look at Kubernetes. That's roughly the architecture: the control plane on one side, monitoring on the other, and the nodes in between; some of these components run on the same nodes. First of all (and this is one of the learnings I had while making this) we probably don't want too many SLOs. Maybe I should change the title to "for Kubernetes cluster components", not for all of them. Make sure to have SLOs for the things that your team really needs; maybe the underlying disk latency, say, isn't too important for your team.

But let's look at the kubelet first. The kubelet is the primary node agent that runs on each node, and it makes sure that the pods it should run are running and healthy. So we can look at the metrics the kubelet exposes: we get the runtime operations errors total, and we get the runtime operations total. Using those in the ServiceLevelObjective custom resource definition, we can say that 99.9% of these operations, the ones making sure pods are running, should succeed within two weeks, and we get our SLO. With roughly one error out of four million operations, we're doing well on the kubelet runtime part.
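As a sketch, that kubelet SLO could look roughly like this. The metric names follow recent kubelet versions (they were renamed around Kubernetes 1.14), and the job label is an assumption, so adjust both for your cluster:

```yaml
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: kubelet-runtime-errors
  namespace: monitoring
spec:
  target: "99.9"
  window: 2w
  indicator:
    ratio:
      errors:
        # container runtime operations that failed
        metric: kubelet_runtime_operations_errors_total{job="kubelet"}
      total:
        metric: kubelet_runtime_operations_total{job="kubelet"}
```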
The kubelet also talks to the API server, to even know what it should be running, right? So we can look at the rest client requests metric and say that all the requests with 500 status codes are the errors, and everything together is the total amount of requests. We put them in and say 99% of them should succeed within two weeks. And again, we now know how well the kubelet is able to communicate with the API server, so we know the kubelet actually knows what it's supposed to be running.

Then another component: kube-proxy. That's one of the examples where I'm not sure you really want to put an SLO on it, but here we are; I did say all components. kube-proxy is just the proxy that runs on each node and configures the network, so you can reach the service backends through TCP and UDP. We want it to be able to apply that configuration within, say, 512 milliseconds. So we take the histogram bucket at 512 milliseconds, compare it against the total number of rule syncs we had, put that into the configuration, and we get a kube-proxy sync-rules latency SLO. Which is great, and it seems the cluster I took this screenshot from is doing quite nicely too.

And now, I think, the most important part: the Kubernetes API server. Let's look at that in more detail. SIG Scalability actually has official SLOs for the API server, which is interesting. I don't know how many of you are using kube-prometheus or the kubernetes-mixin; maybe a show of hands, who has seen this dashboard? Okay, quite some people. So yeah, that was using the SIG Scalability SLOs in the background already. The one problem was that I went all bananas on this and tried to combine latency and errors and everything together, and it was really hard for people to debug: are the requests too slow, or are they erroring? So what I want to do going forward in kube-prometheus is to actually separate it by errors and latency, and then split the latency by cluster, namespace, and resource scope, so you know right away what's too slow.

The Kubernetes API server validates and configures the data for the API objects, and it exposes the REST API to do so. The API server has the request total metric, and for read requests we just want the LIST and GET verbs to alert on, and then again 500 status codes as the errors, compared against the total amount. Put that into the ServiceLevelObjective custom resource definition, and here we are: now we know how well we're doing on API server read responses, so for pods and all of these objects, how good we are at reading them. For writes it's similar, just the POST, PUT, PATCH, and DELETE verbs, compared against the total amount of POST/PUT/PATCH/DELETE requests; again two weeks and 99%, and we get the same for write requests. So all the controllers and operators talking to the API, updating and reconciling the objects, are doing quite well too.

And then latency. I think we want Kubernetes to be able to reconcile quite frequently and correctly, and for that we want the latency to be low as well. So for reading, we want the resource-scoped requests (not cluster-scoped) to succeed within 100 milliseconds; that's why the less-or-equal 0.1 seconds is there, and we compare that bucket against the total number of requests in the histogram.
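Here's a sketch of that read-latency SLO using Pyrra's latency indicator. The label matchers follow kubernetes-mixin conventions but are illustrative, and the spec fields may vary by Pyrra version:

```yaml
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: apiserver-read-resource-latency
  namespace: monitoring
spec:
  target: "99"
  window: 2w
  indicator:
    latency:
      success:
        # read requests answered within 100ms (the le="0.1" bucket)
        metric: apiserver_request_duration_seconds_bucket{verb=~"LIST|GET",scope="resource",le="0.1"}
      total:
        metric: apiserver_request_duration_seconds_count{verb=~"LIST|GET",scope="resource"}
```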
If we do that with this metric, though, it's actually outdated. SIG Instrumentation has been active over the last half a year or so, and there are new metrics that now have "slo" in their name and that don't count the webhook requests anymore, so they're a bit cleaned up. We can use these now: again, put them into the custom resource definition, and all of a sudden we know that, in this example right here, we had about 4,000 requests that were too slow out of about 2 million. Overall that's quite acceptable for this API server as well. The same is true for write latency and so on and so forth; it kind of keeps repeating, as you can see.

So let's look at the demo, going into this cluster. Actually, first I wanted to mention that this is a Typhoon cluster running on DigitalOcean. With Typhoon you can get at everything: you can look at etcd and you can control everything, kind of like "Kubernetes the hard way". So this is running Fedora CoreOS on DigitalOcean, with kube-prometheus deployed on top, plus the Pyrra project. Let's look at this: we have the API server, CoreDNS, etcd, different components, and down here even the Prometheus endpoints, so you even know whether the Prometheus monitoring stack itself is doing well.

Let's look at the resource read latency over the last day. We can see all of that is doing quite nicely. We had a little spike apparently at around 10 a.m. down here, but that was fine: we went from 60% error budget to 59%, so whatever, it shouldn't even alert.

The other thing I want to look at is the CoreDNS response latencies. And I don't know who paid attention: just really quickly, before starting the talk, I went into the deployment and put a CPU limit of 10 millicores in there. So now we can see CoreDNS is doing quite badly, and you can see we have an incident. But, and that's kind of sad for the demo, it's not bad enough. Usually, if it were bad enough, one or in some cases even all four of these multi-burn-rate alerts would fire, depending on the severity and on how quickly the budget is burning. The multi-burn-rate alerts alert on how quickly the error budget is burning: if you burn pretty slowly over two weeks and might end up with zero error budget left, that's just a ticket, and you can deal with it because it's slow. But if we're burning really fast, over the course of, say, an hour, then we want to page pretty quickly, so that somebody can take a look and troubleshoot.
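To give a feel for what the generated alerting looks like, here's a simplified sketch of a fast-burn and a slow-burn alert pair in Prometheus rule YAML. The recording-rule names and thresholds are illustrative (Pyrra generates the real ones for you), and the burn-rate factors roughly follow the multi-window pattern from the SRE Workbook:

```yaml
groups:
  - name: coredns-response-errors  # hypothetical SLO with a 99% target
    rules:
      # Fast burn: pages. A short and a long window must both agree,
      # so a brief blip doesn't wake anyone up.
      - alert: ErrorBudgetBurn
        expr: |
          coredns_responses:burnrate5m > (14 * (1 - 0.99))
          and
          coredns_responses:burnrate1h > (14 * (1 - 0.99))
        for: 2m
        labels:
          severity: critical
      # Slow burn: would exhaust the budget over days; fine as a ticket.
      - alert: ErrorBudgetBurn
        expr: |
          coredns_responses:burnrate2h > (3 * (1 - 0.99))
          and
          coredns_responses:burnrate1d > (3 * (1 - 0.99))
        for: 1h
        labels:
          severity: warning
```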
Yeah, I think that's it for the demo; let's continue with the slides.

Both Pyrra and all these SLOs are just on my machine right now, and I want to get them into the core of kube-prometheus, not only have them as an add-on. So hopefully in the next couple of weeks or months, once you update kube-prometheus, you might get this out of the box, and let's see how we can work with the kubernetes-mixin to make that happen as well.

So what's next for Pyrra and all of this? Again, I want to put this into kube-prometheus, so you don't even need to do much to get something as a baseline. Version 0.5 is about to release; I need to write some docs, but being at KubeCon, I didn't really have time for that. Histograms are shown for the latency SLOs now; you could see that looking at CoreDNS just now. The detail page auto-reloads, so the graphs are kept updated. We now also have an optional Grafana dashboard: with the generic-rules flag you can get that, so you don't even need to get out of Grafana and can look at some SLOs in Grafana directly. Not all features are supported, but I think it's good enough for a start. We also saw, and that was kind of interesting, that some companies were running this and the underlying metrics were gone, so the SLOs were gone too, and everyone thought everything was fine; actually they didn't even have the metrics to do any alerting anymore. And somebody contributed an OpenShift example, so if you're running OpenShift, you might want to check that out.

For the people who really paid attention: we want to make it easy, but I actually only talked about YAML, right? That is something I'm working on for the next release, also on my machine and on a branch on GitHub: an interactive React-based editor where you can put in the label selectors for the time series, click on preview, and see the current state, what happened in the past, to inform you about how your SLO is currently doing. You probably still want to do the math and not depend fully on that, as Björn from the Prometheus team would say; he's very strict about that, and I think he's right. But just to get an understanding, I think it's quite nice. And then the table with the multi-burn-rate alerts: I think those should have graphs, so that you know how high the burn rate is for a firing alert.

Try Pyrra! We have a demo running at demo.pyrra.dev, so you don't even need to install it, and if you append /grafana, you can also see the Grafana dashboard. Obviously the SRE books are great, so read those. And there's Grafana's Big Tent podcast: I was on it at the beginning of the year, and together with Tom Wilkie, Björn Rabenstein, and Mat Ryer from Grafana we discussed SLOs in detail, all sorts of interesting conversations. Check it out; it's definitely not outdated.

And yeah, that's it. Thank you! Nadine also says hi and thank you, and follow us on Twitter. How are we doing on time? We can do Q&A. Oh, 10 minutes? Oh my god, I was quick. Perfect, let's go.

Hey Matthias, thank you for the introduction to Pyrra, that's very interesting. How do you envision Pyrra being integrated into the workflow of an engineer? Do you see me going to Pyrra after seeing an alert on Slack, or should I have it on a big screen in my development room? How do I get in touch with Pyrra, and where do I go from there?

Right, so these days I actually don't open dashboards or Pyrra that actively myself, only if I find myself having deployed something where I think, for example, the latency improved quite drastically (and we tend to do that quite a lot). Then I look at it, oftentimes just an hour afterwards, to see the trend on the p99 latency. Otherwise, when there's an alert... I don't really have it open much of the time. It should just be a tool in the background that gets out of your way. And where do you go from Pyrra?
You can always open Prometheus from it and modify the PromQL. I can quickly show you: down here, for example, you click on it and it opens. I haven't port-forwarded the Prometheus container, but you'd get the query in the PromQL editor.

Right, perfect. Thank you.

First of all, very cool project, thank you. I was wondering how you would include this in, for example, a multi-stage delivery pipeline. Can I react based on what's happening with my SLOs? How would I interact with Pyrra to take a decision and maybe trigger a rollback, for example?

Yeah, I mean, if there's a bad rollout and it's firing, or you might be curious after the deployment and already see a spike in the latency or the error rate, then roll back. There's no integration for deployments themselves. I think Keptn is doing something along those lines, so maybe check that out for deployment-specific things. I can see reaching out to (not maybe, I definitely want to reach out to) Argo CD, for example, to integrate with them as well. That's kind of where it's at, but Pyrra is really focused on the everyday operation of constantly-running systems, and not too much on individual deployments themselves, right?

Yeah, I'm actually familiar with Keptn, and they have their own way of defining SLOs, and this is yet another way to define SLOs. So it feels like everyone is trying to reinvent the wheel here, right?

Yeah, so I've been attending the OpenSLO calls quite early on, and I think there are some interesting concepts there, but this is very Prometheus-specific; the grouping, for example, wasn't in the OpenSLO spec. So for now it is its own thing, but we could import OpenSLO in the future, for example. I think Sloth is doing that as well, so there are always possibilities for integrating with the larger ecosystem. I mean, it's open source; if you want something, feel free to open a pull request.

That's actually a bit funny, because what I was going to ask about was OpenSLO. If I already have my SLOs defined, maybe in OpenSLO, but I'm using a Prometheus backend, it might actually be much faster to adopt this moving forward. How would I switch? Can I use the same specification? Because I know the OpenSLO specification is very complex, for good reasons, but the simplicity of this is really nice. So what are the plans moving forward? Are you guys talking?

Yeah, so we only have rolling-window SLOs, for example, not calendar-aligned ones like beginning-of-the-month to end-of-the-month, things like that. So if you really want super customizable things, then go with OpenSLO for some use cases. But, let me rephrase that, we are open to adding things like that to Pyrra. I just didn't get to it, and there also was no one opening issues asking for it yet. If no one is asking, I'm not inclined to just put every feature in there and make it the big system that it maybe doesn't even need to be. So if you want anything like that, please open an issue and we can have a discussion over there.

I've been wanting to ask this since the beginning... well, you laugh because I was indeed going to ask about OpenSLO, so I'll ask something else instead. How do you recommend testing these?
Like, actually writing them: if you're going to make changes and modifications to your SLOs, do you have any sort of workflow for verifying you're getting what you want? I guess it's kind of tricky because they're SLOs, but just curious.

Yeah, that's a good question, and I think it's less of a technical question and more of a cultural one. I know for a fact that the Implementing Service Level Objectives book, for example, has an entire chapter on that. I don't really have a good answer myself. At Polar Signals we kind of discuss it: we start out with a low objective target and then increase it once we hit certain milestones, for example, or when for a new quarter we want to change something. That's how we do it. But yeah, whatever fits your organization; I'd say it's mostly a cultural question.

So, to repeat the question: how do we change the SLOs at Polar Signals? We come together in engineering meetings and review the SLOs of the past quarter, and then for the new quarter we have new SLO milestones or objectives, driven by what we want to build as features. We really say, okay, now we want this experience for our users, and then we derive the objectives from that. And it might look pretty crazy: right now, for Q4, we have an SLO that's pretty challenging. But I think that's good, and we can try to make the user experience work the way we have envisioned it. As engineers, we're kind of paid to try to get there, right?

Two questions. The first one's easy: will your presentation be available on the schedule site?

Yeah, the slides are going to be available there; I'm going to upload them, and I think also on SpeakerHub, where you can find them. And once the recording is out, I'll actually link the video from the README of the Pyrra project; we have a lightning talk from Prometheus Day there as well.

Thank you. And then, I mean, this is a loaded question: you mentioned culture change, and we're collecting all these metrics and everything like that. How do you start the discussion around SLOs? Do you have some kind of best practices to get the conversation going?

Right, so is it about collecting the underlying SLIs and metrics, or about how you come up with SLOs?

The SLOs, yeah.

Okay. I mean, again, we have gRPC servers and HTTP servers, and we instrument them with client_golang, for example, or the go-grpc-prometheus project, and we kind of get metrics out of that for free. And then it's really easy to just slap an SLO on that. But what we actually want to do is really sit down and think: what is this service doing? What is the underlying intention of that service? How do we define whether it's doing well? And then take it from there. Oftentimes we don't even have the metrics available yet.
We have that right now as well: for Q4 we first need to put the proper metrics in place to start measuring the service level objectives.

So, we have time for one more, and then we can go over that way. If somebody wanted to contribute to the project, do you have weekly meetings? And what areas do you need the most help with right now?

Right, we don't really have meetings, because it's just the two of us, plus random people showing up in issues and pull requests. But if you want to have a meeting at some point, feel free to reach out. The GitHub project has a project board for version 0.5; most of those things are done now, because I'm about to release it, and I'll try to put in more of the features we want to hit for the next version. So open issues, and let's discuss asynchronously over there. Everybody can participate and follow the conversation as well. You're welcome!

All right, that's it. Thank you, everybody, for showing up. Again, a great pleasure, and see you at the next KubeCon, hopefully!