Runbook automation systems for Prometheus and Kubernetes, and we have Natan Yellin from Robusta.dev talking to us. Hi, everyone, can you hear me? Yeah. Okay, thank you. So I have some live demos planned, but the internet here is a little bit iffy, so I'm gonna do my best, but bear with me, and I've tried to cut them back to the minimum to make them work. Okay, so a little bit about myself. So I'm Natan, and I wear many different hats, and I guess this clicker doesn't necessarily work. Okay, so I'm gonna stay a little bit closer. Yeah, okay, so I wear a bunch of different hats. I was a developer for many years, and I've been an open source contributor for 15 years now, starting with GNOME Linux. I'm co-founder of Robusta, which does Prometheus-based monitoring for Kubernetes, and I do DevRel stuff, so here's an example video I did with my grandmother, DevOps with Grandma Sue, where we talk about Kubernetes and explain in very non-technical terms why it matters. So if you need to explain to your spouse or to other people what you do, that's often a good reference. Now, outside of work, I'm also a crazy plant guy, so I grow a lot of different vegetables, mostly tomatoes, and a whole bunch of other stuff with my wife. Today I'm going to be talking about Runbook Automation: what Runbook Automation is, why you need it and how it can make your life better, and how it works under the hood. Assuming that the internet is okay, you'll see live demos, and if not, you'll see screenshots. So, wait, before I go on, I'm just curious: is anyone here using some form of Runbook Automation in production today? Okay, so I'd love to hear from you after, as I'm also curious about what you all do. And I also just want to check the aspect ratio, that this isn't cut off. Can you see this? Okay, yeah, so one second, let me see if I can fix this. The aspect ratio is wrong. Is that better? Okay, perfect. So, who here knows this alert?
Has anyone seen this before? Oh, yikes. Okay, yeah, I hit the button on this. Has anyone here seen this alert before? This is an alert called KubePodCrashLooping. It's part of the default set of alerts for the Kube Prometheus stack, so it's a super common alert. And what Runbook Automation is, is taking an alert like this and adding on context: connecting that alert to logs, to graphs, adding on knowledge about why an alert like that is firing, and sometimes even being able to fix an alert like that automatically. Now, when we speak about context, I just want to give a non-technical example to show why context matters so much. So, can anyone guess what this object is? Come on, give it a go. A security camera, you say? WALL-E? Oh yeah, it looks a little bit like a robot, like WALL-E, some kind of cute robot, or maybe something from The Terminator. Okay, so to explain what this is, I'm now going to add on context, and this just shows how powerful and how important context is. This is an object that you use when you work from home, and your cat starts jumping all over your desk, and you want to occupy your cat so that you can actually get some work done. And as soon as you see this object with the context, it suddenly becomes apparent what this is and why it matters. But when you see an alert out of context, or when you see an object like this out of context, you really don't know what it is, or whether it means something or not. So that's the goal of Runbook Automation: to add on that context. Now, I recently asked on LinkedIn what context we should gather for that alert I showed you before. So I asked, and I put this in terms that normal humans use, not in terms of context and alerts. I said, what's the first thing you do when you investigate this alert? It's really the same question, just phrased the way people actually think. And I got a whole lot of different answers.
So being the data-driven guy that I am, I went over all those answers, summarized them, and counted up how many times I got each answer. Let me go through those answers. So two people said you should bring in the DevOps team. One person recommended running this command. Does anyone know what that command does? Okay, so for those of you who aren't in on the joke, that uninstalls Kubernetes. But then we got a bunch of serious answers too. Four people said you should run kubectl get events or look at the Kubernetes events, which are events submitted by the API server that give you useful diagnostic information. Four people said you should run kubectl describe, which shows the pod's events and some other stuff. And the overwhelming majority said that what you wanna do when you have that alert is look at the application logs. So you should go and fetch the pod logs, and then you know why the pod is crashing. So what we're going to do today is take that alert and automate that process. We're going to take that alert and automatically pull in those pod logs, and you'll get it right there in Slack. So as soon as you get that alert, you'll have the context on why that alert is firing. Now, this is a simple example, one that we encounter every day, but you can also apply this concept to far more advanced things that we'll see later on. So let me give you a teaser of what this will look like. At the end you're gonna get an alert like this in Slack. And if you look carefully, you can see there at the bottom (if I know how to use this pointer thing, maybe not) the regular alert labels and everything. Essentially we're taking the metadata from that alert, taking the labels, and mapping them onto a Kubernetes object.
And then automatically we're pulling in all that context, so the person who gets that alert doesn't get an alert and think, okay, here's another alert, good luck. Now you have the context and you can see why that alert is firing. So now let's talk a little bit about the Prometheus alerting architecture. The way things normally work with Prometheus, and I'm sure there are people in this room who are far more experienced and expert in this than I am, so if I'm getting anything wrong here, and I'm sure I'm making some simplifications, then feel free to come over afterwards. The way things normally work is Prometheus fires alerts. Those alerts get forwarded to the Alertmanager. And the Alertmanager has more advanced logic on top of that. For example, it groups the alerts, it has a grouping interval, it notices when alerts are resolved; it does a few different things. Actually, the resolving happens in Prometheus, but the Alertmanager has some logic around that too, I believe. And then finally the Alertmanager takes that alert, or that set of alerts if it grouped them, and forwards those alerts by webhook to the destination where you receive alerts. So at the end of the day you get alerts in Slack or in MS Teams or in all these different destinations. Recently we actually added support in Robusta for Cisco Webex, so I guess there are people getting alerts in Cisco Webex, which kind of surprises me. So you get the alerts there in Slack. And now we're going to change the architecture a little bit and add in an additional component. One way that you can implement Runbook Automation, which is very popular and is how we're going to do it, is to add in an extra component here in the middle. So you have alerts that go from Prometheus to the Alertmanager. From the Alertmanager, they get sent, now, not directly to Slack or to MS Teams: they get sent to the Runbook Engine.
The Runbook Engine takes that alert and pulls in extra context: context about what this alert is, what it means, maybe why it's happening, or even how to fix it. And then it sends that alert, plus the extra context, on to the final destination. So that's what we're going to do. And this is a good time to say what Robusta is. Robusta is an open source project on GitHub, of course, and it contains two main parts. The first is the Runbook Engine. That's the engine that takes these incoming events and then adds on context according to a bunch of rules that you define in YAML. The second part is that we've been going over all the alerts in the Kube Prometheus stack, and all the other common alerts that people have, because we all run Kubernetes, so we actually all have very similar alerts to one another. We're taking those alerts and adding on these Runbooks out of the box, so that anyone who gets the alerts just gets a good alert by default without needing to configure anything. And that helps your developers, and people bother you less, because now they understand why alerts are firing. So that's our goal. Okay, so Robusta is MIT-licensed, and please take a moment, scan that, and give us a star on GitHub. That really helps us as a project, and as an open source project it helps us spread awareness about what we do. I will also say we have a bunch of community Runbooks that map out the stuff in the Kube Prometheus stack and other common Prometheus errors. So we're doing our best to really get wide coverage, and this is one of the reasons why it's so important to us to be here today: to engage with the broader Prometheus community and talk about what we're missing and what content we should add. And let me now show you how you set this up. So the first thing we're going to do is go to the Alertmanager's configuration.
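As a rough sketch of that Alertmanager change, it could look something like this. The receiver name and service URL below are illustrative, not the exact values; the Robusta docs have the real ones.

```yaml
# Alertmanager configuration sketch: route alerts to a runbook engine
# by webhook instead of straight to Slack. The URL here is made up.
receivers:
  - name: runbook-engine
    webhook_configs:
      - url: http://robusta-runner.monitoring.svc.cluster.local/api/alerts
        send_resolved: true  # forward resolve notifications too

route:
  receiver: runbook-engine  # default route: everything goes to the engine
```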
We're going to add on the webhook receiver. We have full instructions for this online, and of course the Prometheus docs are excellent and cover this as well. Essentially what we're telling Prometheus, or to be precise, what we're telling the Alertmanager, is that when there's an alert, don't send it directly to Slack anymore; send that alert to the Runbook engine. Now, the next part is kind of the hard part, right? Now you actually have to write the Runbook engine, and the Runbook engine is this HTTP server. It's getting all these alerts by HTTP over webhook, and then you have to take those alerts, parse them, pull out the context about the alert, contact the API server, pull in the logs, and do all this other stuff. So this is the traditional way you would do it: you'd write a bunch of code that runs an HTTP server and has a whole bunch of if clauses and different stuff to handle each of the different edge cases. What we've done with Robusta is try to make this as simple as possible and bring Runbook automation to the masses. So to make that happen, we turn this into YAML configuration. An alert reaches Robusta; Robusta then parses this YAML file, looks up the alert that arrived in the set of rules, and says: okay, for this given alert, how should I enrich it? What context is missing if this is an alert about crashing pods? Or if you get an alert about a node that ran out of disk space, what do I need to see as a person who's on call in order to solve that issue? And that's where the broader community comes in, and all the community coverage, to give you out of the box all the stuff that you need to succeed at Kubernetes monitoring. So here's the configuration. I'm gonna go over the configuration now.
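To make the "hard part" concrete before the YAML walkthrough: the webhook server just described, parse the Alertmanager payload, look the alert up in a rule table, run an enrichment action, can be sketched as a toy in plain Python. This is purely illustrative; Robusta's real engine is configured in YAML and does far more, and every name here is made up.

```python
import json

def fetch_pod_logs(labels):
    # A real engine would call the Kubernetes API here, using the pod and
    # namespace labels that Prometheus service discovery attached to the alert.
    return f"<logs for pod {labels['pod']} in {labels['namespace']}>"

# Rule table: alert name -> enrichment action (a plain Python function).
RUNBOOKS = {
    "KubePodCrashLooping": fetch_pod_logs,
}

def handle_webhook(body: str) -> list:
    """Handle an Alertmanager webhook POST body; return enriched messages."""
    payload = json.loads(body)
    enriched = []
    for alert in payload.get("alerts", []):
        name = alert["labels"].get("alertname")
        message = f"Alert: {name}"
        action = RUNBOOKS.get(name)
        if action:
            # The alert's labels identify the Kubernetes object to enrich from.
            message += "\n" + action(alert["labels"])
        enriched.append(message)  # in practice: forward to the Slack/Teams sink
    return enriched
```

Feeding it a minimal Alertmanager-style payload with a KubePodCrashLooping alert would yield one message containing both the alert name and the (fake) pod logs.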
So every automation, every Runbook in Robusta, has three parts. There's a trigger: the condition that triggers this Runbook. In this case, it's the Prometheus alert from the Kube Prometheus stack called KubePodCrashLooping. The action is what we're going to do when that alert arrives, and the action here is to pull in the logs. And the sink is where we send the data. Here we're sending the data to Slack. You can send it to Teams, you can send it, like I said, to Cisco Webex, you can send it to OpsGenie, you can send it all over the place. I believe you can even send it to DataDog. Okay, and the last part: I just wanna say a word about something that's maybe not obvious when you look at this the first time, which is that data is flowing through all of these. This is almost like a data pipeline. If you look at this, we're saying there's a trigger here on the Prometheus alert, when the pod crashes, but data flows from that trigger into the action. The action says go fetch the logs, but it knows which pod to fetch the logs for, because we're taking the metadata from that Prometheus alert, which is there thanks to Prometheus's Kubernetes service discovery, and using it to map that alert to a Kubernetes object, and then we can pull in the logs from that Kubernetes object. So this is actually like a typed pipeline. The data passing through knows that this alert is related to this pod; it knows the Kubernetes object, it has all the Kubernetes context, and then you can very easily pull in the right data. And of course this is customizable. Okay, so this is the part where we all cross our fingers and I really hope the Wi-Fi does not fail me. I always like to say, especially when we go and sponsor events: never sponsor the Wi-Fi. Especially if you're a company that develops networking equipment, never sponsor the Wi-Fi. But we see the same companies doing that again and again. Okay, so I'm gonna run this.
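The trigger/action/sink structure just described, written out as a configuration sketch. The field and action names follow Robusta's documented style as best I recall them, so treat this as illustrative and check the Robusta docs for the exact schema:

```yaml
customPlaybooks:
  - triggers:
      - on_prometheus_alert:
          alert_name: KubePodCrashLooping  # the trigger: this specific alert
    actions:
      - logs_enricher: {}                  # the action: fetch the crashing pod's logs
    sinks:
      - main_slack_sink                    # the sink: a Slack sink defined elsewhere
```

Note how the action takes no pod name: it gets the target Kubernetes object from the alert's labels, which is the typed-pipeline idea from above.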
I'm now triggering an example alert, and I'm gonna jump over here to Slack. I think I actually have something old in there, so let's see. I put in a bunch of stuff so we're not gonna cheat too much; hopefully we won't cheat at all. And okay, here we got it. And this alert is still loading, I think. So in just a moment. Oh, okay, here we can see it. So what we got is the alert there from the Alertmanager, and then Robusta automatically pulled in the logs for the pod that crashed. We see them right there, the logs for the crashing pod, right there in the alert itself. And the thing to emphasize is that this is all configurable with these rules, and we have rules out of the box to do the right thing for most of the common alerts. So it's not just about crashing pods; that just happens to be a very popular use case. Okay, so let's go back to the slides. That's the part where I do the live demo. I wanted to do more live demos, but I was a little bit worried about the Wi-Fi, so I'm glad we got that done. And I want to speak now about three more advanced use cases. So first of all, let's generalize the concept. The concept of runbook automation is not just about handling events that come from Prometheus. That's a very popular use case that we see among our users, but it's not the only one. For example, we had someone say once: I have an ingress that's crucial to the functioning of my cluster, and if anyone touches that ingress, I want to get notified, or if anyone touches that ingress, go and take some action. You can actually do something like that very easily. It's the same concept: you have a trigger, you have an action, you have a sink. The trigger is someone modified that ingress. The action is, okay, maybe pull in some data, or show me the exact diff of what they modified. And the destination is Slack or MS Teams or wherever. So, there's an open source project called Kubewatch.
It's fairly popular, and the original maintainers stepped away, so we took over, and we're now the official maintainers of Kubewatch. And we use Kubewatch under the hood as part of this runbook automation engine. So you can actually track any Kubernetes change, and then you can get notified, and you can even run automated actions when different things happen in your cluster. This is often very useful if you want to monitor things that aren't really time-series based, but rather discrete events. You might say: I don't want to monitor a time series of how many jobs have failed; I actually care about the jobs themselves. A Kubernetes job failing is a discrete event, and you can do runbook automation or notifications around that sort of thing too. So this is very useful. We maintain Kubewatch, and we're actually about to release, I think today or tomorrow, a new version of Kubewatch that fixes a bunch of bugs, the first version to come out in a very long time. So that's exciting. So that's one advanced use case, and I want to speak about two more. The second one is deeper insights. You're not limited to just pulling in data here and there, or connecting to graphs and other stuff. You can actually apply logic, because each one of those automated actions we saw is, under the hood, actually a Python function. So it's very easy to extend this yourself. And you can take all the knowledge that's in your head, or in the head of the best person in the world who knows a specific application, how to monitor it and how to maintain it, and you can turn that knowledge into automated code. An example that I like to give: let's say you get an alert about a problem in Elasticsearch. You know the problem, and you might know that maybe there are slow queries in Elasticsearch.
That's great, but the first thing you're gonna do is go and Google it, right? Because you probably aren't an Elasticsearch expert, and if you are, the same thing will happen for MongoDB and you'll have to go and Google that. The stacks we all deploy nowadays are very wide, and you can't know everything. So one cool thing you can do is take specific issues, take the knowledge about how to handle each issue, have someone from the community who really knows that topic turn it into an automated runbook one time, and now everyone can benefit from it. One of the great things about Kubernetes is you have this convergence where everyone is running on Kubernetes, many people are deploying the same Helm charts, many people are deploying the same software, and suddenly my errors start looking a lot like your errors. So the potential of runbook automation is the potential to take the knowledge of how to handle errors that come up in production and turn it into automated, community-shared knowledge. Here's an example of that, something we include out of the box. If you've ever gotten an alert about CPU throttling in your cluster and you wanna know why you have high CPU throttling, we have coverage out of the box. And we check a few different cases. We check for different known issues. We check whether the issue is the Kubernetes CPU limit, and if so, whether it's safe to remove it. So the potential here is the ability to really capture the understanding of why different errors occur, and then to suggest solutions and even automate the fixes. Speaking of automated fixes, that's the third and last use case that I'll look at. We actually support either complete automations or what we call human-in-the-loop automations.
So you can add rules where there will be a button in the Slack message you get, saying: okay, here's an issue that we identified. You define in advance the fix that you think is the proper fix, and you can either run that fix completely automatically, or you can give people a button in Slack, and they just push that button. As soon as they push it, it goes and runs the automated runbook. So you're running the automated runbook, but you're putting a human in the loop, and you can make it 100% automatic as well. It really depends on your philosophy and how comfortable you feel with that fix. So we have more demos and demo scenarios; you just run kubectl apply on your cluster and then you can see this stuff in action. We have more stuff in Git, and please also feel free to reach out and suggest things. And we have coverage for a bunch of stuff out of the box already: StatefulSet issues, unscheduled pods, CPU overcommit, KubePodNotReady, different node issues, job failures, CPU throttling, file systems running out of disk space, ImagePullBackOff. We have coverage for a bunch of different things, and we're constantly trying to build that out. And this is where I say: please send us your problems. We like hearing about people's problems. The issues here in this list were either contributed by the community, or we contributed them because people came to us and said: I have this issue in my cluster, I don't know why it's happening, I'm busy and I don't have time to look into it, so I'm not gonna go and automate it myself. Please look into this; can you add coverage for that? We do that all the time and we love it. So please give us your issues, and we'll take them, investigate them, and try to add coverage. And I wanna end with one more final story. How am I doing on time? Okay. Okay, so I'll be fine then. Yeah, so I wanna end with one more story about why this matters.
So who here likes Reddit or goes on Reddit regularly? Okay, all of us, yeah, I figured. And does anyone here know the subreddit called "What is this thing"? Okay, so there's a subreddit that I like called "What is this thing," where people post a picture of something they found and ask, you know, what is this thing? And then the internet tries to tell them what that thing is. So here's a post that I saw; I think it's one of the top all-time posts. Someone said: I found this in the crawl space of a house from the 80s, my friend's house, and it was next to these boxes. It was hard to read in the picture, but it says: radioactive material, no person shall remain within one meter of container unnecessarily. And the person who found this thought, you know, when you find a box that says radioactive material, what you do is post it on Reddit, on "What is this thing," and then you go off hiking or something. So this looks really bad, right? Like this doesn't look like a story with a happy ending. And I wanna take this as an analogy for what happens when you get a message in Slack at 3 a.m.: you get a message, there's some issue in production, and it could be a really, really urgent issue. And depending on the industry you work in, like healthcare, we deal with software, but the software that we deal with has real-world consequences for people. So moving on with the story: this looks really bad, and the advice on Reddit was, go to the hospital immediately. And the person posted an update: a three-man team from the state of Utah Radiation Control showed up at my friend's house. That's like the people with the spacesuits in the movies, and they went around, took a bunch of probes, and swabbed stuff. And it turns out it wasn't really such a big deal. The team that came out found nothing but natural trace amounts of radon.
And it turns out the former owner was a watchmaker who used to paint watches so they glowed in the dark, with radium paint. Most people know about this because the people who painted the watches would lick the brushes, and many of them got tongue cancer. But assuming you're not licking it and not actually ingesting it, it's really not a big deal. And I wanna take this now as an example, back to the alerting context. Sometimes there's an issue that looks like a big, big deal, but when you add on a little bit of context, it's really not that bad. We talk a lot about alert fatigue, right? About not wearing people out with too many issues that look serious, until you assume nothing is serious, and then along comes a very serious issue and you ignore it. So the lesson of this story is not that you should ignore issues. The lesson is that you wanna get context as fast as possible, so that if it's a serious issue you can deal with it, and if not, you can just relax and not be under stress. So that's an example of the power of context. And what we're out to do with Robusta, and the reason we're here engaging with these communities, is to add on as much context as possible, with open source runbook automation and with the community knowledge of all the different issues, so that people can sleep better at night, and when you get an alert in Slack, you have the context for what it means and whether it matters or not. So thank you very much for listening. We wanted to do something a little bit different at our booth, so we brought the Kubernetes ship's wheel; we wanted to try to bring Kubernetes into the physical world. I'm in front of a screen all day when I'm not outside in the garden, but we wanted to hold Kubernetes in our hands, and you can too. So please come by the booth, say hi, and tell us about your issues. We love to hear them and we love to add coverage. Thank you very much.