Hello, thank you for having me. It's an absolute honor to be here today. The topic of my presentation is how to sleep better at night and survive on-call with robust automations. Essentially, we're going to talk about how you can take the DevOps philosophy and apply the same principles we use for infrastructure as code to how we handle alerts. But before I get started, I want to very briefly give some background about myself. I've been a developer for more than 15 years, with a lot of open source involved, and I've also done cybersecurity professionally in recent years. I've been blogging for about as long, and recently I've started creating YouTube content. For the past several years, almost everything I do has been around Kubernetes, at Kubernetes startups, both as a developer and in customer-facing roles, and today as co-founder of Robusta. Okay, so I'm going to talk about three topics. The first topic is the history of automation, and specifically the history of DevOps automation. Second, I'm going to explain why you should automate alerts the same way we automate everything else in DevOps. And third, I'm going to speak specifically about Robusta and about automating alerts on Kubernetes. I want to start off with a trip to the past, back 15 years to how I developed software when I got started. I started out working on some WordPress websites, and I wrote some Python code and web applications in Django. Back then, there were essentially a few different stages to how you developed software. First you built the software: I would check out code locally, make my changes, build it locally, and run some tests locally. If everything worked, I would go on to the next stage, the deploy stage. I would upload files onto a virtual host over FTP.
I would run apt-get install, install a bunch of dependencies, and move files into the right place. Now I'm deploying my web application. Next step, I had to configure it. I would go into a web console and set up some DNS records. If I deployed a new version of the application, maybe I would stop the old one and start the new one. I would configure stuff manually, maybe set up some rules for Apache or for Nginx. Next stage: say I did everything well and I was really excited. I took my application, put it online, messaged all my friends on IRC, put a post on Hacker News, and it really blew up. Now everyone is coming to the website, and now I have a problem, because there's a lot of load on my server. So I add another server and deploy another copy. Essentially, I repeat the deploy and configure stages from scratch, and now I have another copy of my application running. Lastly, something goes wrong. Something isn't working right. If I'm lucky, I have some monitoring in place, I see that there's an issue, and now it's time to respond and fix it. So I connect to the server over SSH, start to investigate what's going on, run ps, and fix the issue manually. Almost every single part of this has changed in the past 15 years; it's undergone a fundamental shift. In each case, what we do today is totally, totally different from what we did back when I got started. So I want to show that. What we did is we went through each and every one of these stages and we automated them. For build, I'm no longer just building code locally and running tests on my local machine. We have CI/CD, like Jenkins, GitHub Actions, or CircleCI, and essentially my code is built every single night or after every single commit, the tests run, and all of that happens automatically.
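To make the CI/CD stage concrete, here's a minimal sketch of the kind of GitHub Actions workflow being described. This is a generic example, not from the talk; the `make build` and `make test` commands are hypothetical placeholders for whatever your project actually runs.

```yaml
# .github/workflows/ci.yml - a generic sketch
name: ci
on: [push]          # run on every commit, as described in the talk
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make build   # hypothetical build command
      - run: make test    # tests gate every commit automatically
```

Because the workflow runs on every push, a commit that fails to build or fails its tests simply never reaches the deploy stage.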
So I can't accidentally go right ahead to the deploy stage without running tests, or when something isn't building. I just wouldn't be able to get there, because we have an automated platform and the build stage is now automated. Moving on to the next stage: I no longer deploy my application the way I used to. It used to be you had to upload files with FTP, install your dependencies with apt-get install, and do something different for a CentOS host versus an Ubuntu host versus a Debian host. All of that had issues, because you were really dependent on the machine you were running on. Some machines would have different dependencies, different versions of the software you need. Maybe you're deploying a PHP application and there's PHP 5 on one machine and PHP 4 on another. Don't catch me on the versions; I haven't done PHP in a very long time. But the point is that we no longer deploy software that same way. We deploy with containers. Essentially, by automating this with Docker, I've shifted a big part of the deploy phase to build time. Now I have a Dockerfile, I build a container, and that container is self-contained; it has everything I want inside of it. When it comes time to deploy, I just take that container and put it somewhere else. I'm not dependent on the local machine. It doesn't matter what the dependencies on the local machine are, as long as it has Docker, because my container is self-contained and has everything in it. Every time I commit and do a build in the build stage, that builds my Docker container as well, and deploying is just so much easier: I take a Docker container and get it running on the right machine. Very easy, and this has all been automated with Docker. Now, moving on to the configure stage, we also no longer do this manually. I no longer go and set up DNS records or configure stuff by hand.
We have infrastructure as code, whether it's Kubernetes and YAML manifests, or the underlying infrastructure with tools like Terraform. I run kubectl apply, and the instructions for everything I used to do as a person are now contained in the YAML file or the Terraform file. That file says exactly how to set things up, I just apply it, and everything happens. So there's no longer a manual stage here. Moving on to scale: of course we do auto scaling now, with the HPA and VPA, the Horizontal Pod Autoscaler and the Vertical Pod Autoscaler, or even auto scaling groups on AWS if you're not using Kubernetes. The scale phase is also totally automated, and that's no longer something we do manually. So if we look at each of these first four stages, we used to do things really manually, or we would automate them with in-house, ad hoc solutions that each company invented for itself. Today, for each one of those, we have a platform, a really great piece of open source technology that lets us do it automatically, gives us all these extra features, and is a whole lot more robust. If I move on to the last stage, then the last stage in many ways is still left behind. When alerts occur, and when we respond to those alerts and go investigate them, very often we are still connecting over SSH or running kubectl exec. We're still investigating things much the way we investigated in the past. And that's the area that we're trying to really automate and bring into the modern day with the Robusta open source project. Okay, so moving on, there are a few common themes in all the types of automation we spoke about in the previous slide, from the build phase to the deploy phase to the configure phase to the scale phase, and also in what we're doing with Robusta. There are two common themes here in all of them.
First, everything has become configuration files. Those configuration files are often YAML, and they contain high-level instructions. What I mean by this is that back in the day, if you wanted to automate the build, you had some ad hoc in-house solution: you would run a bunch of scripts, maybe check something out, run ./autoconf and ./configure, and you'd have a script that wraps all of that, all these scripts that you write. You no longer do that. You're just writing a YAML file. Sure, that YAML file has some scripts in it and some stuff contained in it; in the Dockerfile, for example, you're defining how to build things, and if you're building on GitHub Actions there's a YAML file that defines all the stages. So we still have some of those same scripts, but they're wrapped in a configuration file with high-level instructions, not just an ad hoc script. And the reason these configuration files are important is that they run on an underlying engine that adds all these extra features for us. For example, if a GitHub Action fails, I automatically have the logs, I can see exactly where it failed, and I see that in the GitHub user interface. If I'm using a Dockerfile instead of an ad hoc solution for packaging, now there's caching: each layer builds upon the previous layer, and I benefit from Docker layer caching. By moving to these automated platforms, we suddenly get all these extra features. And as a user, it's a whole lot easier. I can write a short config file, often in YAML, and I get things that previously I would have had to write a whole lot of code for. Maybe the best example of that is auto scaling. With auto scaling, I'm just putting one line in a YAML file, or one short YAML file, defining how to scale things. And now if something dies, it'll get brought back up.
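For reference, the kind of short, high-level YAML being described here is a standard Kubernetes HPA manifest. This is a generic sketch (the deployment name `my-app` is a placeholder), not something shown in the talk:

```yaml
# Generic HPA sketch: scale "my-app" between 2 and 10 replicas
# to keep average CPU utilization around 80%.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```

Note how the file only declares the desired behavior; the control loop that watches metrics and resizes the deployment lives entirely in Kubernetes.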
I mean, that's the self-healing aspect, not the auto scaling. But if I change the number in there, or if the CPU load changes, then suddenly things will auto scale. And I don't have to write all the code or scripts that say, if this, then that, do this; I'm just saying what I want to happen, and the underlying engine, Kubernetes, handles all the implementation details. So by giving a configuration file, I can work at a much, much higher level, and there are huge benefits to that. The second part of this, which is equally important, is that when I have a configuration file, I can take that file and stick it in Git. And the big shift that happened here, for each of these phases, is that there used to be knowledge that was inside my head, and it became machine knowledge. If I look at building things: I used to go to a wiki page and read how to build this software, then run a bunch of commands. That no longer happens. The knowledge of how to do it is automated; it happens every time I push a commit, it happens on GitHub Actions. If I want to deploy my software, create a new version of it, I don't have to go run a bunch of commands. It's just a Dockerfile, and Docker is doing it for me. So essentially, in each of these stages we speak about, the knowledge that used to be inside my own head has become knowledge in a Git repository. And that means I can audit that knowledge and share it with my team, and if I leave the company, the knowledge is still there. So the two big themes, the two things we benefit from with automation, are: first, by having these configuration files, we get all this extra high-level functionality without writing a lot of complicated code.
And second, by having these automation engines, we can take a workflow that was previously a human workflow, and now it's a config file in Git. The knowledge has gone out of my head, become a file in Git, become machine knowledge that the machine can execute for me. Okay, so moving on, I want to speak now about how we can apply this to the area of responding to alerts. Essentially, the question is: can we take the same concepts, the same automation principles, and apply them to on-call? Can we apply them to how we handle alerts? Now, alerting itself, the system for defining alerts, that's all very well known. There's great technology out there, and you're probably already using Prometheus, Datadog, Dynatrace, AppDynamics, New Relic, or some other system. These are great alerting systems. Personally, I love Prometheus and Alertmanager; they're great systems for defining alerts. But what happens when those alerts fire? Someone still has to go and investigate. So can we really automate that process of taking the alert, understanding what it means, understanding how to fix it, and maybe even fixing it automatically? What we'd like to achieve are really the three things we achieved in each of the previous examples of automation. One, we want to make it all work faster; we want a faster response to alerts. I was speaking to a company recently, I believe in England, and they said to me, our on-call team has to respond within three minutes, because that's the maximum downtime we're allowed in an entire month. That's the SLA we promise customers. So at a lot of major companies and a lot of teams, you really have to respond fast, and therefore having some automation is really important if you want to meet the SLA agreement you have with your customers.
Two, knowledge sharing. Just like putting stuff in a Dockerfile or a GitHub Action, or configuring with infrastructure as code and YAML files, you can share the knowledge. Knowledge that was previously in your head, like instructions for how to set up a server, is now just a Terraform plan; you just apply that file. So can we take the knowledge inside your head for how you respond to alerts and make that just another file in Git? And lastly, we want to respond to alerts better than we do today. If you look at all the previous examples of automation, then obviously GitHub Actions is way better than an in-house solution with some scripts you wrote yourself, because it's a platform, it's open source, it has lots of features, they keep adding features, and you benefit from those features. So can we do the same thing for alert response? Can we really respond to alerts better by automating the way we do it? Okay, so I'm going to look at three pieces of open source technology you can use to do this. I'm going to speak most about Robusta, the last one, which I work on personally, but I want to mention the other ones too, because they're also great pieces of technology and they're all open source. First is StackStorm. StackStorm is the oldest technology here, so it's very mature, and it's used to automate alert response and to automate other workflows and do automatic remediations. The advantage and the disadvantage of StackStorm is that it's not Kubernetes native. So it will work on everything, but it's not specifically built for Kubernetes. Moving on to the next one, Argo Events. Argo Events absolutely is built for Kubernetes, and it's a great piece of technology. The advantage and disadvantage of Argo Events is that the way Argo works, everything is just a pod or a container or a Kubernetes resource.
So with Argo, you can say: when something happens, go and automate this by running a pod, by running this container. But then you have to supply the details of what that container is. And moving on to Robusta, the advantage and the disadvantage of Robusta is that we're really focused specifically on responding to alerts and remediating them, and we have domain-specific knowledge about specific alerts. So we're a little more specific. We're not a general-purpose framework like the others, but for the use case of responding to alerts, we've tried to really build it dedicated to that use case. All three of these are great pieces of technology. And the Robusta philosophy really is three things. One, Kubernetes native: it's built specifically for Kubernetes. It can work elsewhere, but the focus is on Kubernetes. Two, batteries included: when you install Robusta, it should just work out of the box. If you send it your Prometheus alerts, you should get value even before you configure anything, so we have built-in knowledge about common alerts. And three, it should be easy to use. We're trying to save you time and help you automate stuff, so if you had to invest a huge amount of time setting it up, it really wouldn't be worth it for you. And there are three core concepts in how Robusta works under the hood: triggers, actions, and sinks. Conceptually, a trigger is something that occurs in your Kubernetes cluster, or outside of it, that kicks off an automated workflow. An example trigger would be: a Prometheus alert fired, there's something wrong with your cluster, and that triggers a workflow. Moving on to the next part, an action is what you do when that trigger fires. An action can gather extra information: you can take that alert and investigate why it happened. An action can also fix stuff.
You can say: okay, I know what this problem is, it's a known trigger, so I'm going to go and automatically apply a fix. There are many different types of actions. And then there's the sink; that's the destination. That's where the data is sent, where you get a notification about it. That could be a Slack channel, for example. I'll give some examples now of each of these concepts. Triggers could be Prometheus alerts, or Kubernetes changes, anything that happens in your cluster, for example a new deployment being rolled out. And there can also be manual triggers, where you say: right now, I want to trigger an automated workflow. But there are many, many more triggers; I'm just giving three examples here. For actions, we have something like 70 actions built in, and you can add your own. Actions could be: go run a profiler, get a memory dump, fetch the logs for me, go fetch a graph from Grafana and send it to me. And there are many, many more actions. Moving on to sinks: we support Slack, MS Teams, Opsgenie, Telegram, we can send stuff to a webhook, you can send stuff to a Kafka topic, lots and lots of different sinks. And if you need something that doesn't exist, just open an issue on GitHub and we'll get around to it very fast and add it for you. The last comment I want to make about the architecture is that everything is strongly typed. If you have a trigger with a Prometheus alert that kicks off an automated workflow, then in the Robusta system we have metadata about it. We know this trigger is an alert about a pod, or an alert about a node, or a trigger that occurred because you rolled out a new Kubernetes deployment. We have data about what triggered the automated workflow, and that data passes on to the next phase.
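The typed trigger-to-action-to-sink flow can be illustrated with a toy pipeline in plain Python. This is purely illustrative and is not Robusta's actual API; all the names here (`PodAlert`, `fetch_logs`, `slack_sink`, `run_workflow`) are invented for the sketch.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PodAlert:
    """Typed trigger payload: the workflow knows which pod fired it."""
    alert_name: str
    pod_name: str
    namespace: str

def fetch_logs(alert: PodAlert) -> str:
    # A real action would call the Kubernetes API for the pod named
    # in the alert; here we fake the result.
    return f"logs for {alert.namespace}/{alert.pod_name}"

def slack_sink(message: str) -> None:
    # A real sink would post to a Slack channel; here we just print.
    print(f"[slack] {message}")

def run_workflow(alert: PodAlert,
                 action: Callable[[PodAlert], str],
                 sink: Callable[[str], None]) -> None:
    # The typed alert flows through the pipeline: the action knows
    # which pod to act on, the sink knows where to send the result.
    enrichment = action(alert)
    sink(f"{alert.alert_name}: {enrichment}")

run_workflow(
    PodAlert("KubePodCrashLooping", "web-7f9c", "default"),
    fetch_logs,
    slack_sink,
)
```

The point of the types is exactly what the talk describes next: because the trigger payload carries the pod's identity, every downstream action and sink can act on the right object without being told explicitly.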
And this is really, really important, because it means: let's say you had a trigger that was a Prometheus alert saying a node is running out of disk space. We know that this trigger is about a specific node. So when it passes on to the action, you can have actions like "go fetch a graph of the disk usage," and the action will automatically know how to take the right node from the trigger, and which Kubernetes object to apply the action to. Or if you have a Prometheus alert that a certain pod is crashing, that's the trigger, it passes to the action, and the action receives that specific Kubernetes pod. So you can then have an action like "go fetch the logs," and the action will know which pod it should fetch the logs from. And then from the action, that same typed data passes on to the sink, and that lets you do cool stuff. For example, you can say: if there's an alert on a specific namespace, send it to a specific Slack channel where the right developers are. You can do smart routing there, because you know all that metadata about the Kubernetes object involved in this trigger that went to the action and eventually reached the sink. Now, I want to give a specific example, because I've really been speaking about architecture at a high level. So here's one concrete example. You had a crashing pod in your cluster, and the automated response is a very simple one: go fetch the logs and show them to me, so I can see why it crashed. This is essentially the world's simplest example of automating your alert response. Before this, you would get a message in Slack that says, oh, a pod crashed. Then you would go run kubectl logs yourself and fetch the logs, and now we're just automating that process.
So we're taking one step out of it. Instead of getting a message in Slack that says your pod crashed and then going and fetching the logs manually, the logs are fetched automatically for you, and you get a message in Slack with that extra information. You no longer need to open the command line, connect to the cluster, and fetch the logs; all the data is right there in the alert itself. This is a really simple example, but it's one that I think demonstrates the general concept well. Okay, I'm going to pause for a moment; I see there's a question already in the channel. Someone has asked: can you please share which tool has been used for the slides? The answer is I've been using Canva. Canva is an excellent piece of software, it's a SaaS platform, I have the paid version, and I use it to create the slides. And if there are other questions, I'm going to answer more questions later on, but please, I love it when people ask questions. I love audience participation; it makes me feel a lot better as a speaker. So please feel free to write your questions in the chat. Okay, moving on. We saw one specific example of an automated workflow, and now I want to show how this automated workflow was written behind the scenes and how it actually works. If you want to configure a workflow like this, where every time a pod crashes you get a message in Slack with the logs for the crashing pod, how would you actually go and configure that? It's really easy; here's the demo. Here I have those three parts I mentioned earlier: triggers, actions, and sinks. The trigger is a Prometheus alert called KubePodCrashLooping. If you don't have a Prometheus alert like that, don't worry; when you install Robusta, it can give you all those default alerts. Two, the action is the logs enricher.
We're going to take this alert and enrich it with the logs. And you might be thinking, the logs of which pod? The logs of the pod that crashed, of course. We get the data from the alert, and then the actions automatically run on the relevant pod. And the destination we're sending it to is Slack. There's a whole lot more you can do here. By default, for example, this is rate limited, so if the same deployment is crashing about a thousand pods, you're only going to get one notification every 60 minutes, and you can configure things like that. So there actually are a whole lot more options here, but this is the general concept. Moving on, I'm going to show a few more examples, from different categories. The first category is investigating known alerts. When you install Robusta, I said batteries included: out of the box, Robusta includes remediations, investigations, and automated workflows for your common alerts. You just add one webhook to Prometheus, the alerts get to Robusta, you can continue getting your old alerts, and you can send the Robusta alerts to a new Slack channel. And now, if your alerts are in our library of known alerts, they will come with extra information. The first example of that is CPUThrottlingHigh. You have high CPU throttling; that's a known alert. Unfortunately, when there's high CPU throttling, it's not always immediately obvious how to fix it. You could fix it in multiple ways. Maybe the issue is that your pod has an incorrect CPU request. Maybe the issue is that you have a CPU limit that's configured too low. Maybe the issue really is that you don't have enough CPU and you need to add more. Maybe there's something else on the node that's interfering. So there are a lot of different reasons why that alert can fire.
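The crash-looping demo just described maps to a short piece of Robusta configuration. The following is a sketch of what such a playbook might look like; the exact key names and the sink name `main_slack_sink` are approximations, so check the Robusta docs for your version:

```yaml
# Sketch of a Robusta custom playbook (key names approximate)
customPlaybooks:
  - triggers:
      - on_prometheus_alert:
          alert_name: KubePodCrashLooping   # the trigger
    actions:
      - logs_enricher: {}                   # the action
    sinks:
      - main_slack_sink                     # the sink (Slack)
```

The trigger carries the crashing pod's identity, so the enricher needs no extra parameters to know which pod's logs to fetch.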
And essentially the enricher, the automation for this specific alert, runs a decision tree and figures out why the alert is occurring. Moving on to the next example. Let's say you have a memory leak and your pod gets OOMKilled. Now, if you have a memory leak and your pod gets OOMKilled, it's going to come back up and get OOMKilled again a few minutes later, or an hour later, depending on how fast you're leaking memory. That's really inconvenient. And the obvious thing you're going to want to know is: why is this pod leaking memory? What you would do normally, if you were running a normal server back in the old days, is SSH into that server, run jmap or jstack or some debugging tool, and grab a memory dump. Then you would see what's using up the memory, and you could take two dumps, say ten minutes apart, and diff them to see what's allocated but not freed. We actually have stuff like that built into Robusta; it's just one more automated action. There's an automated action to grab a memory dump from a Java application, and one to grab a memory dump from a Python application. So here's an example with Java, where you can see the top objects that were allocated in the application. And you can run this automatically, for example, every time your pod is about to be OOMKilled. That's a really cool thing you can do; you get a message in Slack with the data. Now I see there's another question here: can we also configure an event for a specific log string generated by any pod with Robusta? For example, a pod is running, but we're interested in a specific keyword or error code generated by that application. Yes, you can. There are two ways you can do this.
The first way is with Elasticsearch: if you're using Elasticsearch, you can define an Elasticsearch monitor, and Robusta has built-in support for triggering on those Elasticsearch monitors. That's one way. The second way is not in GA yet, it's in beta, and I don't know if we have it public in our GitHub repository, but please message me if you need it. You can actually use built-in functionality in Robusta itself, and add a trigger within Robusta for a specific keyword. I don't believe that's in GA yet, so if you want it, message me and I'd love to have you beta test it. Moving on to another example: we're going to look at another known alert that you can automatically investigate with Robusta. Again, batteries included: you just install Robusta, send it your Prometheus alerts, and this will all happen out of the box. You don't need to configure anything, but you can also write your own workflows for your own alerts, of course. Let's say you have the common Prometheus alert NodeFilesystemSpaceFillingUp; in other words, you're running low on disk space on the node. By default, an automated workflow will run to investigate it, and it will tell you what pods are using the space, how much disk space is being used, how much is being used by the pods, and how much is being used by the node itself, by the host. Just another example of how you can automatically investigate known alerts. Moving on to the next topic, you can also use Robusta for remediation. Let's say you're auto scaling, and your autoscaler reached the maximum number of replicas, and it's 3 a.m., and there's lots of load on your servers, and you just want to go back to sleep and do a proper fix in the morning. You can get a message in Slack that says: okay, you reached the maximum for the HPA. Would you like to raise it by 30%?
There's just a button there in Slack, and you push that button and it does an automatic remediation. You can also run these remediations entirely automatically, without even asking in Slack. We typically recommend that you do ask the person in Slack to confirm, but you can make it entirely automatic. Moving on, here's another example. This one isn't quite alert response, but it also really helps with investigating stuff. Robusta has a lot of features around change tracking. You can kick off these automated workflows not just when there's an alert, but also on any change that happens within the Kubernetes cluster. For example, here I've set up an automated workflow so that every time you deploy a new version of your application, we go to your Grafana and add an annotation there, the line you see here, showing that at this exact point in time a new version of your application was deployed. That's really useful, because then you can see a correlation between issues that occur, like CPU going up, and new deployments. To configure that, I'll show you the YAML. Here the trigger is just on deployment update, so it fires when you update a deployment, and the action is "add deployment lines to Grafana." It's an action that goes to Grafana and adds deployment lines, and you specify the dashboard and the API key here. So you really see here the power of this as a general-purpose automation engine, but one that's really, really dedicated to alerting use cases. And the last topic I want to speak about is what's going on under the hood. It's great that I've shown you can fetch the logs, add an annotation to Grafana, and do all these things, but what if you want to do something that isn't one of our 70 built-in actions? It's really easy to write your own actions. You just do it in Python; under the hood, everything is just Python functions.
So here's an example of the logs enricher that you saw earlier. It's just a Python function. You get this event, the event that comes from the trigger, so you get the data with the pod; it's a pod event. In Python you can say event.get_pod(), and once you have the pod, you call a function like pod.get_logs(). We have a rich Python API; it's ultimately based on the Kubernetes Python API client, with some higher-level stuff on top. We make it really easy to write these actions yourself, so you can easily add your own stuff. I know, for example, I believe the folks from WhiteHat Jr have written a bunch of actions internally to restart deployments in certain situations where they're unhealthy. I know there are a bunch of other cases where companies have written their own actions and opened PRs to add those back to Robusta. So we're seeing a lot of interest, and a lot of people writing their own actions around this, especially at larger companies. Okay, so that's it. I'll pause here for any more questions. I also want to say, I really love hearing from people. So if you're listening to this talk and you liked it, or if you didn't like it, please reach out and let me know what you liked or didn't like. I really love to hear from people; it's the most satisfying thing about speaking. I'm very active on Twitter and LinkedIn, so please add me on LinkedIn, I approve everyone, feel free to follow me on Twitter, and please reach out. I'd love to hear from people. And if there are any questions, I'll stop here. Hey, Nathan, that was a really insightful session. It was really informative, and I think most of the crowd liked your slides, very colorful and very engaging. And I think people have already visited Robusta.dev and they're liking the UI as well. I think you've got a compliment saying it's particularly good, from Shakaib Arif.
Thanks a lot, and thanks for being open to accepting invitations on LinkedIn and Twitter. I think that'll be really helpful for people who'd like to collaborate with you and maybe work with you. Yep. If I have another three seconds, maybe I'll just show Shakaib what it actually looks like with the slides. Do we have time for that? How much time is it going to take, Nathan? Ten seconds. I'll show it, go ahead. Yeah, I'll just show it because I know people like it. It's really easy to do these slides. I'm using the pro version of Canva, and they don't pay me or anything, I actually pay them. But you can go in here and search for, say, sleep. Let's do that in Hebrew. I can search in here for sleep, and it'll find me graphics of people sleeping, and I can put those in. So that's really nice and easy to use, and I recommend people try it. Cool, cool. Yeah, I'm really sure people are going to have some questions; if there are any, I think Nathan is going to be available on the Slack channel for answers. So yeah, thanks a lot, Nathan, and brilliant. Absolutely a pleasure, and thank you for having me. Thanks, thanks a lot, bye.