Hello everybody and welcome to our next talk. Cedric is a senior software engineer at a big company who will introduce us to some principles for not getting woken up by totally unnecessary emergency calls at 3 o'clock in the morning. Please give a warm round of applause to Cedric.

Hey, I'm Cedric, and welcome to my talk: understanding alerting and how to come up with a good enough alerting strategy. So first up, about me. I work at a big tech company, and I'm sure you'll never guess which one. I do a lot of analog photography, and while everyone else was baking bread during lockdown, I started training Brazilian Jiu-Jitsu. So don't fight me.

Okay, today I'm going to talk a lot about alerts and incidents, so we very quickly need to distinguish between what's an alert and what's an incident, so I'm not confusing anyone. First up, an alert is basically an event generated by your monitoring system. Every time your monitoring system thinks, "oh, that's interesting, let's get someone to look at it," that's an alert. But what you as the operator receive is the incident. An incident basically always starts with an alert, or at least most of the time it starts with an alert. Then you look at it, you work on it, and you try to resolve it, hopefully as fast as possible.

So let's start with a question: who thinks that incidents are a good thing? Let's see a show of hands. Who thinks incidents are a good thing? That's not that many. Interesting.

Okay, to actually understand whether incidents are a good thing, I'm going to introduce you to something called the above-the-line/below-the-line framework. Let's start with our system architecture. An ordinary large, complex system consists of several components: we have multiple servers running behind layers and layers of load balancers and routers, we have databases everywhere, just our typical system. And that's what actually creates value for our company. But in order to build, operate, and run that system, we need a bit of supporting infrastructure around it: our Git servers, our CI/CD pipelines, our monitoring system. Even the Jira you use to track your work items is supporting infrastructure, and someone needs to maintain it.

And then we have lots of people interacting with those systems. We have the devs, we have the SREs, we have program managers doing whatever program managers do. But they all interact in some way or another with at least some parts of the system. And the worst thing about this is that everyone sees it very differently. Just ask two different developers in your company to explain the system design of your software, and I'm pretty sure you'll end up with three different architectures. That's true for everyone: PMs have a very different view of your system than the devs do, and the SREs probably have a very different understanding of how it works. We're all quite familiar with that, right? That's because everyone has a very different mental model.

If we combine all of that, we get a pretty complex-looking diagram, but it's actually quite simple. We have our system that actually generates the value, we have more systems around it to support it, and then we have the people, which is where the actual work is done. You can't really touch the system. You can't see the code run. You can't see your database actually inserting something. You have a mental model of how it works.
And if you want to update something, if you want to change your system, you have to do it by interacting with a representation layer. That's what the green line is: the green line is our line of representation. You're never going to see a server being deployed, but you know that it is deployed because you interacted with some system that deploys it. And you have a monitoring system that gives you some view into the system, some pane of glass that shows you how it thinks the system looks. But it's never going to be the entire system. No one understands the entire system; everyone just has a tiny fraction of the system as a mental model. And if you want to update the system, you have to update your mental model, write it down in code, put it in infrastructure as code, write YAML files, whatever. You interact with some sort of representation of your system. And every time your mental model is flawed or stale, every time your mental model isn't correct, it will inevitably cause an incident somewhere.

So what are incidents? Incidents are the delta between how we think the system works and how it actually works. Or, more precisely, between how it fails and how we think it might fail, because based on how we think it might fail, we design around that.

So let me ask again: are incidents a good thing? Who thinks incidents are a good thing now? Yeah, that's a lot more. So yes, incidents are in fact a good thing, because we need them to update our mental models. If our system doesn't generate incidents, it's very hard to keep track of how the system actually works and keep our mental models up to date.

But are all incidents the same? I don't think so, and most of the incident management tools out there agree with me. Usually you've got these five severities of incidents. They might be named a little differently depending on the system you're using, but you have those five broad categories: the alert as a record, the alert as a notification, the alert as a page, the alert as a page where it's actually serious, and the one where everything that could go wrong actually went wrong. We never want to end up in the SEV0 category, because that's just awful. And we can divide those five severity categories into "we notify someone", which covers the lower two severities, and "we actually call someone at 3 a.m.". And I hope no one really wants to get called at 3 a.m. It's just not nice.

You can imagine the severities like this: a SEV4 is roughly the same as a recruiter reaching out to you on LinkedIn. It happens a lot, and you never really look at them. But if you're actually looking for a new job, it might come in quite handy to go back to your records and see that five recruiters reached out to you in the past two days; maybe there's something to them. A SEV3 is roughly the same as receiving a message on Signal: you're not going to answer it right away, but you answer it once you get to it. A SEV2 is someone calling you, you pick up, and they say, okay, there's definitely something going on. A SEV1 is a whole different story: that's someone screaming at you on the phone saying, oh God, this is so important, something really, really bad happened. Well, we haven't really talked about the SEV0 category yet. I would argue that no automated system should ever create a SEV0 automatically. If your monitoring system detects some anomaly in your system, the highest it can go is a SEV1.
And if you want to have a step up from that, if something really, really bad happened, then you have to manually raise a SEV0 incident or promote a SEV1 to a SEV0. And trust me, you don't want to be in a SEV0, because that basically means your C-level executives are already on calls with PR agencies and lawyers, and something really bad has happened.

So let's talk about how we can actually detect those incidents. How do we detect that something went wrong? Basically, we do this using service level objectives, and every time a service level objective is breached, we alert. Easy as that. But what actually is a service level objective? If we look at what Google wrote in their SRE book, it says a service level objective is a target value or a range of values for a service level that is measured by a service level indicator. Okay, but what's a service level indicator? Same book: a service level indicator is a carefully defined quantitative measure of some aspect of the level of service that is provided. Still not much better. Let's make an example.

I'm standing here, I'm doing a presentation, and it would be great if I'm actually here and doing this presentation, right? So I'm checking whether my heart rate is somewhere between certain values. My heart rate is the service level indicator in this example, and the entire statement is the service level objective. As long as that service level objective holds, I'm probably alive. I hope. Okay, but now I get a notification from my Apple Watch and it says: your heart rate is elevated, it's probably above 175, that's a bad thing. I mean, I'm still here, I'm still speaking, so it can't be that bad, right? I'm doing a presentation and I'm probably quite nervous about it, so a high heart rate is not that bad. It's just a vital sign, it's just diagnostic data. And there's a rule of thumb that you never, ever alert or call someone based on diagnostic data. Instead, you alert on actual symptoms. I mean, if I trip over that wire here and fall off the stage, my heart rate is probably still elevated, right? But that wouldn't be so good for my talk.

So, another set of examples, since we established that wasn't a good one: I'm present, I'm speaking, also very good, and I'm not speaking too fast or too slow. These look like better SLOs, but I think they could still be improved a bit. There's a certain art, when you talk about site reliability engineering, to coming up with actually good SLOs. And there is a general formula that shouldn't surprise anyone: you just divide the number of good events by the number of valid events and multiply by 100. That way you get an even percentage across all of your service level objectives; all your SLIs are percentages, and then you can alert on whether the percentage is in some range or not.

But what do a good event and a valid event mean? Well, I think that's very dependent on the service you're providing, so let's have a look at an example. Here we are looking at an authentication and identity system. For an identity system, it's quite valid to return HTTP 200, 401, 403, just general response codes that we would consider good events, because it would be nice if our identity system can tell someone: no, you're not allowed to do that. And the valid requests are all the requests in the 200, 400, and 500 ranges, because if something in the service goes wrong, we'll probably throw a 500 server error. That's still a valid event, but it's definitely not a good one. In this example we completely ignored the entire 300 range, and that's because we do not care about "moved permanently". No one cares about it; it's not an indicator of whether something good or something bad happened, so we can safely ignore stuff like HTTP 301 Moved Permanently. But in another example, if we look at a redirection service like Bitly or TinyURL or something like that, those 300 responses would be considered good events, because that's basically the service we're providing. So there, again, the good requests are all the 200 and 300 ones, and the valid requests are basically everything else.
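To make the good-over-valid formula concrete, here is a minimal sketch of what the identity-service SLI could look like as Prometheus recording and alerting rules, since the talk uses Prometheus as its example stack. This is not taken from the talk: the metric name http_requests_total, the job label, the 99.9 target, and the rule names are assumptions you would replace with your own.

```yaml
# Minimal sketch (assumed metric and label names) of the good/valid SLI formula.
groups:
  - name: identity-slo
    rules:
      # Good events: 2xx plus the "you are not allowed to do that" answers (401, 403).
      # Valid events: everything in the 2xx, 4xx and 5xx ranges; 3xx is ignored here.
      - record: sli:identity_availability:ratio
        expr: |
          100 *
          sum(rate(http_requests_total{job="identity", code=~"2..|401|403"}[5m]))
          /
          sum(rate(http_requests_total{job="identity", code=~"2..|4..|5.."}[5m]))
      # Alert when the SLI drops below the objective. Note the severity label:
      # as argued above, the monitoring system never raises anything above a SEV1 on its own.
      - alert: IdentityAvailabilityBelowSLO
        expr: sli:identity_availability:ratio < 99.9
        for: 5m
        labels:
          severity: sev2
        annotations:
          summary: "Identity service availability is below its SLO"
```

The severity label attached here is what the routing layer discussed later in the talk can key on.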
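One common way to get that outside-in, customer-centric signal is synthetic probing, for example with the Prometheus blackbox_exporter. The talk doesn't name a specific tool, so treat this as a sketch under assumptions: the module name, the probed endpoint, and the exporter address are placeholders.

```yaml
# blackbox.yml (sketch): a probe module that only treats 2xx answers as success.
modules:
  http_2xx:
    prober: http
    timeout: 5s

# prometheus.yml (excerpt, sketch): scrape the exporter so every probed target
# produces a probe_success series, which can become a customer-centric SLI,
# e.g. avg_over_time(probe_success[5m]) * 100.
scrape_configs:
  - job_name: blackbox
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://login.example.com/healthz    # placeholder endpoint
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115      # placeholder exporter address
```

The point of probing from the outside is that the check takes the same path a customer would, so a dead region shows up here even when every individual component metric still looks healthy.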
Okay, so we've covered how we can detect incidents and which severities we alert at. But how does the alert actually get to us engineers? How does an alert reach me at 3 a.m.? How does it wake me up at 3 a.m.? That's what's commonly referred to as alert routing. First of all, I'm going to show you how not to do it. Trust me, do yourself a favor and just don't use Grafana managed alerts. It's the worst kind of alerting. And I have to excuse myself: I use Prometheus, Grafana, and PagerDuty as examples here, but you can replace them with any tool you want; it's very generic. So: just don't.

A better option would be: let's just use the Prometheus Alertmanager, and Alertmanager notifies me on Slack. I mean, it works. I've seen a lot of systems work like that, but Slack can't really wake me up at 3 a.m. So what's usually done, and this is probably the most common type of alert routing that I've seen, is that Alertmanager decides: okay, it's a SEV2, I'm going to send the alert to PagerDuty, and PagerDuty gives me a call. If it's something lower than a SEV2, just route it to Slack and someone will look at it. But there's a problem: I as an engineer have to watch two different systems. And there's also quite a disconnect about who's actually working on an incident. I don't see it from the Slack status, I don't really see it in Alertmanager; I have to go to PagerDuty to see who's working on it.

An improvement to that usually looks something like this: Prometheus always routes to PagerDuty, and PagerDuty routes to Slack. That way we can have annotated Slack messages that say: okay, I acknowledged the incident, I'm working on it, and it shows up in the Slack channel. But it's still not ideal, because I have to produce SLA reports once a quarter. How do I do that? I don't want to go through Slack. I don't want to search 110 incidents on Slack; no one wants to do that. And PagerDuty is not much better: all I can see in PagerDuty is, okay, it sent 10 calls in the past 24 hours and 235 calls in the past month. That's not going to help me much.

So this next setup looks a little more complicated, but it's actually not; it just looks complicated because it's more centralized. Prometheus should always route your alert to a central alerting ticket system. It could be the Jira you're already using for tracking development work, or it could be a fancy incident management system like FireHydrant, but the point is that you have all your incidents in a central ticket system that you can query whenever you like. You can generate executive reports, you can generate quarterly SLA reports, all that kind of stuff, right from the ticket system. And the ticket system makes the actual distinction: is this alert a SEV4, just an alert as a record? Then no one will look at it anyway, unless something bad happened and you want to check whether there's a SEV4 lurking around somewhere; then people can go to FireHydrant directly. Is it a SEV3? Fine, we notify someone on Slack who cares. Is it a SEV2, SEV1, SEV0, something like that? Route it to PagerDuty, and PagerDuty wakes someone up. And when I, as the engineer on call, get paged by PagerDuty, I know exactly where to go: I just go to my ticket system and pick the incident that's right at the top. I don't have to rely on the SMS PagerDuty sends me, those 130 characters where no incident description whatsoever fits. I can query it, I see my incident, I can work on it. And maybe I notice that this isn't an incident for my team, it's just not my service, it's a problem with the underlying storage or something else. Then I can just reassign the ticket, and as soon as I reassign it, PagerDuty goes off and calls the person from the team that's actually responsible. So that's way better, and it gives you the ability to query.
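As a minimal sketch of the severity-based split described above as "the most common type of alert routing", here is what it could look like as an Alertmanager configuration. None of this is from the talk: the receiver names, the PagerDuty integration key, and the Slack webhook are placeholders, and the matchers syntax assumes a reasonably recent Alertmanager.

```yaml
# Alertmanager routing sketch: SEV0-SEV2 page someone, everything else just notifies.
route:
  receiver: slack-notifications        # default: lower severities only notify
  routes:
    - matchers: ['severity =~ "sev0|sev1|sev2"']
      receiver: pagerduty-oncall       # high severities wake someone up

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>       # placeholder
  - name: slack-notifications
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX  # placeholder webhook
        channel: "#alerts"
```

In the centralized setup the speaker prefers, the split would not live in Alertmanager at all: you would point a single webhook receiver at the ticket system and let that system decide whether to page, notify, or just record.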
All right, I'm actually much faster than I expected. That's 21 minutes, and I thought I couldn't make it in an hour. In my defense, I tried it yesterday and it took me 45 minutes, so I probably spoke very fast; sorry about that.

So, just a quick recap. We talked about how incidents are actually a good thing. Don't be afraid of incidents. In my job as a resiliency engineer I see so many people who are just horribly afraid of an incident. Interviewing people about incidents is my day job: I interview people about what went wrong in their incident, and they're already freaked out when I reach out to them. They just don't like incidents; it's very stressful for them. But don't be. Incidents are a good thing. You can learn from them. Embrace them, don't demonize them.

Categorize your incidents carefully, because, trust me, you don't want to be woken up at 3 a.m. just because a disk is running out of space. There was a very good talk at the last GPN from Momo about that kind of thing. Someone can look at that during business hours, but I don't want to be woken up at 3 a.m. for it.

Define your SLOs carefully and refine them, maybe on a monthly basis, maybe on a quarterly basis, depending on how early you are in working with SLOs. Do monthly or quarterly refinements of your indicators and check whether your indicators are still valid, because things change. I worked with systems where, out of nowhere, we suddenly had a spike of expiring certificates, and we realized we needed a service level indicator that checks the expiry of X.509 certificates. So review them and refine them.

And one last piece of advice: just don't use Grafana managed alerts. Maybe I'm biased because I had to work with them, but just don't. Use Alertmanager. All right, thank you. You can find me on Twitter, on LinkedIn, on GitHub, wherever. Just reach out to me, and I guess it's time for Q&A now. If you have any questions, hands up.

Yeah, I'm just wondering about the Grafana alerts. I'm still wondering why they're so horrible.

So, Grafana managed alerts. I mean, they work; you can do everything you could do with Alertmanager alerts. My personal problem with them is that you have to configure them in the UI. I hate it. Seriously, my arm hurts from using the mouse. I don't want to do that. I want to write the alert definition as a YAML document. It's just easier: I can check it into Git, I can see if someone modifies it, I can see exactly when and why it was modified and by whom. Did that answer your question? Awesome.

My question is about the SLO-based alerting. You have a lot of high-traffic systems, but usually you also have systems that get way lower traffic, for example some small management service where you create data for another high-traffic service. There it usually doesn't work that well, because if the service doesn't get many requests, a single 500 already puts your error rate at 100%. What would you recommend for alerting here?

Good question. It's obviously much harder if you're in a low-traffic environment. Maybe just increase traffic. No, sorry. We can follow up on this later, maybe over a Mate or something, but I don't have a solution for that off the top of my head.

Do you have any examples of a SEV1 or SEV2 that woke you up at 3 a.m. that you can talk about?

I have to mentally check my NDAs. The type of systems I usually work with are cloud products, so it's hard to talk about them without revealing any PII. Stuff that woke me up at 3 a.m. was things like an entire region going offline for no apparent reason, and we didn't really know what was going on. The Fiber Finder 3000, in yellow, had just dug a hole and ripped out an entire fiber cable, and everything went dark. We lost the entire region. It was bad; it took a month to restore.

So you said it's important to learn from incidents.
Well, you mentioned that you call people. So what does your process for learning from incidents look like? You do have postmortems, right?

Yes, and one term that I hear a lot is blameless postmortems: you have to do blameless postmortems. But what blameless usually means is sanctionless. Because I can safely say: okay, I messed up, I pulled the wrong cable, I applied the wrong configuration, I applied the wrong mental model. That's still taking blame, but you're not facing any sanctions for it. And that's a very important step: you have to be sanctionless. Don't punish anyone for making a mistake; it's just human error. The way we work is that we follow up on incidents and talk to people: hey, can you give us your view of what happened? What did you do to resolve it? Then we compile that into a narrative, we try to teach that narrative to everyone involved, and we try to find themes that occurred during the incident and then systematically eliminate those themes. So it's not so much that we take one incident, focus only on that one incident, and make sure that this exact incident never happens again; we try to work on the themes that occurred during the incident. We try to extract themes, we try to collate themes across multiple incidents, and then figure out: okay, we have ten incidents and all of them involved software stack X and Y, so we probably have to do something about X and Y. Does that help? I'm not sure if I answered it. Okay, thanks. Any more questions?

Do you do test incidents and test alerts? Could you describe them a bit?

Yes, we do quarterly drills. I can't really elaborate much on them, but it's basically chaos-monkey style: you go into the data center, rip some cables out, and see what happens. I can't go into detail, sorry.

I was wondering, I'm working more with electrical engineering and some scientific stuff, where you also have to monitor things like temperature, flow rate, things like that. I was wondering whether a similar model of alerting would be useful there for long-running things, or whether you have any experience with that?

I don't have any experience doing that, sorry. I work with clouds; I don't know much about electrical systems. Sorry.

I guess you also need an IT service management concept or framework like ITIL. Is that still valid, with all its incident management, problem management, capacity management, asset management?

Yes and no. I very much dislike ITIL as a concept; it's just way too bloated. I don't like it. Some concepts in ITIL are very good, like the asset management part. It's crucial to know the dependencies of a service: if my storage system goes down, I want an immediate graph of which services depend on my storage service, so that I can inform them and inform my customers who depend on it. But if we stick to what ITIL specifically prescribes, I think it's way too bloated and way too formal, and I think it's written by people who don't know how IT systems work, at least modern IT systems. It might have been a good thing for the 80s, but we're not in the 80s anymore. Sorry, I hope I didn't hurt any feelings. Okay, yes, sounds very good. Very open to that.

So you made the point that incidents are good for building good mental models. Can you think of other ways?

Yeah, brownies. Or whiteboarding, whatever. Yeah, whiteboarding. A lot of brown bag sessions, probably better known as lunch and learn.
You just have lunch together and teach your peers something that you know but think they don't. Just go to lunch with your co-workers and talk about how you think the system works. Go to your senior engineers, go to your principal engineers, let them explain the system to you. Another very good opportunity to learn is when a new hire starts on your team, and it's especially good if a junior new hire starts on your team, because they don't know what you're talking about. You have to explain everything very, very carefully to them, and then you might end up learning things you didn't realize before, because you have to go into so much detail. Yeah, just some examples.

So, thanks for your talk, Cedric, and maybe see you around or at your next talk. Awesome, thank you so much.