Okay, so let's turn this into a talk about anti-patterns. It's often not what we forget to alert on that hurts us in terms of operational sanity; the stressors talked about earlier today were a great example of that. It's often what we're alerting on that we shouldn't be. So earlier today in that talk I know we had about half ops people. How many of you would consider yourself on call for a service of some sort? And how many of those are on call at times when you would normally be asleep, or should normally be asleep? Okay, so most of you. I like sleep. We all like sleep. This talk is about alerts. There's a near-infinite number of ways alerts can be bad, but really there's only a couple, and they all stem from one core thing.

Obsolete alerts. "Okay, this thing is down because we turned it down a year ago." It might seem a little obvious, but it's incredibly common for sites not to have good turn-down procedures for systems or processes, or for the "oh, we've upgraded our service and we don't need this auxiliary server any more" cases. Making sure you actually turn stuff down when you're meant to is really helpful (a simple cross-check, sketched below, can catch the stragglers), because quite often we've seen "the obsolete box is still there and we don't use it, but who cares?" Then a year later it blows a power supply and now it's alerting. It shouldn't even have been there to alert me. Or "this thing has a bug", which is interesting, because that bug was fixed a couple of years ago when we upgraded. I mean, sure, regressions happen, but really? Even if they're not firing, these are really bad alerts, because a new member of your team, someone who comes along or transfers in, won't have the context to instantly know this is a bad alert that can be deleted when they're reviewing configuration. Your alerting configuration should already be in a revision control system of some kind. Subversion, RCS, Git, CVS, it doesn't really matter, as long as you have the history and you can always revert, you can always pull back if you ever need to. Delete, worry about it later.
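To sketch that cross-check: assuming a plain-text inventory of live hosts and alert definitions that name one target host per line (both file formats are hypothetical here; adapt the parsing to whatever your stack actually uses), a turn-down lint could be as small as this:

    # turndown_lint.py: flag alert definitions whose target host is no
    # longer in the machine inventory. File formats are hypothetical.
    import sys

    def load_hosts(path):
        """One hostname per line; blank lines and '#' comments ignored."""
        with open(path) as f:
            return {line.strip() for line in f
                    if line.strip() and not line.startswith("#")}

    def main(inventory_path, alert_targets_path):
        live = load_hosts(inventory_path)
        stale = sorted(load_hosts(alert_targets_path) - live)
        for host in stale:
            print(f"stale alert target: {host} (not in inventory)")
        return 1 if stale else 0

    if __name__ == "__main__":
        sys.exit(main(sys.argv[1], sys.argv[2]))

Run something like this from cron or CI; anything it flags is a candidate for "delete, worry about it later".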
So okay, what about unactionable alerts? Something's down, but it's managed by that team over there; why are you alerting me? Some organizations have structural issues in that two parts of the same org will only ever talk through a third-party vendor. This is amazingly common in large bureaucratic telcos particularly, but it's true in many large organizations, particularly ex-government ones that were partitioned back in the government days. So monitoring someone else's systems might actually be the only way I'll know there's a problem. It might be that the other team won't monitor their systems to the quality I require, or it might be that my systems are so flaky that if theirs go down, I can't handle it. There may be reasons for it, but in general, I shouldn't need to care if someone else's stuff is down.

This one's a little more contentious. I am of the view that an alert that says my service has failed its SLA, if that alert is waking me up in the middle of the night, is useless. What am I going to do about it? It's one thing to know I'm out of SLA, maybe plan, maybe cancel some work, but what am I actually going to do right now? Why was it of value to get me out of bed? If my service is out of SLA because it's down or throwing errors, then the underlying failure is the thing to alert on. If we want to know it's out of SLA in order to report it, to tell management that we failed and that the accounts team should expect complaints, then operationally we don't care about the SLA as such. But do log it: reporting on it, analysing it, identifying trends, knowing before it happens that you're going to fail SLA, all of that is wonderful. Being woken at 3 a.m. just to be told "out of SLA" is a horrible thing.

Now, bad thresholds are an incredibly common pattern. Really, every alert has bad thresholds; they're just not always entirely obvious. "This server has a high load average of four." It's got 32 cores; that's not actually a high load average. Back at a previous job I had to rewrite the Nagios load average check for precisely this reason. We cared if a box had a load average of three or four per core, because in that system it mattered, but we had boxes with anywhere from two to 16 cores, so we had to rewrite the check to be generic (roughly along the lines of the sketch below). In a similar direction: "oh, this disk is nearly full, there's only 100 megabytes left." Well, if it's a 10 terabyte LUN, I've probably already run out of time to do anything about it. If it's a 200 meg LUN, I probably don't care, because it might be /boot on a SAN-booted server. So thresholds often need to be dynamic, and they often need to change over time. Reviewing alerts once they're created is one of the things you, your admins, and your coworkers are least likely to get around to. It really only ever seems to happen when people completely rebuild their alerting stacks, and I suspect that's why people who completely rebuild their alerting are very happy with it right afterwards: they delete all the crap.
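This isn't the check we actually wrote, just a minimal sketch of the per-core idea, using the standard Nagios plugin exit codes; the 3.0 and 4.0 per-core thresholds are the numbers from the anecdote, not recommendations:

    # check_load_per_core.py: scale the load average threshold by core
    # count instead of hard-coding an absolute number like "4".
    import os
    import sys

    WARN_PER_CORE = 3.0   # warn at 3x load per core
    CRIT_PER_CORE = 4.0   # go critical at 4x load per core

    def main():
        cores = os.cpu_count() or 1
        load1, _, _ = os.getloadavg()     # one-minute load average
        per_core = load1 / cores
        msg = f"load {load1:.2f} on {cores} cores = {per_core:.2f}/core"
        if per_core >= CRIT_PER_CORE:
            print(f"CRITICAL: {msg}")
            return 2                      # Nagios: 0=OK, 1=WARN, 2=CRIT
        if per_core >= WARN_PER_CORE:
            print(f"WARNING: {msg}")
            return 1
        print(f"OK: {msg}")
        return 0

    if __name__ == "__main__":
        sys.exit(main())

The disk example wants the same treatment: an estimate of time-until-full (free bytes divided by the recent growth rate) stays meaningful across both a 200 meg /boot and a 10 terabyte LUN, where a fixed "100 megabytes left" threshold does not.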
Hair-trigger alerts are very similar. "This service didn't respond in 50 milliseconds"; well, it responded once in 51 milliseconds. And again, this goes back to SLAs. Okay, it might be my SLA to respond in 50 milliseconds every time, but know exactly how that can fail; not every single miss is worth alerting about. Your alert log is good history here.

Non-impacting alerts are similar again. If I've got one web server down, sure, that's something I want to know about so I can get it fixed. But don't wake me up if I have eight servers and I only need six; and if it's a local web service rather than a global one, the load at 3 a.m. might only be 2% of normal. I don't care.

Now, spamming alerts. Simon had an example just before, but I think nearly all of us have seen similar alerts at various times. "This thing is down for the trillionth time." Even if this matters, you've stopped caring. Realistically, everyone on your team has already started ignoring this alert. It doesn't matter how critical it is: if it fires often enough, especially if it doesn't actually seem to break anything, if it's a "your redundancy is impacted" or a "this network link in the place that doesn't matter very much is broken", you stop caring, you stop acting, and you start ignoring. When you start ignoring your alerts, you're going down a very bad hole. Again, logs are really helpful here.

Then there are alerts for stuff that nobody cares about. "My test server has no backups." Well, it's a test server. It gets rebuilt from Puppet. I don't care; I want it that way.

Nearly all of the earlier items lead to the same place: people stop caring. If I'm getting alerted for something I don't care about, it's no different from the email HR sends to tell me to stop doing stuff. It becomes spam in my inbox, or spam waking me up at night. It's a very bad spiral: it means you don't respond, and that effectively lowers your system's availability in a really subtle and insidious manner.

Now some related practices. Email alerts. "So it's not a high priority thing. I mean, the server's down, but it's one of those web servers, and we've got plenty left, so we'll send an email." Yeah, within a few weeks the entire team will have filters set up, or at least the part of the team that knows how to wrangle Exchange or whatever email system you have. That said, having a separate list that receives copies of all your alerts for archival purposes may be the best way to get an alert log. In most instances you already have a mailing list system ready; set it up. If your email system happens to be Gmail, that's really convenient for searching too, as is anything else with a search interface. And I don't care what you say: if email alerts exist, people will be ignoring them. This is a sad reality, but it is reality. You can often test this: if it's a separate email list that you expect people to subscribe to and act on, send an email formatted like an alert and see who notices. It's probably fewer people than you think.

Now, undocumented alerts. "This thing is broken. What am I supposed to do about that?" This is more common in larger teams, but even in a small team of half a dozen people, if I set something up, how does Simon know what to do when it breaks? I might be the expert on it; he might be the backup guy. It's not his fault that he doesn't know; it's my fault if I didn't document it. Sure, he should be able to work it out from first principles, but not at 3 a.m. after a long night's partying (of course, on-callers shouldn't do that). So document the actions to take. At the very least, document what the alert is catching, so that you can verify it: in some of these systems, once your alerts get complex enough, merely replicating the conditions is actually somewhat difficult. You want to be able to say "yes, the flag that says it's bad is firing, but the conditions aren't actually being met; why is that happening?" It should be documented such that all on-callers can follow it. You don't need to make it so the new NOC operator you hired last week can handle it, but you should be able to make it so the engineer you hired a month, maybe three months ago, can. How long someone takes to become on-call depends on your site; practices like this reduce that time and increase their effectiveness.

Having an acceptance process for alerts is an extremely effective way to do this if you have people who aren't on the team writing alerts: in a DevOps shop where the developers write alerts for you, or where you've got separate operations and engineering teams. Have a review process for your alerts and thresholds, with bug queues; require documentation; require those playbook pages to exist. And only people who are actually on call should be accepting alerts. In general this will probably be some senior people, but as long as it's on-callers, that's at least mostly okay.

Silencing. If your alert system is going to actually page someone, or take any automated action, you need a way to silence it. In practice, this ends up being a whole system, because you very quickly go from "I need to take this server down for a reboot, silence it" to "oh, actually, that causes impacts on these systems, and it means these network ports go down." You need a system that resolves these dependencies and handles them. It's an ugly, not-fun system, but it's a lot better than grumpy admins. Possibly the most annoying way to get woken up, or dragged out of your mid-afternoon nap, is when someone else does work they were planning but didn't silence it right, and you get tens or even thousands of alerts. This may include things like scheduling: if you're doing this for network alerting, you probably want to schedule the silence for the carrier outage at 3 a.m. in advance, rather than having someone wake up to inject it at the time, or injecting it eight hours early. Something like the sketch below.
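As a minimal sketch of just the scheduling part (the data model here is hypothetical, and a real system also needs the dependency resolution described above):

    # silences.py: scheduled-silence matching only. A real system also
    # resolves dependencies ("silencing this server silences these
    # switch ports"); this shows just the time-window check.
    import fnmatch
    from dataclasses import dataclass
    from datetime import datetime, timedelta

    @dataclass
    class Silence:
        pattern: str     # glob matched against the alert name
        start: datetime  # may be in the future, e.g. a 3 a.m. outage
        end: datetime
        reason: str

    def is_silenced(alert_name, silences, now=None):
        """True if any active silence matches the alert right now."""
        now = now or datetime.now()
        return any(s.start <= now < s.end and
                   fnmatch.fnmatch(alert_name, s.pattern)
                   for s in silences)

    # Schedule tomorrow's 3 a.m. carrier maintenance ahead of time,
    # instead of waking someone up to inject the silence at the time.
    at_3am = datetime.now().replace(hour=3, minute=0, second=0,
                                    microsecond=0) + timedelta(days=1)
    silences = [Silence("link-carrierX-*", at_3am,
                        at_3am + timedelta(hours=2),
                        "carrier maintenance window")]

    if not is_silenced("link-carrierX-syd1", silences):
        print("page the on-caller")

Because the silence only activates inside its window, it can be injected at 7 p.m. without masking real problems in the hours before the maintenance actually starts.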
One of the last, and a horrible, anti-pattern is production by fiat: when your executive says "this is now in production because I say it's in production". Good luck.

And lastly, I would highly recommend a book. Many of you will have seen Tom Limoncelli speak, amongst other things at past LCAs with his usual cohort of friends. He has a new book, The Practice of Cloud System Administration, which has a couple of chapters covering alerting and monitoring. I don't agree with 100% of it, but it's really good material to start with and think about, and it has the names of some of the most respected people in sysadmin behind it. Thank you.

Q: You said in your alert example before, about the load average of four, that you don't care because it has 32 cores. Isn't that irrelevant, though? Shouldn't you be concerned with what is normal for that box? If it normally operates with a load average of two and suddenly it's gone to four, something's probably wrong.

A: Okay.

Q: What do you do when there's a large number of systems that need fixing, and the priorities have been put somewhere else while they're alarming?

A: So realistically, in a large environment there are always problems. There's always more that can be fixed, more that can be improved. Ultimately it comes partly down to having some roughly decent priorities on alerts; trying to make them perfect will never work. Dashboards for alerts are a great help here, because they let you see that there's a chunk of low-priority stuff that matters over here, and a big thing over there. You shouldn't let alerting dashboards proliferate, you shouldn't have a hundred of them, but having two or three is actually really valuable, and lets you identify, in the never-ending stream, what might actually be worth working on. Okay.
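On the first question, a check that compares against what is normal for that box, rather than an absolute number, might look like this minimal sketch; the baseline file (one float, maintained out of band) and the 2x deviation factor are assumptions, not recommendations:

    # check_load_vs_baseline.py: alert on deviation from this host's
    # recorded baseline, per the question above, rather than on an
    # absolute threshold. Baseline file format is hypothetical.
    import os
    import sys

    DEVIATION_FACTOR = 2.0  # alert when load is 2x the baseline

    def main(baseline_path):
        with open(baseline_path) as f:
            baseline = float(f.read().strip())
        load1, _, _ = os.getloadavg()  # one-minute load average
        msg = f"load {load1:.2f} vs baseline {baseline:.2f}"
        if load1 >= baseline * DEVIATION_FACTOR:
            print(f"WARNING: {msg}")
            return 1
        print(f"OK: {msg}")
        return 0

    if __name__ == "__main__":
        sys.exit(main(sys.argv[1]))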