Wow. Cool. Thanks. I guess we'll start. Hello, everyone. Welcome to your last day of DevConf. I hope you've had a lot of fun this weekend and didn't do too much partying last night. I didn't. We're going to be doing a talk on holistic monitoring; that's what this is about. We are at DevConf 2019 in Brno, Czech Republic, and it's Sunday. Cool. So we'll introduce ourselves real quick. Jared, do you want to go first? No? Okay, I can introduce you. I'll do it. Can you hear me? So, hello, I'm Adam Mentor. I work for P&T DevOps, the department that helps our engineering organization at Red Hat get code shipped to our customers. We maintain a fun thing called the release toolchain pipeline. There are a lot of us, and I'm a project manager. And this is Jared Sprague. He works for the Customer Platform Operations team, he's a principal software engineer, and he looks great in this hoodie, but he decided to take it off. I don't know why. Yeah. Oh, wait, I have to think too. Cool.

So this is what we're going to cover today. We'll set you up with a problem statement, to walk everybody through it and get us all in the exact same mindset about what we're actually trying to solve. Then we have a solution for you, of course, that we'll cover at a high level. Then I'll hand it over to Jared, who will give you an in-depth look at each piece of our solution and how we think it works, and then wrap up with how this is actually working for us today, with some examples to show it off. It's really cool. And then hopefully we'll have some time for questions.

Problem statement. There are two big pieces of the problem statement that I need to point out and get everybody thinking about: perception and missed signals. These are very important pieces of the problems we currently face around monitoring, at least within our department, and I'm going to assume within the departments and teams of all of you great people here in this audience.

The first one is perception. That's a weird word to see in a problem statement; it's not often used. A lot of times when we're working with technology, our problems are specifically about technology: this particular JSON translator doesn't do the job I need, so my problem is that I need a better one. But perception is a different kind of issue entirely, and it's one that's harder to solve, because you know it's a problem, but it's very difficult for you, by yourself, to fix it. What do I mean by this? The term monitoring is almost always equated with a single tool. What I think all of you do in your head is this: when you hear the word monitoring, your brain automatically fills in a tool. You say monitoring is Zabbix. Monitoring is Nagios. Monitoring is Prometheus. And you keep this perception in your head that this is what monitoring is, this is all it's about, this is all it will ever be. And that is a giant problem, because you start to misalign what the tool is capable of doing with what monitoring actually is. So basically, I'm giving you this image.
This is the expanding-brain meme, right, that ascension to higher intelligence, to nirvana. That's really where we want to be. Because monitoring is not just a tool. It's a wildly complex and persistent art of principles, tooling, technology, and innovation. Okay, cool. I'm reading that out loud on purpose, because I'm going to try to say it later, and every time I've rehearsed this I've never actually remembered it. But this is what monitoring is. It's not Zabbix. It's not Prometheus. Those are tools that solve some function and some problem, but they're not monitoring. Monitoring is that statement. It's that nirvana, that cool, fun, relevant meme.

But here's an example to get you into the same mindset. You're a contractor and you're going to help build a house. You show up to the job site, and the person managing it says, oh yeah, welcome, cool, glad you're here, we're going to get started as soon as we can. Here's your screwdriver. Go build that house. I know some of you probably hate this example already. But you're given a screwdriver, and you look at the person and say, this is it? Yeah, yeah, go on, go build the house, you've got this. And you say, I mean, could I get a buzz saw, or a hammer, or anything? I could use one of those. No, no, no, we only use screwdrivers here. We have a deadline of next month, we have to get this done as soon as possible, and giving you anything besides a screwdriver would just delay things. We just need people to finish it. And you're like, oh, yeah, maybe the screwdriver can do everything. That's a good point.

So again, yes, this is a contrived example, but we're going to keep going with it. You spend two more weeks building this house. Your hands are basically bloody, you're missing a finger now, walls are falling over, everything is on fire, of course. This build is going terribly. The house is not built, you're not going to meet your deadline, and you're getting super frustrated. So you work up the courage and say, I've got this: I'm going to stand everybody up after work at five, we're going to hold a forum, and I'm going to present a new idea. Okay, cool. Five p.m. comes, you find your pedestal (I was kind of hoping for a reverse pedestal, so you get down off your pedestal), you stand up, you hold this forum, and you say: I understand all of your concerns about having new tools, but at a minimum we have to have a hammer. We have to incorporate a hammer into our jobs. It's going to make this so much better. This is not working, right? And you see one of your coworkers raise their hand. What do you think they're going to ask? What's the first question that comes out of their mouth? "Do we get a raise?" Yeah, that's a good one. "What kind of hammer?" Okay, yeah, now we're getting there. See, I was kind of expecting all of you to say they'd be psyched, because they get to use a hammer and nails. But in my weird contrived example, they raise their hand and ask: so, what is a hammer? Yeah, perfect. They ask what a hammer is, and you say, oh, okay, well, a hammer is this, and you explain it and show them what it is.
Anyway, you show them all of that, and then another person asks: is there any documentation for the hammer? Is that going to be available? Are there processes that need to be drawn up for using the hammer? Somebody says, oh yeah, good point. And is there any long-term strategy for replacing the screwdriver with the hammer? Are we going to decommission the screwdrivers at some point? And somebody says, that's a good point, we do need to make sure we properly decommission these screwdrivers, move the users off, and communicate it correctly. And you're losing your mind, of course. And then somebody else raises their hand and asks: is there an open source alternative to the hammer that we can use? I knew that joke would kill. I was very positive it would.

So it's nonsense, and we're laughing, because of course most people would just say, great, a hammer, can we get more tools? We need to build this house. But this is a problem that actually happens in monitoring right now. Replace hammer and screwdriver with Zabbix, Nagios, Prometheus, all the other tools you might have, and the same conversation is happening. The same stalling (I don't want to say bureaucracy), this perception of monitoring as a single tool, is getting in the way of adding a new tool, a hammer, to our stack.

So what does this do? What happens if you stay with this perception? Because it's never going to fix itself. What you ultimately end up with is this: you open the door and these gross little demons come tumbling out, like technical debt. If you try to change your tools, you have to do all this work to ensure feature parity. On top of the actual tool change, migrating from a Nagios to a Prometheus, you have to do it all at once, all or nothing, right? That doesn't happen. You end up with redundant tools, so now you have a Prometheus instance and a Nagios instance monitoring the exact same thing, and that's just dumb. You also incur technical debt along the way. And finally you've got the people raising their hands who don't want to use the hammer; they're happy with the screwdriver. They're really happy with Nagios. They don't want to move to Zabbix. This is ultimately what that perception is stalling.

But there's one more problem we have to solve, and that's missed signals. This one will be a little easier to grasp; we're all pretty familiar with it, so we'll go through it pretty quickly. But I do need to poll the room. By a raise of hands, how many of you have heard, said, read, or thought this statement: "We didn't catch that on our monitoring." I'm very curious. Is that everybody? No, not everybody, but most people. Okay, good. Cool. This slide is how I feel whenever I hear or say or read that: hysterical, going nuts, wanting to die. With missed signals you have two scenarios that I'll briefly run through. This is nothing new; you all know these things.
You have noise: alerts that fire but don't really matter. They're telling you things are wrong, but nothing is actually wrong, because your operational status says everything is A-okay. (That's also a plug for the Red Hat status portal, by the way, which is quite nice; Jared helped work on that.) This noise is coming at you all the time, it doesn't help you, and it causes alert fatigue. Then you have the other scenario, where nothing is alerting you, but of course everything is on fire, your services are throwing 503s, the whole thing. These two scenarios are a kind of yin and yang: the signal-to-noise ratio. And really, as engineers, as developers, whatever technical stack you work in, you want to be doing as much min-maxing as possible. Most of us have probably played World of Warcraft or Diablo or something like that, where you min-max your character build to get the most damage per second. You want to do the same thing with your signal-to-noise ratio, and you need all the tools available to you to actually make those tweaks. But if you're operating, back to that perception thing, with one tool, doing that min-maxing and dealing with that yin and yang is wildly difficult. You've only got a few vectors to work with, and it makes the problem so much more difficult to solve. And, if you notice, in the background there's more of that meming.

This is our solution. Jared is going to walk you through pretty much all of it, but I want to go through it at a high level real quick. It's this big old box of all kinds of words and boxes: our holistic monitoring strategy. It does things a little differently. You can read it if you want; you probably already are and aren't listening to me anymore. The monitoring strategy that we've come up with as a group is that instead of thinking of monitoring as one tool, you think of it as a collection of functions. On this screen there are nine functions of monitoring that we think are important. There are probably more; we probably didn't catch all of them, but this is the initial list we came up with. What we think this helps us do is break down what you need, on a functional-requirements basis, from each tool. Think back to my house-building scenario: you need screwdrivers so you can turn screws, you need hammers so you can pound in nails, you need a buzz saw so you can cut things into smaller things. Each one of those, ideally, maps to one of these functions; for a buzz saw, cutting wood is the function. At the top of each box is the name of the function, and you can get an idea of what each one represents. What this does for the perception problem is take that whole issue of tool change and tool lifecycle off the table. It's about knowing your functions and knowing which tool is going to help you solve each function. For instance, Zabbix is a great tool for infrastructure monitoring, but it is not a good tool for dashboarding. I mean, it has dashboards available, right?
You can go and configure them, but there are great tools like Grafana, or Kibana probably, that have Zabbix plugins, so you can manipulate the Zabbix data directly in Grafana. And it's really interesting when you play with this, because I did it just yesterday: it's way faster in Grafana, which is crazy, but that's because Grafana, if you use it as a dashboarding tool, is going to help you out far more. The last part of the problem that this solves is the signal-to-noise ratio. Where before I was saying you have this tweaking and min-maxing you need to do, you suddenly have nine more variables you can tweak and min-max as you want. So, yeah. With that, I'm going to stop talking and give it over to Jared.

As Adam was talking, I was thinking he's probably the Steve Jobs in this act. He's way more charismatic than I am. I'm going to be talking about the more boring technical stuff: the detail about each part of the monitoring functions in the strategy. If you'll notice, on each one of these slides, in the top right corner we highlight which function we're talking about.

The first one is infrastructure. This is the one everyone is familiar with; I think it's the most common, everybody knows what this is. It's all the metrics on your hosts, your origin systems: CPU, memory, disk, swap, network I/O, all of that stuff. But infrastructure monitoring by itself only tells you what's going on on the host, and an origin host needs to serve a purpose, otherwise it's just hardware. If it's not serving a purpose, is there even any point in monitoring it? Infrastructure hosts run services, and those services and applications also need monitoring. That's what APM, which stands for Application Performance Monitoring, is for. This goes a level up from infrastructure to monitoring the metrics of your application or service itself: what is the time spent in MySQL or Memcached, how much time do you spend calling out to external services, and which of those are slow and which are fast? What is your overall response latency for your users? And what is the lifecycle of a request through your application: what are all the layers it goes through, and how long does it spend in each one? So APM is very, very good at troubleshooting performance problems with your services and also diagnosing issues.
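To illustrate the per-layer timing idea behind APM, here is a toy sketch, not how any particular APM product actually instruments code; the segment names and sleep calls are stand-ins I made up.

```python
# Toy sketch of the APM idea: measure how long a request spends in each
# segment (database, cache, external calls) so the slow layers stand out.
# Illustrative only; real APM agents instrument libraries automatically.
import time
from collections import defaultdict
from contextlib import contextmanager

segment_totals = defaultdict(float)

@contextmanager
def segment(name):
    start = time.perf_counter()
    try:
        yield
    finally:
        segment_totals[name] += time.perf_counter() - start

def handle_request():
    with segment("mysql"):
        time.sleep(0.02)        # stand-in for a database query
    with segment("memcached"):
        time.sleep(0.001)       # stand-in for a cache lookup
    with segment("external-api"):
        time.sleep(0.05)        # stand-in for a call to another service

if __name__ == "__main__":
    handle_request()
    for name, seconds in sorted(segment_totals.items(), key=lambda kv: -kv[1]):
        print(f"{name:>14}: {seconds * 1000:.1f} ms")
```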
Okay, the next one: availability. This is probably my favorite function in the monitoring strategy. This is a scene from Silicon Valley where they're in the data center. I put it up here because those are origin servers running, right? That's your OpenShift cluster, or whatever, running, and he's pointing at an origin server. The fallacy of origin-based availability monitoring is that customers don't actually go into the data center, plug into your origin server, and start using it. If they did, you could get by with just origin-based availability monitoring. And when I say origin-based availability monitoring, an example is a Prometheus instance running on the same OpenShift cluster as your application, scraping internal service endpoints. That is an origin-based availability monitor. It does not tell you whether the service is available for your customer; it tells you whether it's up on the origin.

So here's an example of what I mean. Here are your users using a service, and here's your origin host over here; it could be OpenShift or just a VM or anything. There's a vast array of layers in between the user and the actual origin server, and I've listed just a few really common ones: the CDN, the firewall, and the load balancer. So let's play a little bit of monitoring Family Feud. All right, let's say this happens: someone misconfigures a CDN routing rule. Who thinks an origin-based probe is going to detect that? Okay, one person. Right, the origin-based monitor is still going to pass; you're not going to get a signal out of that. Who thinks a user-centric black-box monitor is going to detect it? Yeah, it's going to fail, because it has to go through all of those layers and expects a result back. And this is why, if you want to monitor the availability of a service that users are using from the outside, you have to do it with a user-centric black-box check. You can do both, obviously, and you want to do both, because it gives you two benefits: not only do you know the operational state from the user's point of view, but if your origin monitor is passing and the black-box monitor is failing, it helps you narrow the problem down to somewhere in between, quickly. And this is not theoretical. On the Customer Portal, all of those things have happened: someone uploaded a wrong list of IPs to the firewall whitelist and suddenly an origin host wasn't reachable anymore; the load balancer CPU got so high it couldn't route requests to the origin server. That's all stuff that's happened to me on Customer Portal ops. So this is really true, and as far as the monitoring strategy goes, this is probably one of the messages I most wanted to drive home, because I hear a lot of people say: we need monitoring, we need availability monitoring, let's put Nagios on the host to check that the service is running, and now we've got availability monitoring, right? So we know it's available. But that's not going to tell you whether it's available for your users. You have to have black box. And why does it have to be black box? Does anyone know why? Yeah: it can't cheat, and it exercises all the dependencies, because your service is only as available as its dependencies. If you're just checking a health check endpoint and it's up, that doesn't mean much, because users don't use health checks. Users do actions on your service. They do searches and expect results back, and those actions exercise dependencies. If any of those dependencies are down, as far as your users are concerned, the service is unavailable. So it has to be black box, executing actual user transactions: doing searches, logging in and checking their status, or making REST calls and expecting a result back. So that is that.
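As an illustration of that kind of user-centric black-box check, here is a minimal sketch using Selenium WebDriver, one of the open source options mentioned later in the talk. The URL, element locators, and search term are hypothetical, and a real check would run from outside your own infrastructure and report into your alerting layer.

```python
# Minimal sketch of a user-centric black-box availability check:
# drive a real browser through a real user transaction and require a result.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def check_search_transaction(base_url="https://example.com"):
    options = webdriver.FirefoxOptions()
    options.add_argument("--headless")               # no display needed
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(base_url)                         # goes through CDN, firewall, LB
        box = driver.find_element(By.ID, "searchbox")        # hypothetical element id
        box.send_keys("openshift", Keys.RETURN)              # act like a real user
        WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".search-result"))
        )
        return True                                  # user-visible transaction worked
    except Exception:
        return False                                 # any failure counts as unavailable
    finally:
        driver.quit()

if __name__ == "__main__":
    print("available" if check_search_transaction() else "UNAVAILABLE")
```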
Okay, the next function: real user monitoring. This one only applies to applications with a web front end, so it doesn't apply to REST services, for example. But if you have an application with a web front end or a single-page app, and you want to monitor what the user is actually experiencing, that's what real user monitoring does for you. This is an example from the front end of the Red Hat Customer Portal where there was a major performance degradation. The way it works: imagine you could go into your user's browser, pull their dev-tools data, and analyze it. It has the whole waterfall of all the requests that happen on the page, plus all of the asynchronous calls out to REST services that single-page apps make, and you store that, analyze it, and trigger on it. That's what real user monitoring is. It's just a bit of JavaScript embedded in the page that extracts that data and sends it back so you can trigger on it. So that's RUM.

Okay, logging. Here's one of the earliest examples of a log file, and it's got a triple meaning here because it's actually carved into a log. This is a scene from the story of the Lost Colony in North Carolina history, where Sir Walter Raleigh sends a colony to North Carolina and says: all right, you're on your own, I'm going back to England; if anything happens, leave me a message so we know what happened to you. He's gone for a long time because the war starts with Spain. He finally comes back and the whole colony is completely gone. All their cooking pots are still there and everything, but the people have vanished. And all they find is this message carved into a tree, "Croatoan," which no one has ever really worked out, except that it might refer to a nearby island or a local tribe. But I think it really shows you what logging is, right? Something goes wrong and you want to figure out what happened, so you go check the log files. It's the very best thing to do. I'm going to show our house one more time, because we very intentionally put logging behind everything. You can see the logging layer is blue, and everything sits on top of it, because not only does your application produce log files, of course, but all of your monitoring tools, all of your metrics and alerts and everything else, can also be dumped into logs. All that data can be stored there, so you can analyze it, trigger on it, do AI and machine learning on it. So you use all of that data.

All right, so now we're on to alert orchestration, but first, quickly, SLIs and SLOs: what are they and why do we use them? A service level indicator, an SLI, is something simple like an error rate. Just the error rate; that's an indicator. Other examples are latency or uptime. A service level objective is the threshold you put around an indicator. So an objective would be: you want your error rate to be less than 5% for 99% of requests over a 30-minute period. That's an SLO. And if you go over that, you can trigger a notification, or look at improving it. This also works very well with uptime scores. So why would we use them? Because it focuses alerts on business outcomes, not on arbitrary metrics. You define what's important for your service, and that's what you want to be alerted on. You know the rate of stuff that needs to go through, your throughput and so on, and you want to be notified about that. You might not want to be notified of a CPU spike if everything is working fine and all your SLOs are being met; you probably don't need it. Because of that, it improves the signal-to-noise ratio: you get notifications about what you really care about, not just raw metrics.
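To make that concrete, here is a minimal, illustrative sketch of turning an error-rate SLI into an SLO check. The 5% threshold and 30-minute window follow the example above; the request-record shape is an assumption, since real SLO evaluation would run against your metrics store rather than an in-memory list.

```python
# Minimal sketch of an error-rate SLI checked against an SLO.
# Illustrative only: numbers follow the talk's example, data shape is assumed.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Request:
    timestamp: datetime
    status: int                       # HTTP status code

SLO_ERROR_RATE = 0.05                 # objective: error rate under 5% ...
SLO_WINDOW = timedelta(minutes=30)    # ... over a 30-minute window

def error_rate_sli(requests, now):
    """SLI: fraction of requests in the window that returned a 5xx."""
    window = [r for r in requests if now - r.timestamp <= SLO_WINDOW]
    if not window:
        return 0.0
    errors = sum(1 for r in window if r.status >= 500)
    return errors / len(window)

def slo_breached(requests, now):
    """True when the indicator crosses the objective, i.e. time to notify."""
    return error_rate_sli(requests, now) > SLO_ERROR_RATE
```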
I put the Site Reliability Engineering book up here as a reference because it talks a lot about SLIs and SLOs, how to define them, how to create them. Highly recommended. There's also a YouTube video from Liz Fong-Jones, a Google SRE, that's a great presentation on SLIs and SLOs. Highly recommend it.

Okay, on to alert orchestration, the next part of the strategy. Tell me if anyone's familiar with this scenario. Those arrows are notifications. Everyone familiar with that? Yeah. On this side are things producing alerts or notifications: your infrastructure monitoring, your real user monitoring, your log analytics triggering alerts. They're all sending notifications from multiple sources, and they're all spamming everyone on your team, all at the same time, all day long. That's life without alert orchestration. Here's what alert orchestration gives you. Instead of sending notifications directly to people's email addresses or phones, you send them to the alert orchestrator. The orchestrator has on-call calendars and escalation paths, and it can notify just the person on call. And not only that, it can intelligently group things: if there's an incident and a bunch of stuff starts throwing alerts, it doesn't send a bunch of notifications; it groups them all into one incident and sends one text or one SMS for that incident. The on-call engineer can look at that and see all of the failed sources. And that frees up the rest of your team to pursue other things, like sleeping at their desk, child care, doing some yoga. Finally take care of your kids. Yeah.

Okay, so: metrics and dashboards. Metrics is a very big, broad word, but one way I like to think about metrics is information that you can query to get stuff out. Prometheus has PromQL; you can query your log files; all of that. So you generate metrics. Uptime is a perfect example: I'm going to generate my uptime based on my black-box, user-centric availability monitors. Dashboards, I think, are pretty self-explanatory. This is an example of one of the dashboards, Julia's metrics.
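As a small illustration of that "uptime from black-box monitors" idea, here is a hedged sketch that queries a Prometheus server over its HTTP API for the average of the blackbox exporter's probe_success metric. The server URL, job label, and 30-day window are assumptions, not details from the talk.

```python
# Sketch: derive an uptime figure from black-box probe results stored in Prometheus.
# Assumes a reachable Prometheus server and probes recorded as probe_success
# (the blackbox exporter's success metric); both are assumptions here.
import requests

PROM_URL = "http://prometheus.example.com:9090"   # hypothetical server

def uptime_last_30_days(job="blackbox"):
    # avg_over_time of a 0/1 success metric is the fraction of successful probes.
    query = f'avg_over_time(probe_success{{job="{job}"}}[30d])'
    resp = requests.get(f"{PROM_URL}/api/v1/query",
                        params={"query": query}, timeout=10)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    # One series per probed target; report each as a percentage.
    return {r["metric"].get("instance", "?"): float(r["value"][1]) * 100
            for r in results}

if __name__ == "__main__":
    print(uptime_last_30_days())
```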
So, how is this helping us today? Do you want to switch off? Yeah, I can do it. Okay, all right, I'm going to pass the mic off. I'm back; I've been behind that computer screen. Cool. Thank you, Jared. How much time have we got? Oh, it's 12:30. Man, we could just talk forever, we could just keep going. Hello, can you hear me? Are we good? Cool. Sweet. Yeah, this is the most important piece, right? We can present all of this conjecture... or not conjecture, that's not the correct word. We can present all of this theorizing and theory-crafting around doing this work, but ultimately you have to show it working. And the good news is that in a lot of ways, this is already helping us today. So Jared is on the team that takes care of our Customer Portal, access.redhat.com. If you've been there, if you've ever seen it, it's almost completely maintained by all the pieces in our strategy; we'll show you that. Julia, and a few others in here, are part of a new initiative within our P&T DevOps team to create new pipeline tools, and they're using a lot of the different aspects of this monitoring strategy. We're working with a bunch of other internal teams to do this as well, so it's still very much in action.

But this is sort of the meat that we wanted to show. You will notice some proprietary tools up there, and I've got a slide about open source alternatives, of course. But I wanted to show the tools that we're currently using, or currently targeting to use, and what they look like. Where I was saying before that you kind of have a buzz saw in one category, a hammer, a screwdriver: we're seeing that Prometheus and Zabbix are incredible tools for monitoring your infrastructure, of course. But we've found that there are other tools; in this case New Relic Synthetics, proprietary software, but it's simple, it's easy, it's quick. For alert orchestration there's a tool called PagerDuty, which you may have heard of; again proprietary, but it does a really cool job. The idea with all of these different pieces is that we could pull one out and replace it with something else. If we were not happy with how something was performing for us, we could make that shift and that change if we needed to, and it wouldn't affect anything else, which is the most important piece, right? Also, this is a plug for Data Hub. I had to find your logo; I found it on the GitLab instance. That's the old one? I've got to go, I'll see y'all. Yeah, well, I pulled it from your avatar in GitLab. So anyway, I'll need to meet with you and get the new one, and some stickers. It's open source; you can create a merge request. Okay, that works. Yeah, that's perfect, I'll do that. I've got Illustrator; I can handle it.

So, what do open source alternatives to all these things look like? These are all completely open source. Some of them do have an option to pay, of course, a subscription or some kind of enterprise or premium offering, but for the most part, from what we've seen, they can perform the exact same functions. The reason we're not using them today is that they require actual development time. They require people to be invested in that tool and that technology, and we're running with just a few people; there aren't a lot of us, so we have to work with what we have. Some of these are pretty interesting. Bucky hasn't been contributed to in five years, but it seems to be one of the only solutions, except for Piwik, which is now Matomo, but that's more analytics, like a Google Analytics alternative. Performance Co-Pilot has actually been developed and primarily maintained by a few Red Hatters based out of Brisbane, Australia. Iris and Oncall are interesting tools that I believe were developed by LinkedIn; I think they provide much the same functionality as PagerDuty, but open source, of course. Everything else I think we all know, right? So that's pretty cool. Do you want to take this one? Yeah, this is actually good. I don't know anything about this one. I can go through it pretty quickly. Actually, go back one slide; I'll just follow you. I just want to point out one thing about this: why there's Prometheus and Selenium in the availability check.
Like I was saying before about availability: if you want to do the user-centric type of availability monitoring with Prometheus, you can, but you have to set up a Prometheus instance in an external location, like somewhere in AWS. The most important thing is that it's not running on the same infrastructure as your app; it has to run from the point of view of the users. Prometheus has a black-box exporter that can go out and check endpoints, make REST requests, and so on. But Selenium WebDriver is on there because the other thing you have to do, since it has to be black box, if you have a web app, is user interactions. If you have a search front end, you have to go to that web page, type in a search request, click the button, and make sure the search results come up. I don't think Prometheus can do that. I don't know; maybe someone can tell me if it can, but I don't think so. So you need something that can do that, and Selenium WebDriver is one alternative. New Relic Synthetics uses it under the hood. I just wanted to point that out.

Okay, this next one I won't spend much time on because it's messy. This is the Customer Portal monitoring strategy, which follows all of these things. We've got external synthetic monitors and real user monitoring, from the outside in through the CDN, so we can detect those middle-layer problems. We've got APM and infrastructure agents running on our origin hosts. We also have synthetic monitors hitting the origin hosts in different environments, and everything gets logged.

Okay, this is a wrap-up of an incident, an example of a real post-mortem, which I pulled directly from our records. Every time we have a major incident on the Customer Portal, we require a post-mortem, and I took this screenshot straight out of one. So, the first thing that happens: I was the actual on-call person when this occurred, and this happened before we had alert orchestration, so I get about ten push notifications on my phone saying the portal components are down. That's the red bar there. The screenshot was taken after the incident ended, which is why it looks the way it does, but it was failing at the time. And this is really important, because going back to the status page and what Adam was saying at the beginning: you want to make sure that your status page reflects the operational state of your app for your users. If you have a status page, you never, ever want to be in the situation where the app is unavailable for your users and your status page says it's operational. That's the worst. That's very embarrassing. So since we have these synthetic checks hitting all the components from the user's point of view, as soon as they started failing, the status page updated to reflect it. So now we have a major outage and we know it's affecting customers. Looking over at RUM: how is this affecting users' experience? The loading time of the site has almost doubled and throughput has gone down, so we know how users are feeling it in their browsers. Okay. Tracing into the synthetic check: why did it fail? It got a 500 response code back; you can see it right up there. Okay, why is that? That's still availability, so the next step is to check the APM.
The APM trace of one of the failed transactions gives a call stack, and I can see exactly where it failed: the database connection constructor. Hmm, okay, the PHP app cannot connect to the database. All right, so what's wrong with our database? That takes us to logging. In the logs we see exceptions, over 9,000 log messages in the last 30 minutes, saying PDO exception, connection refused, from the database. Now we're really close to the root cause: PHP definitely can't connect to the database. So what's going on with our MySQL server? Now we hop over to the infrastructure monitoring for the MySQL server, and we see: oh dang, look, the MySQL threads went skyrocketing, the slow queries had a huge spike, CPU spiked, memory went nuts, and then it just died; the server just died. We know that's what happened: the MySQL server is dead. So we found the root cause. All of that took less than ten minutes. From the moment we got the signal that there was an outage, because we were monitoring holistically, we found the root cause in less than ten minutes. Just boom, boom, boom, boom, okay, that's it, right there. And not only that, we were able to resolve it quickly, because we knew the root cause: the quick fix was just to restart the MySQL service. And on top of that, because our APM tracks the performance of all of the SQL queries being executed by our application, we could find the exact place the slow query was happening, which file, which path, which query exactly, and create an issue for it. An engineer fixed it, and it went from four seconds to 70 milliseconds. Done, right? That's the power of holistic monitoring.

So, summary. The main messages of our talk: think of monitoring by functions, not by tools. Don't think of monitoring as just Prometheus or Zabbix or New Relic or whatever. Think of it as infrastructure monitoring, APM, availability, logging, all of those things. Those are the functions you want to make sure are covered, and there are tools for all of them. The tool is kind of irrelevant; the most important thing is that you're covering each of the functions. If you do that, you catch the signals, especially with user-centric black-box availability monitoring, you improve your signal-to-noise ratio, you make your ops team happy with alert orchestration, and you end up with happy customers. Okay. Questions. How much time do we have anyway? Oh, perfect, yeah.

Alert orchestration? Yeah, PagerDuty. Well, PagerDuty, or Iris and Oncall. So, all right, the question is: how do you get the notifications from your monitoring tools to your alert orchestrator? The answer depends on the monitoring tool you're using; it may have various integrations already built in. Prometheus, as an example, has a PagerDuty integration out of the box: all you do is get a PagerDuty service key, put it into your Alertmanager config (thank you), and it will send notifications directly to PagerDuty and match them up with the right service. And if a tool doesn't have that, there are other ways, like sending it directly to a specially-encoded email address for the service in PagerDuty. So to the monitoring tool, it's just like sending a regular email notification, but it's going to your alert orchestrator instead of directly to a person.
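For a tool without a built-in integration, the fallback usually boils down to a small HTTP call like the sketch below, modeled on the shape of PagerDuty's Events API v2. The routing key and field values are placeholders, so treat this as illustrative rather than as the exact integration the talk used.

```python
# Sketch: pushing an alert to an alert orchestrator when the monitoring tool
# has no built-in integration. Modeled on PagerDuty's Events API v2; the
# routing key and payload values are hypothetical.
import requests

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "YOUR_SERVICE_INTEGRATION_KEY"   # hypothetical service key

def trigger_incident(summary, source, severity="critical", dedup_key=None):
    event = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",             # orchestrator opens/updates an incident
        "payload": {
            "summary": summary,                # what the on-call person sees first
            "source": source,                  # which monitor or host raised it
            "severity": severity,
        },
    }
    if dedup_key:
        event["dedup_key"] = dedup_key         # lets related alerts group into one incident
    resp = requests.post(EVENTS_URL, json=event, timeout=10)
    resp.raise_for_status()
    return resp.json()

# Example: a failing black-box check might call
# trigger_incident("Portal search check failing", "synthetic-check-search",
#                  dedup_key="portal-search-outage")
```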
Yeah, cool, that's a great question. So the question is: how do you involve dev when the ops on-call person gets the issue, but they might not be the right person, and a developer would be more appropriate to figure out what's actually going wrong? Yeah, right. I'll tell you how we do it on the Customer Portal; that's a really good question. In the case of the post-mortem I showed you, I was the ops person on call when it happened, and when I saw that it was Drupal not able to connect because of slow queries, I reached out to the Drupal developers right away. We always do; we'll involve them. And there is an understanding with the dev teams that we're not always going to be the ones fixing it, especially if it's something wrong with their code; usually they'll do the longer-term fix. The DBA did the restart of the service, but for each one of our services on the Customer Portal we have an on-call person in each time zone. For the knowledge base, which is Drupal, we have the tech leads, the SMEs, for each geo listed in our escalation paths. So if one of their services or applications goes down and the ops person doesn't really know what's wrong, the ops person is responsible for making sure the incident is handled, but they'll often reach out to the dev teams to fix it or figure out what's going on with their application. Does that answer your question? Yes, yep, exactly.

Right, yep, we've done that. The question is: how do we feel about automatically restarting services based on monitoring telemetry? I think it can be useful, and I think it can also be dangerous. For example, we have a very vast Solr index on the Customer Portal, all of the knowledge base, all of the documentation, and if you restart that thing, oh man, it takes a long time, right? So you would never want to do that automatically. But there are other times where maybe it makes sense; it just depends on the impact of restarting automatically. If it's low impact and it will fix the problem, then I'd say go for it. But if it's a big system that takes a lot to restart, then I would be careful about it.

Yeah, yes, so the question is: do we have an inventory, like a manifest, of all of our hosts? We do and we don't. I think that's actually one of the side effects of the infrastructure monitoring function: if you have infrastructure agents monitoring all your hosts, you get an inventory as a side effect, because you're listing them. So I would say that's part of infrastructure monitoring. I know Prometheus can do that, right? Oh, geez. Well, I think we're out of time anyway. You're right, we need our thank-you slide, but I'll take one more question.

Great question, yeah. So the question is: we have multiple tools monitoring different functions, but does that introduce a challenge correlating problems across the tools? In my experience, it only adds a little bit of extra work, because we really know what all of our tools are.
It's just a little more inconvenient because you have to go to a different URL for each one: you check Grafana here, you check APM there. But it's really just a little bit of extra time to go through each tool. If you had everything all in one place and could trace through it, it would save you a little time, but the end result would be the same. It's not too bad.