Good morning everyone. As she mentioned, I'm a developer advocate for Logz.io, and if you're wondering what Logz.io is, we provide ELK as a service. If you want to know more about the awesomeness of that and the additional features we add to the ELK Stack, please feel free to see me or our table out in the main area.

Now we'll shift right into what I'm talking about this morning, which, as she mentioned, is sensory-friendly monitoring. This talk was inspired by a situation at a startup I developed at, which looked a little like this. Who's been there? Yes. As a small startup, we said: we have a small team, so we can't risk anything going wrong, and to make sure nothing goes wrong, we need the monitoring to tell us everything. So we set it up to tell us everything. The flip side of that, as I'm sure you know since you're laughing, is that everything becomes too loud: no one's following up on alerts, people are muting Slack, muting their texts or PagerDuty, and now it's not telling us anything at all. So the problem we started to encounter was how to segregate duties and build a good flow that avoided all that noise, because the noise was overwhelming our then 15-person engineering operation; again, no one was receiving what they needed to hear.

Unfortunately, our first attempts were a little futile as well. We turned the alert volume down a little too much, and because of the ingrained suspicion we all had by then, having been used to a noisy system, when the system went quiet everyone was actually just as anxious. So we tried to find a happy medium and avoid situations like this. To approach the situation from a place of awareness, we tried to consider the cost of what was going on, so we could understand what was actually harming us so badly, figure out where it was costing us, and then move forward with that information.

Just to provide some additional context: what happens when you have a distracted brain? I'm making the joke in the context of brains and alerts, but this is true of distracted brains generally. We evolved as a species to avoid predators, so we're very hypersensitive to our environment, and especially hypersensitive to things being wrong in our environment. Your brain takes all of this in even when you think you're not paying attention to it. Have you ever lived with a person, family, spouse, it doesn't matter who, who was really obsessed with a topic, and suddenly you realize: I've never read about this topic, and yet I know things? That's your brain just pulling it in from your environment.

So studies have been done, because that's how we help ourselves out. Say you have a small interruption, and I mean small: not something that causes you to entirely switch what you're doing, just Joe Developer walking up saying, hey, I can't access the internet. You say the Wi-Fi password is x-y-z, they saunter off, and you think, great, I'm going back to my CloudFormation templates.
Where was I? It takes 25 minutes on average to get back. The idea is that when you return to the state you were trying to be in before, you're no longer in that state: your brain has half shifted focus, and whatever mental model of your project you were holding is no longer the same in your head. You have to reestablish that connection with your mind, and that takes time. In terms of putting a number on it, it's 25 minutes on average every time someone walks up and gives you a quick little jolt of something else to do. That study was done at UC Irvine.

The other issue is quality degradation, and this was tested by George Mason University. They had AP students taking the written portion of their tests, and the researchers knew about the 25-minute study that had already been published, so they said: okay, let's give the interrupted students their time back. Show of hands, how many people think the scores bumped back up after they gave the time back? A couple of you hopefuls. You're very hopeful and positive, and I like you. It did not. The students did their written assignment, the interrupted group got their time back, and their scores did not change, while the uninterrupted group consistently performed an average of one to two points higher, on the AP writing scale, which for those who don't know is out of six now, apparently. On a scale that small, a couple of points is rather significant for this type of test.

Which slides us into what happens when you try to multitask: flipping back to quality degradation, are you able to actually focus at all? This next study was interesting. Say that instead of CloudFormation templates or Kubernetes, you're not doing anything complicated today. It's an easy, maintenance-window kind of day, and you're just sliding through it and it's fine. A repetitive day, is what I'm getting at. This study followed people doing a short, repetitive, three-step task, then started throwing in tiny little interruptions, and they found people could not complete the repetitive task; their error rates just skyrocketed. And again, this was a simple task, intentionally repetitious: whatever tiny task they were given, they were meant to repeat it over and over. It wasn't like they were doing different steps in a multi-step process.

So now, bringing it back to alert fatigue: I have distractibility, I have decreased quality of work, and perhaps the inability to complete even the basic tasks of my day, even when I'm not architecting anything new. So we really, really, really need to tone it down. And how do we tone it down so that we're still getting the information we need?
Because, going back to Squidward, and I never thought I'd say that in public: when you have too little noise, especially for a team that's become accustomed, or perhaps maladapted, to a very noisy, very sensitive alerting system, then when you start reducing the noise, their anxiety is now going to go up.

So this is what we started to do at the startup. First of all, we wanted to know where our noise was coming from. We knew how we had implemented our app; for the most part, no one knows better than you and your engineering team how your stuff is implemented. We knew what infrastructure platforms we were using, what languages and libraries, within a fair margin where we were error-prone, and which things were legacy and fragile, brought in as stopgap solutions. So we knew the sources of our noise.

Then we started to categorize them, so we could say: okay, this type of noise is infrastructure. This seems basic, but you actually do have to take the time to pull this information out of your system rather than feeling it out. You need to sit with the team, because in some cases, say something starts throwing 500 errors, is that a code deploy gone wrong, or is it infrastructure? You really have to know what category you're putting different types of alerts in. And the purpose of the categories is ownership, so you know who the engineering point person is, and who's responsible within that team as you go down the tree. You might start with the customer-facing on-call, the person in engineering to contact if someone outside the company notices something's going on, and underneath them the people responsible for the infrastructure, the front end, the back end, and so on. You also, and I can't stress this enough, need to start clearing clutter; I'll come back around to that in a minute.

This is what a subset of the noise sources looked like at a place I worked. You have your traditional stuff: hey, I need to know when there's a bug in JIRA; hey, I need to know when something's going on in my Elastic Stack; hey, it's going through Slack, email, you get the point. You're going to have logging and ticketing and things like that.
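As a rough illustration of that categorization and ownership tree, here's a minimal sketch in Python; the category names, channels, and on-call aliases are hypothetical, not from any real setup.

```python
# Hypothetical categories, channels, and on-call aliases -- illustrative only.
ALERT_ROUTES = {
    "infrastructure": {"owner": "infra-oncall", "channel": "#infra-alerts"},
    "backend":        {"owner": "backend-oncall", "channel": "#backend-alerts"},
    "frontend":       {"owner": "frontend-oncall", "channel": "#frontend-alerts"},
    "deploy":         {"owner": "release-oncall", "channel": "#deploys"},
}

def route_alert(category: str, message: str) -> str:
    """Route an alert to its owner; unknown categories land in a triage channel."""
    route = ALERT_ROUTES.get(
        category, {"owner": "eng-oncall", "channel": "#alerts-triage"}
    )
    return f"[{route['channel']}] @{route['owner']}: {message}"

# A 500-error spike is ambiguous (bad deploy? infrastructure?), which is
# exactly why the category has to be decided with the team, not guessed.
print(route_alert("deploy", "5xx spike after release 1.4.2"))
```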
But you also need to be mindful of your own noise, and this is something we had a discussion about as a team, because it wasn't the most straightforward topic. When we talk about our noise, we think about things like AWS or Google Compute or PagerDuty, the things I just mentioned on that slide. But then we decided to have a meeting about our own distractibility: the incoming noise that's unrelated to work but still uses some of the same mechanisms. You have notifications enabled on your phone, which allows PagerDuty to push a notification, but also allows your mom, or whomever, to say, please pick up the milk on the way home.

So when you're dealing with your own personal noise, you can end up muting things that are also the avenues your monitoring systems use to push notifications to you. Part of the way you push that aside is by establishing some boundaries: know yourself, know when you work best, and know what your daily workflow at whatever company you're at is likely to be. Some people peak in productivity and creativity in the morning, some in the afternoon, depending on a whole bunch of biology not worth getting into. Once you know that, you can say: okay, I'm most productive at 2 p.m., so I would really prefer that spouse, family, and coworkers, unless it's urgent, don't bother me for a few hours around the two o'clock mark. Then, when my brain is on, I can devote it to the tasks that really need that heightened creativity, and I'll try to schedule meetings for when I can be more of a recipient than a generator.

Part of this also means letting these people know what you consider to be urgent; they probably wouldn't be pinging you if they didn't consider it at least somewhat urgent. So when you tell your coworkers, hey, please don't bother me for two hours, you need to give them something: this is my backup; if these high-priority items go wrong and all of that fails, then come to me. Give them some sort of boundary to work around. For family, be like: unless you need me to leave work, please don't contact me during this window. That way people know what you consider urgent.

But switching back to the external noise: when you're categorizing your noise, you'll probably start to notice that beyond just knowing what type of noise it is, infrastructure versus front end versus whatever, some alerts are better behaved than others. We certainly did; we mostly ran into the false-positive category, and I'll give a quick example of that. One of our developers had roughly calculated how much memory our RDS instance should be using if, hypothetically, all the microservices were connected to it, and he set an alert on that, because his logic was: if memory usage falls below that number, someone's not connecting and something's offline. Of course, being somewhere else on the team, I didn't know he had done this. This is one of those things: startups run lean, so everyone has a certain amount of access to everything for a while, when the team is super tiny, like five people. So you get that 4 a.m. message from PagerDuty about the RDS instance and go: what? I just flagged it, said nothing's actually offline, I'll ask questions in the morning. And that was the conversation that resulted. So then I said: okay, now that I understand your logic and what you're intending to do, I'll point out that we also have alerts hooked into our logging system, and we can alert on failed database connections instead.
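To make that concrete, here's a minimal sketch of what such a log-based check could look like, assuming logs land in Elasticsearch under a hypothetical app-logs-* index pattern and that connection failures are logged with a recognizable phrase; both are illustrative assumptions, not details from the actual setup.

```python
# A sketch of alerting on actual failed DB connections from logs, instead of
# inferring "something is offline" from a memory heuristic. The endpoint,
# index pattern, and log phrase below are placeholders.
import requests

ES_URL = "http://localhost:9200"  # placeholder endpoint

def failed_connection_count(minutes: int = 5) -> int:
    """Count recent log lines that record a failed database connection."""
    query = {
        "query": {
            "bool": {
                "must": [{"match_phrase": {"message": "could not connect"}}],
                "filter": [{"range": {"@timestamp": {"gte": f"now-{minutes}m"}}}],
            }
        }
    }
    resp = requests.post(f"{ES_URL}/app-logs-*/_count", json=query, timeout=10)
    resp.raise_for_status()
    return resp.json()["count"]

if failed_connection_count() > 0:
    print("ALERT: a microservice failed to reach the database")
```

The point isn't the specific query; it's that the alert now tests the failure itself (connections failing) rather than a proxy for it (memory below a rough guess).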
And that was more accurate, and we still captured the intention; we didn't lose it. The intention was to find out if a microservice wasn't connecting. But that required a conversation, and knowing what was going on, and knowing what we expected.

The same thing holds for false negatives and fragility. If you have a fragile system, even with the best-scoped alert system in the world, it's still going to be noisy, it's still going to be distracting, and you're still going to have a really hard time picking out: is this actually a problem problem, or just a hiccup problem? You have to know which systems are fragile and set up your alerts accordingly, making them only as noisy as they should be with that taken into account. You don't want to get pinged at 4 a.m. for an RDS instance that's not actually down, just as an example. And if something fires frequently, this might again seem a bit redundant, but make sure you're taking the time to fix frequent errors. If something's noisy just because no one has time to fix it, bump up its priority in your JIRA queue or your queuing tool of choice and get it done. At that point, just by virtue of fixing it, you'll resolve that noise source.

When you're establishing your flow, and this goes back to the RDS example, you need to know who needed to know what, why they needed to know it, and how quickly they needed to know it. The developer made some assumptions when he spun up that alert, and it wasn't from a bad place; it almost never is. You don't normally run into Joe Developer over here going, ha ha ha, I'm going to wake up our sysadmins. Usually it's: I have a certain mental model of our infrastructure and our apps, and based on my understanding, this is how I'm going to implement a thing, and then someone else is actually impacted by that decision. We all need to make sure we're having the conversation where, if somebody over here can impact somebody's work over there, they know what everyone's logic and expectations are. I knew, as the infrastructure person, that I didn't need to be woken up over the memory number when it wasn't an actual issue, so I just told him, and we figured out what to do. And it goes back to noise level: if the database had legitimately been down, not just a memory threshold crossed but actually offline, with other errors popping up, then I would absolutely expect PagerDuty or your system of choice to be loud about it. You never want to be in the false-negative space, where your notification is a human, a salesperson or whomever, saying, hey, customers are reporting problems.
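One way to keep "who, why, and how quickly" from living only in people's heads is to write it down next to the alert itself. A minimal sketch, with hypothetical alert names and field values:

```python
# Capturing "who needs to know, why, and how fast" as explicit alert
# metadata instead of leaving it implicit. All values are hypothetical.
from dataclasses import dataclass

@dataclass
class AlertSpec:
    name: str
    reason: str   # why this alert exists -- the intention behind it
    notify: str   # who needs to know
    urgency: str  # how quickly: "page", "business-hours", or "ticket"

ALERTS = [
    AlertSpec(
        name="db-connection-failures",
        reason="detect a microservice that cannot reach the database",
        notify="infra-oncall",
        urgency="page",  # wake someone up: a real outage signal
    ),
    AlertSpec(
        name="rds-memory-below-expected",
        reason="proxy for 'a service may be offline' -- weak signal",
        notify="backend-team",
        urgency="business-hours",  # not worth a 4 a.m. page on its own
    ),
]

for spec in ALERTS:
    print(f"{spec.name}: {spec.urgency} -> {spec.notify} ({spec.reason})")
```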
And after spending all that time talking about how you don't really want a ton of noise, I am now going to talk about redundancy, because this was fun. Who uses Slack a lot? Cool. Who at least sort of enjoys Slack? About half of you. Nice. We're going to pick on Slack a tiny bit, and I do love Slack; it does what it needs to do, and it's a mostly available system. The situation that happened here was rather rare.

I don't know how many of you were aware of these time windows in your own workflows, but basically, early one morning Eastern time, Slack went down, and it stayed down for half a day. Now, we at Logz.io are pretty flexible internally, so although we didn't necessarily think to ourselves "Slack is unreliable and we're going to plan around that," we do have the ability to switch endpoints and get things accomplished other ways; we don't have everything just dumping into Slack during the day. I mention this because at a prior startup I worked at, we did fall into that trap briefly, until we reassessed: we had low-priority alerts going into Slack during the day. With a Slack outage, if we had not resolved that in advance, we would have spent half a day figuring out how to switch endpoints, or just sat in a noise vacuum, hanging out, sipping coffee, and letting the caffeine raise our anxiety levels.

So you want to make sure that when something goes down, you've already looked at your tooling, and this is of course not just about Slack. If you use AWS, you can use SNS notifications and things like that. If something goes down, you need the ability to switch it up, to send alerts somewhere else, so that while your service or your third-party service is down, you're not left in the dark. And if it happens often, not to pick on Slack, there are tons of other services, and I'm sure we've all used a third-party service that we noticed over time was not as reliable as we initially thought it would be. You need to be able to reevaluate, and to architect around that, so when something isn't behaving as expected you can, within reason ("easily" being a relative word), pull it out and replace it with something else. For example, if Slack stopped being at least production-grade and started going offline all the time, we would definitely need to be able to switch to another chat system. That half goes without saying, but we want to make sure we actually can.

And part of this is that as your engineering team adjusts to the new flow of things, you don't want Squidward anxiously biting down on his clarinet, wondering why it's so quiet, and blasting into the noise vacuum. You want the resilient setup to build trust: when it's quiet, everyone knows it's quiet because it's actually quiet, because everything is fine. Yes, you can work on that side project you've been putting off for three months. So, again, regularly evaluate the reliability of your tools and services, I cannot stress this enough, and this goes for internal services as well. I worked at a past company where the third-party solution we were using for some image scaling just wasn't functioning quite up to par. We'll leave them anonymous for the purposes of this talk, but we ended up spending the time to rewrite, rearchitect, and implement something internally, because enough was enough. So reevaluate internal services too, and make sure they keep up and running.
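As a sketch of what "being able to switch endpoints" might look like in practice, here's a minimal fallback notifier: try a Slack incoming webhook first, and publish to an SNS topic if Slack is unreachable. The webhook URL and topic ARN are placeholders, and the two-channel design is just one possible arrangement.

```python
# Minimal fallback notifier: primary channel is a Slack incoming webhook,
# secondary is an AWS SNS topic. URL and ARN below are placeholders.
import boto3
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:ops-alerts"    # placeholder

def notify(message: str) -> None:
    """Try Slack first; if it's unreachable, publish to SNS instead."""
    try:
        resp = requests.post(SLACK_WEBHOOK, json={"text": message}, timeout=5)
        resp.raise_for_status()
    except requests.RequestException:
        # Primary channel is down -- don't sit in a noise vacuum.
        sns = boto3.client("sns")
        sns.publish(TopicArn=SNS_TOPIC_ARN, Message=message, Subject="ops alert")

notify("Deploy 142 failed health checks in us-east-1")
```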
And I mentioned I was going to get back to this: you need to regularly clean out the clutter. Your app is always in a state of flux. It's always changing, because your users change and because you change. It started off as a really simple idea, an app for only this one type of, I don't know, financial interaction, but now there's a new potential user base, so the requirements change, and suddenly you're rearchitecting, and your alerts may not be alerting on the same endpoints or using the same things anymore. You do not want to just leave them there hanging out. Sometimes this is more obvious than other times, because some changes are very drastic: you rearchitect a whole solution, maybe switch from individual dedicated EC2 instances to something like Cloud Foundry, the whole thing's different, and you think to yourself, of course I'm changing my alert system. But when you're making tiny changes, and maybe you have alerts based on more subtle behavior, you definitely need to take the time to ask: did this change anything I'm checking on, metric- or monitoring-wise, and do I need to make any adjustments?

The reason I nicknamed this "sprint cleaning" is that my suggestion is to make it part of your sprint process. Spring cleaning is a pain when people only do spring cleaning, because by then your house has accumulated clutter on top of clutter and it takes a long time. But if you do it a little at a time, every week or every other week, whatever your sprint schedule is, then aside from maybe the first pass, it's five to ten minutes of reevaluating what happened: which alerts went off during the sprint, and are they still relevant? Yes, no, maybe so; axe the ones that aren't, it's very quick. And that will keep it from becoming a beast again, assuming you're in a position where you have to tame the beast now.
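If you want to make that sweep a little more mechanical, a minimal sketch might look like this; fetch_alerts_fired is a hypothetical stand-in for whatever history API your alerting system exposes (PagerDuty, Logz.io alerts, and so on), returning sample data here so the sketch runs.

```python
# "Sprint cleaning" pass over the alerts that fired this sprint.
from collections import Counter
from datetime import datetime, timedelta, timezone

def fetch_alerts_fired(since: datetime) -> list:
    """Hypothetical stand-in for querying your alerting system's history API."""
    return [
        {"name": "rds-memory-below-expected"},
        {"name": "rds-memory-below-expected"},
        {"name": "db-connection-failures"},
    ]

def sprint_cleaning(days: int = 14) -> None:
    since = datetime.now(timezone.utc) - timedelta(days=days)
    fired = Counter(event["name"] for event in fetch_alerts_fired(since))
    for name, count in fired.most_common():
        # Per alert: still relevant? right owner? did anyone act on it,
        # or was it muted? Axe or fix anything that fails those questions.
        print(f"{name}: fired {count}x -- still relevant? right owner? acted on?")

sprint_cleaning()
```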
For those of you who like reference slides, I recommend photographing this one. This is some of the reading that went into the talk: it has some of the studies I referenced, and it has the Google SRE book, because Google, but also because there are a few chapters in it about monitoring and alert fatigue that I found very helpful. And with that, thank you for your time. Does anybody have questions?

Audience: I can understand how a lot of this works in a smaller company where you have five, ten, twenty engineers, but I work at a company with more than 100 engineers. Getting everybody to row in the same direction, or paddle in the same direction, or whatever, is really difficult, especially with logging, because we have application performance monitoring, system monitoring, log monitoring, and me personally, I don't care about any of it, because I'm not on the hook if the application stops working. Okay, indirectly I am, but I'm not going to be woken up at three o'clock in the morning. So how do we get everybody else to start caring about this? Would you recommend consolidating our logging approach, or having each team take more ownership, or what?

So the answer to that, and this isn't an escape answer, I promise, is flexibility. In the past I worked at a consulting company that did ops consultancy, so the company itself was small, and we would contract out to companies of varying sizes. For a large company like you described, one of the things we did with success: that company had multiple teams, and those teams were sometimes cross-responsible for the same apps and sometimes responsible for their own dedicated apps. When we were architecting the solution with that in mind, we said: things that are logically contiguous need to be together. In that company's case, there were some teams we needed to bring together into a room, or a Zoom call in the remote case, and have a conversation to get everyone on the same page. And it wasn't a singular conversation either; this isn't a conversation you have once and everyone changes their behavior. Yeah, it takes 30 days to make a habit. It was more that we said: hey, everyone can acknowledge the dysfunction here, so let's have a conversation so we don't have all of this going on all the time, and you're not getting notified about your app when you can't take any action on it anyway. Putting it that way, using empathy, framing it as "hey, this benefits me and you," was really helpful, because people don't want to be notified about things they can't act on. It stresses them out; they have to go find the right person themselves, or else the alert just goes to /dev/null if they don't do anything about it. So that's what I would say for large companies: if it's a large company working on a singular thing, have the different teams working on that thing talk, so you can say, hey, now I no longer get alerts on your stuff and you no longer get mine; and if it's a large company working on a bunch of different things, just keep them logically separated in that way. Does that make sense? Yeah. One more time for Quintessence.