Can you hear me okay? Yeah. All right, I just wanted to preface this with: I'm the only thing standing between you and lunch. So let's keep that in mind when I drop some jokes.

I want to get some feedback from you. I just want to get an idea of who I'm talking to in the room. So, by a quick show of hands, how many of you are in an organization that's 50 people or larger? All right, keep those hands up. 100 people or larger? 200? 500? 1,000? All right, sweet. So, some of the larger organizations might be used to processes like this. For some of the smaller organizations, it might be something new to you.

The title of my talk is Brainstorming Failure. When we look back in ten years, we're gonna say, man, this was the golden age of monitoring, right? We've got a monitoring conference now in Monitorama. We've got these awesome tools that just keep springing up. People are having meaningful, useful conversations about monitoring in our communities. But at the center of all that, monitoring is really about detecting failure, right? And as we continue to build more and more complex systems, the failure modes that come up are even more complicated, more nuanced. We have to figure out: what do these failure states look like? How do we monitor for them appropriately? So we're gonna talk about a process we've been piloting at Grubhub with a lot of our development groups, really to get an idea of how we can best assess and prioritize the different failure modes, their detection, and their mitigation, preferably automated.

Before we get started, a quick intro. My name is Jeff Smith. I'm the manager of the Site Reliability Engineering team at Grubhub. We're here in Chicago. Yes, we are hiring. Yes, we have free food. Yes, it's awesome to work there, if you don't mind putting on five pounds when you start. We've got a lot of interesting problems. I don't know if any of you went to DockerCon; one of our SREs gave a talk about our infrastructure there, and I'll post a link to that talk as well. You can find me on Twitter at darkandnerdy, and my blog is All Things Dark. Gotta call it out, right? You gotta call it out.

So before we get started, I wanted to set up a quick metaphor for what some of our systems look like. Imagine you're a passenger on an airline, right? You're getting on the brand-new 947-passenger jet. It's got all these first-class luxuries, all the bells and whistles you could possibly imagine from a customer perspective. It's got top-of-the-line internals: right out of the gate, new, innovative, top-of-the-line technologies. And then they say, okay, now that we've shown you all this, we're gonna take you to the nerve center of the operation. We're gonna take you to the cockpit, where the pilots manage all of the systems. And you get in there and you're like, what? This isn't a cockpit. It's a room with a stick. Where's the instrumentation? Where's the information? How do you know that things are working?

And this is what a lot of our systems unfortunately look like. You go into the NOC and you say, hey, how's the system running? And they go, well, no errors, right? So now take that metaphor back to the cockpit: hey, how's everything going? Well, we're not losing altitude.
So I'm gonna assume everything's good. No, that's dangerous, right? You want feedback. You want information. You want bells, whistles, gizmos. You want information coming back to the operator of that system so that they can definitively say, yes, the system is operating, and here's why I know and how I know. Not just the fact that there's an absence of errors. So feedback is something you're gonna hear me talk about a lot in this talk.

But what do you measure, right, as a system? How many devs do we have? How many people identify themselves as developers? Okay, and I'm guessing the rest are ops folks. So as an operations person, if you haven't started this DevOps transformation, a lot of times you're at the tail end, right? The metrics and monitoring conversation is something that kind of happens at the end: oh, we've got the system, what do we measure? What are the first things you think of? Shout them out. CPU utilization, response times, latency, all that stuff, right? But guess what? When I'm ordering a pizza, I don't care how much RAM you're utilizing. I wanna know that when I hit the order button, everything goes through the pipeline so that my pizza shows up 25 minutes later. That's all I care about. But a lot of times that gets missed in the conversation when we talk about things to measure and monitor.

So what we've done is we've taken a process called failure mode and effects analysis, or FMEA. It's a mouthful, I know. It's a quality-organization tool that came, I don't wanna say it, but from the Six Sigma groups. But it's this great process that walks you through identifying all your failure modes and assigning particular values to them so that you know where you have exposure. Now, when you look at this process, you realize you can use it for anything, right? It doesn't just have to be monitoring. Most of the time it's used for things like production lines. But when I looked at it... I should caveat this: my wife is an industrial engineer. So when she started talking about this process years ago, I was like, wait a second, we can use this in technology. Why isn't anybody doing this? And she was like, I don't know, shut up, stop talking. So I finally got an opportunity to say I'm gonna take this process and build it and use it for software engineering. What we've been using it for at Grubhub is to identify key metrics that need tracking, the monitoring alerts that can be created as a result, and all of the necessary feedback loops that need to happen to make sure everyone has information about the system.

So we start with a cross-functional team, and that's something that is very, very, very important. It's important because it's easy to look at a system and just think about the administrators and the developers, but the system is more than that. It's the users, right? They have a completely different view of the system than you do. Where you say CPU utilization, the end user says it's sluggish, right? And that could manifest itself in a number of different ways. So you need that end user's perspective to see how they view the system. Designers look at the system differently. Developers look at the system differently. Administrators look at the system differently. And because of their different views of the system, each of them is gonna have a different idea of what failure looks like.
So it's very important that you have these cross-functional teams to bring everyone together for this process, so that you have an idea, from all viewpoints of the system, of what failure looks like. That's a very key part.

The process at a high level is to first examine the process you're actually analyzing for failure. It could be a number of things, and you can drill down as deep as you want. You could go through the high-level call chain between servers, or you could drill down into a specific function if you wanted to. There's an endless range of granularity you can do this at. The example we're gonna walk through is fairly high level, to give you an idea of what we're talking about. Once you examine the process, you brainstorm all of the potential failure points in that process. We list the potential effects of each failure, identify a scale, which we'll talk about in detail in a minute, and assign what we call the severity, occurrence, and detection rankings, which help us figure out just how much exposure we have with a particular failure mode. And then from there, we use that to develop an action plan and figure out how to get these things fixed.

So, starting with examining the process, we have a very simple, straightforward process: a user makes an HTTP call to a web server to make a payment. Our web server gets that request and makes a call to an external API provider to verify that the payment is good. We insert a record into the database to say we've done that transaction, and then we tell the client that everything is okay. So even at this high level, can anybody shout out things that could go wrong here? Anybody? Network. Network. Database. Web server's down. Third-party API could be unavailable. Garbage in. Right, garbage in. So we're already coming up with failure modes, and we've just looked at this really, really high-level process.

So what we do is we take the cross-functional team and we do exactly what we just did. We say, let's brainstorm everything that we think could go wrong, and we're not worried about feasibility at this point. We're not worried about the likelihood of it happening. If someone says, well, one time there was a full moon and there was a dude moonwalking on top of a cab and then the server went down: type that shit up. All right. Make sure "P.Y.T." doesn't play. Got it. But we brainstorm all of these different ideas. I personally use mind maps. Some people use outlines. Some people just do sticky notes. There are a number of ways to do it, but whatever your brainstorming tool is, get all the ideas you possibly can out there.

Once we do that, we categorize those different potential failures, because sometimes you'll get duplicates or things that are pretty much similar, and then we go through and list all the potential effects of each failure. Because the external API call returning a 500 doesn't mean anything by itself, right? What is the impact of that? Well, we don't know whether the credit card is valid, which means we reject the order. That's problematic, right? That's lost revenue. So we detail what the effect of that is. It could be degraded customer experience, order not fulfilled, delayed payments, whatever. Just list out all those effects, and sometimes one failure mode might have multiple effects. List them all out, because it's important to know what the worst of those offenses are.
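Purely as an illustration of what comes out of that brainstorm-and-list-effects step, here is a small sketch in Python. Every step, failure, and effect below is a hypothetical example drawn from the payment flow above, not from any real system, and in practice this usually just lives in a document or spreadsheet.

```python
# Hypothetical FMEA worksheet rows for the example payment flow.
# Rankings come later; at this stage we only capture failure modes and their effects.
from dataclasses import dataclass, field
from typing import List

@dataclass
class FailureMode:
    process_step: str   # which step of the process can fail
    failure: str        # what goes wrong
    effects: List[str] = field(default_factory=list)  # every effect the team lists

payment_flow_failure_modes = [
    FailureMode(
        process_step="call external payment API",
        failure="API is unavailable / returns a 500",
        effects=["cannot verify the card, so the order is rejected", "lost revenue"],
    ),
    FailureMode(
        process_step="insert transaction record",
        failure="database server is down",
        effects=["payment is not recorded", "order cannot be processed"],
    ),
    FailureMode(
        process_step="user submits the order",
        failure="garbage data in the request",
        effects=["order fails validation", "degraded customer experience"],
    ),
]
```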
And from there, we agree on a risk level, and this is where it'll be interesting for some of the larger organizations. At Grubhub, we're still small enough that each team can individually come up with a risk-factor scale. Basically, once we go through and rank these effects, we're gonna rank them on a scale of one to 10: one being a very minor inconvenience, 10 being something terrible. But depending on your industry and your team size, that could be different, right? For us, a 10 is a site outage: the site is down, someone's not getting pizza. For the airline industry, a 10 is the plane crashed. And that's very, very different from someone not getting a pizza. So it's important that you agree on what your scale is when you start assigning these rankings. One of the common techniques we use is to anchor the score. Say we're talking about severity: a severity of 10 means 80% of orders can't be processed, while a one means order processing is slightly delayed or some note doesn't get added to an account. That way we have an idea of what the scale is. For larger companies, it may be a good idea to come up with a scale for the entire company, because as you start dealing with resource allocation, you then have a single number that ranks all of your risks across the organization. If you have a lot of shared resources, you know centrally which specific issues need to be addressed first in terms of your systems' health.

So the first thing we do is assign everything a severity ranking. Severity is basically: if this happened, how bad is it? One being a minor inconvenience, 10 being a catastrophic failure. Some organizations reserve nine and 10 specifically for customer death. In the pizza industry, I don't know if we gotta deal with that, so I'm a little fast and loose. But someone mentioned allergies, right? That could be a thing. The minute you start serving up peanuts, you need to know. Of course, if someone ordered a peanut sandwich or something, that's kind of weird. But it could happen, you're right. And if a failure mode has more than one effect, we always take the highest-ranking effect for that failure mode. So, continuing with our example, let's say the database server is down. How would we rank that? Probably an eight or nine, right? Because now we're not processing credit cards. That's a pretty big deal.

Then we move to the occurrence ranking. The occurrence ranking basically details how likely this thing is to occur. One means it's extremely unlikely: we've got a dude moonwalking on a taxi cab, but it's never gonna be a full moon when he does it because he's afraid of full moons or something. That's a one. A 10 is something like garbage data, right? Someone's gonna give us garbage data eventually. It's gonna happen. So we might give the database failing a five, because it could happen. Anybody have a database failure this week? Yeah, you know, it happened. So we'll give it a five.

Then there's the detection ranking. The detection ranking is important because people gloss over it: oh yeah, we would know if that happened. The key is, would you detect it prior to a customer detecting it? And that customer word is funny, because a customer could be the person sitting down the hall from you, right? They're your business customer. It could be someone in Wyoming trying to order a pizza in the middle of the night. But whoever your customer is for this particular process, what is the likelihood that you would detect this problem before they find out? In a lot of cases, you're like, wow, now that I think about it, I would never detect it before the customer. I mean, let's be honest: how many people have a system whose monitoring is basically the customer calling and yelling? Oh, come on, we're at DevOps Days. You can be honest, we're friendly. There you go, guy back there. You all know.

From there, we calculate the risk priority number. The risk priority number is basically a value we can assign to a risk so that we know where it falls in the sea of other risks. Because again, the guy moonwalking on top of the taxi: yeah, we should address that, but we probably shouldn't address it before the database failing, right? So what we do is take the severity ranking, the occurrence ranking, and the detection ranking, multiply them together, and that gives you the RPN.
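To show how those three rankings combine, here is a minimal sketch of the RPN arithmetic. The rankings below are invented purely for illustration, not real assessments, and a spreadsheet that multiplies three columns does the same job.

```python
# Illustrative only: severity, occurrence, and detection are each ranked
# 1-10 by the cross-functional team, and the RPN is simply their product.
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk priority number: higher means more exposure."""
    return severity * occurrence * detection

# (failure mode,                        severity, occurrence, detection) - made-up rankings
assessed = [
    ("database server is down",                9,          5,         3),
    ("external payment API returns a 500",     8,          4,         6),
    ("garbage data in the order request",      4,          9,         7),
    ("moonwalker takes down the server",      10,          1,        10),
]

# Sort so the failure modes with the most exposure float to the top.
for name, s, o, d in sorted(assessed, key=lambda row: rpn(*row[1:]), reverse=True):
    print(f"RPN {rpn(s, o, d):3d}  {name}")
```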
When you go through this process, I promise you there are gonna be things that just shoot up to the top, and it's so enlightening and so refreshing to be like, wow, it's very clear what we need to work on.

So from there, we develop an action plan. We prefer to prioritize solutions that are self-healing. Because now we have a failure mode, we know roughly what it looks like, and we know what the effects are; now we have to figure out how to actually go about solving it. The goal is to reduce that RPN value by lowering the detection ranking (that is, improving your ability to detect the issue), lowering its severity, or lowering the likelihood of occurrence. There could be a number of ways to do that. Like, we had the database at a five. How could we mitigate that as a potential issue? We could set up master-master replication. We could switch to Cassandra. We could lower the severity by saying that if the database isn't available, we approve the payment, drop it onto a queue, and process it later. There are all these levers we can pull. But the idea is to think about the solution and ask: how can we either increase our ability to detect this, lower the severity, or lower the likelihood of occurrence?

Now, nine times out of 10, what ends up coming out of this are metrics. You have these business processes, and you need to identify when a particular action has happened. So a lot of times you'll go back to the developer team and say, hey, when you attempt to insert a record into the database, send a tick to New Relic or send some sort of feedback in the log. There are a number of different ways you can approach it, but now that it's out in the open, you're actually thinking about it and coming up with an actual developer-based solution, as opposed to spaghetti code doing selects on your database as a monitoring solution. Not that I've ever done that. Ever.

And then ensure you have a feedback loop. As these important operations happen, how do you know they're happening successfully? Again, going back to metrics and counters: knowing that, yes, an order was placed and a payment was received for that order. The difference between those two should always be zero.
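As a hypothetical sketch of that kind of feedback loop (the counter names and threshold are made up, and in production these ticks would be one-liners shipped to whatever metrics backend you use rather than a local dict):

```python
# Count the business events as they happen and watch the difference between them.
from collections import Counter

events = Counter()

def order_placed(order_id: str) -> None:
    events["orders.placed"] += 1        # tick when an order comes in

def payment_received(order_id: str) -> None:
    events["payments.received"] += 1    # tick when the payment clears

def check_feedback_loop(threshold: int = 0) -> None:
    # Orders placed minus payments received should hover around zero.
    drift = events["orders.placed"] - events["payments.received"]
    if drift > threshold:
        print(f"ALERT: {drift} orders without a matching payment")
```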
When they start to get out of whack, you're like, oh, we've got some sort of problem. We've got a lot of orders and not a lot of payments; that's bad. Or you may be monitoring something like how many of our transactions are cash versus credit card. If the cash number shoots up, you're like, hmm, there's something going on with credit cards, because suddenly we've got 40% more cash orders than normal. Maybe someone found a coupon code, I don't know, it could be anything. But identify all of the people that are a part of your team and make sure everyone has the information necessary to do their job, and to do it with a certain level of proficiency.

And as you go about this, you'll come up with leading and lagging indicators. A leading indicator is something that can give you a hint that the system is about to enter a particular state, so you know about it and can see it happening before everything goes FUBAR. Lagging indicators are things like reporting: accounts receivable, something like that. Leading and lagging indicators are a good way to bucket the types of metrics and monitoring you're building, so that you have an idea of how quickly you're gonna be able to react to a thing and what those controls and reports are all about.

So, a brief recap: examine your process, assemble cross-functional teams, brainstorm all your potential failure modes, calculate your RPN, develop action plans to reduce risk, and then profit. Thanks.

All this stuff you mentioned, any particular tools you're using to make it happen?

Yeah, I wanted to assemble a blog post before I got here, but... kids. I use MindNode for the brainstorming portion, and I will post a link in the Slack room in just a little bit to a Google Doc you can use as a template; it will automatically calculate the risk priority number and give you a structure for going through that conversation.

No, I mean, you mentioned a lot of things about making monitoring better and stuff like that. Any actual tool for monitoring?

Oh, you want an actual tool. Honestly, using anything is better than using nothing. We're a client of New Relic. We use Datadog. So we do a lot of our monitoring and metric collection just as one-line things where the developer can say, oh, I did this action, and we send it off to New Relic or Datadog. And that's a great way to do it, because you just sign up for an account and you don't have to worry about infrastructure or anything like that.
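As a rough sketch of what one of those one-liners might look like, assuming the datadogpy client and a local DogStatsD agent (the metric name and tags here are made up for illustration; New Relic has an equivalent custom-metric call):

```python
# Hypothetical example: a developer tags a business action with one line.
from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)  # assumes a local DogStatsD agent

def record_payment_attempt(succeeded: bool) -> None:
    outcome = "ok" if succeeded else "failed"
    # One line per business event: "I did this action."
    statsd.increment("payment.attempts", tags=[f"outcome:{outcome}"])
```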
I will say that it's almost a DevOps Days troll to have an open space on monitoring; you can do that and see half the conference show up.

So before any of us go into work and say, I saw this talk and we should take this FMEA approach with RPN, will you be able to also post in Slack or on the blog some of your best references for getting started on the literature? To back it up a little; not that you aren't authoritative on this, but a little bit more to go on.

Yeah, luckily there's a lot of literature out there, because again, this isn't anything that I've invented, it's just something I've co-opted. So I'll definitely point you to some of the quality organizations and some of the documentation there. It is pretty thick, right? One of the things I would recommend is, as you go through it, don't feel like you have to do everything the quality organizations dictate. Take it in small chunks. This is an iterative process that we're going through internally, too. It started off a little lighter and got a little meatier and a little meatier. So incremental improvement is always good.

So as you increase scale... it was a great presentation, by the way.

Thank you.

As you increase scale, or think about increasing scale, do you run a risk, since it sounds like a fairly manual process on the back end, of a signal-to-noise problem, say on the ones through eights? How do you think about that as you scale up? How do you make sure that you're not spending too much time on the ones, twos, threes, fours, or fives as opposed to the sixes, sevens, eights, whatever? Am I making sense?

I think I know what you're saying. Basically, in our organization all of the metrics and monitoring that we collect involve some sort of developer resource. Because of that constraint, we know that we may only be able to lop off the top three or four. But because of that risk priority number, we can figure out which ones are the most important to tackle. The other ones are risks that we know about but just haven't addressed yet. And then as the developer teams get free cycles in their sprints, they can continue working through the list. So this isn't a gate where we say, until you address all of these risks, we're not going to production. It's really about surfacing those risks and saying, okay, what are the things that we absolutely have to take care of, and what are the things that we know about and can take care of later on? Because whether you do this process or not, those risks still exist.

Have you tried putting a dollar value on any of the RPN numbers to help put these things into sprints as features for the product?

That's a great question. We haven't done it yet; that is definitely something we're doing in phase two. We're getting the teams used to the idea, and the nice thing is, when we went through this with one or two teams, each of them said, this was great, I would never have thought of this stuff. So now they're coming to us and saying, can you walk us through the FMEA process? So phase two will definitely be about trying to assign a dollar value. Right now, the closest thing we do is that when we talk about severity, we frame it in terms of order loss or traffic loss or something like that.

Have you tracked how long it took to implement this for a given system? Like, hey, it took this long to implement, and we noticed that the more we monitored, the more we put into the system, failures went from 10 to zero overnight. That sort of graph you could throw up.

So we're still fairly early on in the process, so we don't have a lot of that feedback yet. One of the teams that's probably furthest along has started to see a drop in customer alerts. We basically have an app that is used internally, heavily, but impacts a lot of our driver network, and we would see tons of calls about driver incidents. But now, by bringing the coordinator on our side into this process, they identified a bunch of things where they said, you know what, if I had this data, I could head off these issues. I could shut down a market. I could increase the number of drivers in a market. They can react quickly in a way they couldn't before, because they didn't have that sort of feedback.
So we're definitely seeing a positive impact, and our goal over the next six months is to solidify that into, I don't know, maybe even some sort of publishable paper, but we'll see.

Wow, that is really cool. And Jacob was also hinting at some of the stuff that was in his talk last year at DevOps Days, which is online, linked to from the website. Any other questions before we head to lunch? Thank you, Jeff.