Thanks for coming. My name is Matthew Beckman, and this is There Is No Root Cause: Emerging Behavior in Complex Systems. A tiny bit about me: one of the many roles that I play right now is developer advocate with VictorOps. VictorOps is an incident management platform, much like OpsGenie and PagerDuty. Most of the talk today is based on the work that we've done there, trying to help teams who are adopting DevOps move their incident management practice forward. I'm also a technology strategist with drys.io, trying to help teams of all different sizes, at all different steps in their DevOps journey, essentially get better at whatever they're trying to do. I kind of joke that I'm an 18-year DevOps veteran. Once upon a time, I was an infrastructure person who just didn't hate developers. I missed that memo, apparently. So DevOps is something I'm really excited to see become the norm, or increasingly the norm, for teams, because the opposite way of operating makes very little sense. So, super briefly: VictorOps, like I said, is an incident management platform. It's capable of ingesting alerts from multiple upstream sources, or events, whatever you would like to route to someone, and applying standard routing logic, schedules, escalations, and rotations. It also has the ability to transform those alerts, inject context, inject graphs, whatever might be necessary for your incident management team to more quickly respond to and resolve incidents. So with that said, let's start. I imagine everyone is familiar with root cause analysis in name, if not in practice. Root cause analysis is a framework that helps us understand what went wrong, who or what was responsible, why it happened, and, probably most importantly, how we can make sure it never happens again. Root cause really had its roots.
I'm gonna make a lot of puns today, and I'll just apologize as a starting point, but root cause has its roots in early NASA, so it's been around for close to 60, maybe almost 70 years. It grew up in manufacturing and electrical engineering practices in the early part of what you might consider the information age, but it's really matured and flowered in the last 20 or 30 years as it's been applied to information systems during the internet age. If you've never done formal root cause analysis, there's a mental switch that you have to flip right out of the gate: it isn't really about finding one source of causality. Root cause does invite us to consider multiple causal factors, but it all essentially reduces down to: something bad happened, and now let's talk about what the causes of that badness were. This is a fishbone diagram, and root cause invites us to list out the various things that we think might have contributed to this undesirable state or this brokenness. You can build on this and get into a sentence-diagram type of approach where you add additional data points to support that people were causal, or that the systems were causal, or what have you. And this is sort of how most teams approach root cause analysis, if they're doing it in a formal way. And that's okay, it's pretty simple. I'm gonna spend the first half of the talk speaking about how root cause has come to be, and the second half telling you why you shouldn't use it. When you think about the dawn of the information age and the systems that many of us still run, the systems that really defined the growth period of root cause analysis in IT, it's straightforward. There are three tiers: there's a web app, a web server that's just gonna display stuff, an application server where there's a great deal of logic, and a database where there's an awful lot of constraints and maybe some logic also. But it's kind of simple, right?
There's not a lot to it. I would argue that there's actually quite a bit more to it, and one of the first critiques I'll offer of root cause is that, as it has been implemented by development teams and operations teams, they narrow their root cause to their domain of expertise. So the developers are focused on those three tiers, the ops people are focused on everything around them, and both completely ignore the complexities that arise from those two separate concerns being joined together to develop and deliver an application. Then go ahead and factor in the internet. All of our applications, or probably most of our applications, are constantly being accessed by users in both benign and malevolent ways, and that adds an entirely new level of complexity to the systems that root cause has been trying to help teams manage. But again, I sort of dismiss the idea that root cause has been sufficient to really account for the complexity that we've built. Whether you accept that that's a simple architecture or a complex architecture, I want you now to consider the actual ways that we've managed these systems for the last 20 years. And again, many people are still doing this. Whether that was a complex system or not, waterfall development methodologies encouraged us to manage the complexity down by locking state, right? We tried to build systems that were static. Let's have a system that does not change for 10 months out of the year. Maybe we'll have an operating system patch. Maybe we'll have a critical bug fix that goes in here. But however complex or uncomplex that pile of stuff underneath is, it's not changing until we get through our waterfall project, and then we're gonna have two months of utter chaos while we make these deployments and 10 months of change floods onto the system at once. But the good news is we come out the back to a new steady state that we can carry forward for another 10 months.
And we can all walk around with this idea that our systems are static entities that aren't changing. This is the reality of the architecture and the development approaches that have governed the growth of root cause analysis in our industry. It sounds ridiculous in modern thinking, but there were really some good reasons, right? There were a lot of constraints in the 90s and the 80s, and even the 2000s still, that gave us good reason to build systems that were simple and to have development approaches that managed change in this way. There's a list here, but for me it used to really be a thing that you couldn't do continuous deployment because copying artifacts was limited by the speed of the hard drive, right? Good news: none of these things are really true anymore. And that is what has, I think, set the stage for us to adopt all of the DevOps tenets that hopefully we're pretty passionate about, around continuous integration, continuous deployment, and continuous improvement, in much more complex systems. Root cause analysis, then, focuses on a fairly static model, which I don't think is great, but moreover I think it focuses teams on a very binary way of thinking about the world. Things are working or they are broken. We start in root cause with "something is broken" and now we have to understand why it has failed. I think that we all understand in a DevOps context that there's a great deal of nuance between good and bad, broken and working, problem and expectation. I'm gonna talk more about the binary thinking model in root cause, but these are just some of the words that you see when you look at RCA as it's used in teams and in the research behind it. So let's compare and contrast for a second. Here are our three tiers that sort of governed the way RCA came to be. This is maybe a little bit more of a modern architecture. Don't try to figure out what each one of these boxes is.
This is just a reference implementation for microservices in AWS. We have lots of tiers. We have lots of boxes. We have lots of things going on here that belie a kind of complexity that just isn't present in that standard three-tier architecture. When you really start to think about the way we handle requests, how traffic flows through our systems, how user stories are realized in a modern, complex microservice architecture, just being able to get your head around how requests are processed has now become an entire area of practice that never existed 10 years ago. So it adds an additional layer of complexity. Not only is our architecture or infrastructure complex; the way that we manage requests becomes complex too. Let's talk for a second about the change vectors that we introduce. This is a system that I bet everybody knows isn't actually static, right? We're gonna have A/B tests going on within this environment. We probably have A/B/B/A and A/B/C/D tests. We've got a variety of tests, each one of which represents a different state that your system is accounting for, and, as a DevOps practitioner trying to understand patterns and behavior in this system, a variance that you have to account for in your mental model and in your approach to managing these systems. There are feature flags all over the place that are being activated and deactivated all willy-nilly, and you can try to control that in certain ways, for sure. But the reality is that's a separate change vector that you have to account for when you're trying to understand complex patterns. Certainly we've got deploys going on, maybe five, maybe dozens a day, who knows, but the state of the system is being changed from a code standpoint in a frequent way. And then there are all of the operational things happening under the covers in infrastructure, whether that's separate from your deployment process or not. You get to a place where there's a great deal of change occurring in these systems constantly.
And let's not forget the internet. The internet's still there touching all of these things, and I don't really focus on the monsters in an evildoer-hacker sort of way here, but more on the ways that users, and user applications, are changing. We have all been present when perfectly excellent code and wonderful systems meet a user doing something we didn't expect, creating a pattern that we don't particularly like. Hopefully you'll agree that our systems now are far more complex than they were 20 years ago. So as I tried to map that to this, I just hit a record scratch. It's not a tree. This is the standard mental model for root cause analysis: a beautiful tree above us that we want to foster and make healthy, and the roots underneath that we're gonna look at to try and manage it. Our systems aren't trees, they're forests. We have diverse, rich ecosystems of applications, each operating along narrowly defined guidelines, but which together create an ecosystem or an environment where our user stories and our applications can come to life. Let's talk for a moment about what I mean when I say emergence in complex systems. Emergence refers to the existence or formation of collective behaviors: what the parts of a system do together that they wouldn't do alone. I'll pause here: when you start digging into emergence and complexity, there are a lot of analogs to biological systems and artificial intelligence. I don't think anybody's really cracked the AI nut, but I think that in our systems today we have enough individual component parts operating together to create truly emergent patterns. So we talk about properties and behaviors. Properties and behaviors of a system arise from both the defined structures that compose the system, which you could think of as the discrete units that you're deploying, or the applications, and the interrelationships between the system's discrete parts.
I invite you, just for a moment, to consider your microservice as part of the whole, inclusive of all of the change vectors that you're introducing on a regular basis, to start to get your head around just how complex and how emergent the patterns and behavior are that we're trying to manage. Let's do some more compare and contrast. Root cause language is very much focused on problems, prevention, causes, and actions. When you talk about emergence, these are, I think, first of all, friendlier terms. They're terms that are more inclusive, that don't lead us to try and establish blame and think of things in very negative ways. We simply walk into the reality, I almost said the problem, we walk into the reality and know that there are behaviors that we have to manage. There are new patterns that will emerge that we have to deal with. Some will be desirable, and that's great, and we can put more effort towards those patterns. And some will be undesirable, and we are probably gonna get a page or an alert about that and have to do something about it. So our shared reality today is massive complexity and dramatic change vectors, and that creates a reality in our systems and our applications that is defined by emergent behaviors. Are we doomed? Yes, that's the end of my talk, thank you. No, the good news is it's not utterly, utterly hopeless. But a word of caution: if you go back to your office later today and you say, hey, I was in this talk at this great conference and this guy told us that the systems are now too complex for us to manage, you're gonna get some pushback. It's a very unnerving reality to seat yourself in, that these systems we're creating are going to do things that we will never predict, and we still have business rules and requirements to adhere to. We still have to deliver those services that are running on the systems that we've just created. But I wanna talk now about Cynefin. First and foremost, I'm not Welsh. This is a Welsh word.
I've been taught that it's pronounced "kuh-NEV-in," so if I'm pronouncing it wrong, I apologize. Cynefin means ecosystem in Welsh, apparently; again, I'm not Welsh. Cynefin is a framework that was developed by Dave Snowden when he was working at IBM to really help teams understand complex patterns and behavior. Originally it was focused on helping IBM manage their intellectual capital, which is not a system, it's not an application, but which I'm sure we can all appreciate is a wildly complicated thing. Cynefin is a relatively new framework. It came out of that research in the early 2000s. So as a starting point, it understood, or Dave understood, the kinds of complexities and the kinds of systems that were being built. And it draws on research in systems and complexity, network and learning theories. So what does Cynefin actually look like? We have four quadrants, and you start in the bottom right and move counterclockwise around: the simple quadrant, the complicated quadrant, the complex quadrant, the chaotic quadrant. And the idea with Cynefin is, when you're experiencing a new pattern or behavior, whether it's desirable or undesirable, we start with: which quadrant does this pattern or behavior map to? Is it simple? Is it chaotic? Is it complex? And then we have a three-step process that informs what we should do with that pattern, given which quadrant we've mapped it to. I'm gonna talk about all of these in more detail. There's a fifth, not a quadrant, but a squishy middle part there, called disorder, which is a really unpleasant place. You do not want to be in disorder if you can avoid it. But let's dig into what each one of these means. The simple quadrant is, spoilers, simple. It's the bread and butter of your environments. It's a disk filling up. It's a packet that gets dropped. It's a system in AWS getting rebooted unexpectedly. These are the known knowns. The best DevOps teams have automated the responses to these.
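As a concrete illustration of what automating a simple-quadrant response can look like, here's a minimal sketch, not from the talk: the threshold, the log path, and the cleanup step are all illustrative assumptions you'd tune for your own environment.

```python
import shutil

# Illustrative values, not from the talk: tune for your environment.
USAGE_THRESHOLD = 0.90          # act when the disk is 90% full
LOG_DIR = "/var/log/myapp"      # hypothetical log location


def disk_usage_fraction(path="/"):
    """Sense: what fraction of the filesystem at `path` is in use?"""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def needs_cleanup(used_fraction, threshold=USAGE_THRESHOLD):
    """Categorize: is this the known-known 'disk filling up' pattern?"""
    return used_fraction >= threshold


if __name__ == "__main__":
    if needs_cleanup(disk_usage_fraction("/")):
        # Respond: a real remediation would rotate or delete old files
        # under LOG_DIR here, and page a human only if that fails.
        print("disk pressure detected, running log cleanup")
    else:
        print("disk healthy, nothing to do")
```

The point is not the specific script; it's that a known known has a response you can sense, categorize, and execute without waking anyone up.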
Excellent DevOps teams, when they recognize that a pattern meets the simple quadrant criteria, write code to make the response to it automatic. But if you're not there and you've got a pattern that's in simple: you sense, so we become aware of what's going on; we categorize based on what we've sensed, this is a disk problem; we respond, we delete the logs. These are very much, like I said, known knowns. The complicated quadrant is then thought of, maybe, as known unknowns. Things that are probable. In your meetings, you've talked about this as a potential outcome or potential pattern that we might see. There's more going on here than a simple response can really deal with. But it isn't really at the level of true emergence. In this quadrant, we're encouraged to first sense, again, understand or get some data about what's going on with this pattern. Then we analyze it. This is human interaction, you or your team really thinking about what's happening in this pattern. And then we respond. I like the metaphor of a busy harbor for complicated. There are a lot of expected outcomes in a busy harbor, but a boat can bump into another boat, and pilots can do unexpected things, and you have a fairly complicated reality. Traffic, too: traffic jams are very much complicated patterns. Complex is where we really start to get to emergent behavior. A complex pattern or complex behavior is one where we have some eyes on the data that we might need to understand what's going on, but we don't have all of it. And so the first thing that we're invited to do in Cynefin is to probe. We don't have metrics for this. We don't have the right dashboard. We need to understand it. So we need to go and get some information to help us further clarify what's happening. Then we sense the output of that, and then we respond, and we iterate. Riots and large crowds, and the way they behave in a human sense, are very much in the complex quadrant in the Cynefin framework.
The fourth quadrant is chaotic. We've probably all had at least one of these patterns in our careers. It's buckle-up time, man. This is gonna be crazy and we don't really know what's going on. I think Cynefin has great advice here, and it's: start with action. Try to disrupt the pattern. Go restart something. Go turn something off. Go do something and see what happens. The result of that something is sensed, and then you further respond based on that, right? If you don't have eyes on it, if it's so complex you don't even know where to start, you're in the chaotic quadrant. Get a coffee and call some friends, because you're gonna be there for a little bit. But start by acting. Do something. The last quadrant, or piece, that's important to talk about in Cynefin is disorder. Disorder, in my interpretation, really is a social thing. When you and your engineering friends are sitting around trying to determine which quadrant your new pattern or behavior is in, and nobody can agree, and we start arguing about it being your problem or my problem, or I'm in this department and you're in that department, we're in disorder, right? What you have to do when you can't achieve consensus in Cynefin is start by reducing what we're discussing. Maybe we can't agree at a large scale about what we're talking about. Let's compartmentalize the problem or the behavior into smaller pieces until we can get some consensus. Then we can analyze that reduced view of the pattern, and then we can iterate, and we can move from disorder at least into chaos. It starts with some basic consensus in the team. Not really sure why I picked a picture of an elephant and some gazelles, but it looks kind of disordered. What's incredible about Cynefin, as contrasted with root cause analysis, is that it's not only present for you in the moment, but it helps you create a roadmap for managing these complex systems.
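To summarize the four quadrants, the decision logic reduces to a small lookup. Here's a minimal sketch in Python; the function and variable names are mine, not part of the framework itself:

```python
# The three-step sequence prescribed for each quadrant; disorder is
# handled differently, by reducing the problem until consensus exists.
QUADRANT_SEQUENCES = {
    "simple": ("sense", "categorize", "respond"),
    "complicated": ("sense", "analyze", "respond"),
    "complex": ("probe", "sense", "respond"),
    "chaotic": ("act", "sense", "respond"),
}


def triage(quadrant):
    """Return the recommended three-step sequence for a mapped pattern."""
    try:
        return QUADRANT_SEQUENCES[quadrant]
    except KeyError:
        # No agreed-upon mapping means the team is in disorder:
        # compartmentalize the pattern into smaller pieces and retry.
        raise ValueError(f"disorder: no consensus on {quadrant!r}") from None
```

So `triage("complex")` returns `("probe", "sense", "respond")`: go get information first, then sense the output, then respond and iterate.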
In Cynefin, if you take any pattern or behavior from any quadrant, tabling disorder for a minute, but any of the real quadrants, with knowledge and practice you can move those patterns towards more favorable quadrants. You may start with something that was wildly chaotic, but through understanding and through practice of managing those patterns and their associated constituent parts, you can move it into something that's merely complex, and further down the line. Now, that isn't to say that all patterns can be reduced to simple, but it encourages you to seek out understanding and get better at managing the patterns, because there are patterns that you're gonna be able to move all the way down into the simple quadrant. There's a warning in Cynefin that if you start slacking off, if you start skipping maintenance windows, if you stop doing the best that you can to manage these patterns, retrograde action occurs. Patterns that were once in the simple quadrant become more complicated. Disks filling up on lots of servers will create all kinds of cascading problems for you that you really could have managed better, keeping those behaviors in the simple quadrant. Maybe the last thing that I wanna encourage you to think about with Cynefin is that adopting it is not a complex thing. I think of this in three basic steps. In the moment, you have received a page, you are responding to a problem. Ask yourself: which quadrant does this map to? You're gonna be wrong the first dozen times, and probably every time after that, but you'll get better at being less wrong with practice. And you can encourage everyone in your team to do this. Just as we're talking about patterns and behavior: which quadrant are we in? Is this complex? Do you think it's complicated? All right, let's have a quick chat. Cool, now we know what to do.
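If you wanted to make that movement between quadrants concrete, you could even track each pattern's classification over time. A hypothetical sketch, assuming a bookkeeping convention I'm inventing here (a list of quadrant names recorded at each incident or review, oldest first):

```python
# Quadrants ordered from most to least manageable; a lower index
# means the pattern is easier to handle.
ORDER = ["simple", "complicated", "complex", "chaotic"]


def movement(history):
    """Compare a pattern's first and latest quadrant classifications.

    `history` is a list of quadrant names, oldest first. Returns
    "improving" when the pattern has moved toward simple, and
    "retrograde" when it has drifted the other way.
    """
    first = ORDER.index(history[0])
    latest = ORDER.index(history[-1])
    if latest < first:
        return "improving"
    if latest > first:
        return "retrograde"   # the slacking-off warning from the talk
    return "stable"
```

A pattern recorded as chaotic, then complex, then complicated would come back "improving"; a disk that used to be a simple known known and now cascades would come back "retrograde".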
In your post-incident review, focus certainly on all of the things that you might have focused on in a root cause analysis model, but also ask yourself: how well did we manage the pattern? Did we start with complicated and end with complicated? Were we able to move complicated down to simple? Did we start with complicated and discover it was actually chaotic? Boy, I'll bet that was a day. You get to talk about the way you approached managing the pattern as part of post-incident review, and that moves your Cynefin practice forward. And then lastly, as I said, Cynefin encourages you to think about your roadmap. In your sprint planning, you've got to identify the patterns, where they sit in each quadrant, and what we can spend time on this iteration to move them into a more manageable place for us. How can we, in a realistic way, manage down the complexity of our environments without doing things like shutting off CD, freezing feature flags, and trying to lock stasis into our complex environments? Let's just do a side by side. Root cause analysis invites you to search for simple causality, as in: what went wrong? It's both a static model itself and born of static modeling, and it encourages teams to continue to move forward in a static mindset that is no longer relevant to modern architecture. Root cause analysis encourages you to think about binary truths. It's good or bad only. There is not a lot of nuance present in that framework. RCA is really only something people use in the post-incident analysis, after the fire is done and we're all sitting around talking about it. Then we get out our fishbone. You rarely see incident management teams breaking out a fishbone as a first starting point. And last but not least, it focuses on blame. Whether it is inherent in the framework or just how the framework has been adopted in our teams, RCA is about figuring out whodunit and then making sure that person doesn't do it again.
That's a terrible anti-pattern for a DevOps team. Meanwhile, Cynefin is a highly dynamic framework. The framework itself is dynamic, and that maps nicely to dynamic systems. It expects change. I'm someone who believes that expectations matter greatly, and if you expect that your system is unchanged for 10 months, you're going to have an unpleasant awakening. If our framework instead encourages us to recognize that things are going to change, that we're going to run the gamut from simple to wildly complex to complete disorder, at least our expectations are properly set for the reality of a modern DevOps practice. Cynefin embraces emergence. It expects it, and it helps you figure out ways to move from emergence into more understood and manageable patterns. It's present in the moment. The first time you're aware of a pattern or behavior that you need to think about, you have a guide for what you should do. Is it simple? Is it complicated? Is it chaotic? And what should we do about that reality? And lastly, along with that, it's a call to action, both in the moment and in terms of how we manage our roadmaps. There is an implicit need within Cynefin for you to work these patterns, to understand them better, and to practice managing them so that they can become more manageable and less disruptive to your environments. When you get there, I think teams that have adopted this have a very enlightened view of the world. There is no broken. There are patterns that are desirable and there are patterns that are undesirable, and we have an expectation that both kinds of patterns are going to be created, either intentionally or unintentionally, in our world. This helps your team have the right mental model to approach managing modern, complex, emergent systems. I'm gonna leave you with this quote from Voltaire, this dude from the French Enlightenment.
He was not talking about microservices when he said that uncertainty is an uncomfortable position, but certainty is an absurd one. And I leave this here as the parting thought, because I find in almost all of the conversations that I have with teams that certainty is this thing that they seek. And whether that's coming from management down to the engineering team, or whether it's because of root cause living in the engineering team, we all wanna be sure that this isn't gonna go wrong. We wanna know what's gonna happen. And I hate to be the bearer of bad tidings, but that just isn't real. You can be sure within some range of probability, and you can be confident within some range of variables, but you're not gonna be certain about anything in the modern, complex, emergent systems that we're building. That's all I have for us today, folks. Thank you. Yeah, so the question was: can I elaborate on the difference between simple and complicated? I think simple I won't try to elaborate on a lot more. It's things that you know are likely to happen, right? And again, the example there is maybe a disk fills up, right? The unknown unknowns, or, excuse me, the known unknowns in the complicated quadrant, just to stick with storage, might be a RAID array failing, right? That's a thing that we know could happen. It's not probable, but there are ways it can happen, either through mismanagement or through just the reality of systems breaking or exhibiting poor behavior. That creates a different pattern. You're not gonna automate a response to corruption in a RAID system, but you are gonna have to manage that kind of system. Did that help? I probably should have had more examples lined up, but this is part of what's great about Cynefin: you can talk about this within your team and determine what complicated means and where the barriers are between those two.
Complicated is things that you know could go wrong, but you haven't scripted responses to. You don't have formal responses for. So you talked a lot about your analysis and response process being sort of separate from the system that you're analyzing and responding to. To me, a big part of DevOps is about making systems that degrade gracefully, that are self-healing, that are antifragile. So how do you square that, or how do you incorporate that into your analysis process? It's a fantastic question: where the system might actually be part of the complexity, the system trying to fix itself, in a sense. Yeah, so I think, at least the way I map that is: as we are trying to design a resilient system, we have to be mindful of the patterns that that system may encounter, right? And if there are simple patterns that that system may encounter, then we can very much make them self-healing, right? I think that you've properly identified that that very act adds complexity to the systems that we're trying to manage. And we have to, at least for me, recognize you're not gonna get it all. You're not gonna make something perfectly self-healing, but be mindful of the patterns that you can self-heal. And then, when things don't go the way you expected, you have a framework to help you manage that new reality that you've either created for yourself or that has been created for you. You talked about, with practice, being able to walk clockwise on the graph of complexity. So let's say, hypothetically, that I have a real-time, fairly fault-tolerant application that breaks, I don't know, once a year, but when it does, it tends to act chaotically. Are there any ways that, given the relative absence of failures, we can still practice Cynefin so that we could still walk clockwise on that graph? Yeah, so I think that that's a great question. Let me go back here for a second. So, well, I was just trying to get to it.
Yeah, so the question was essentially: the system is so rock solid that it very rarely fails or exhibits an undesirable pattern, and when it does, it's wildly chaotic, but because it happens so infrequently, we don't get a lot of chance to practice. I think this is where game days are a huge thing that a team can adopt, particularly where systems are, I'll say, overly stable. The expectation for me is that systems are gonna be exhibiting undesirable patterns frequently, and if they aren't, that means you're blind to them. And I think in our chaotic quadrant, we're blind to it because it happens so rarely that we just don't give it time, right? So if, through a game day, you can actually sit down and think about complicated patterns, known unknowns, that might occur, that cause this system, your system, to behave in a chaotic way: let's take several of those and let's game them all at once. Let's try and create some chaos to get a little bit of understanding, and at least move that reality from chaos to complex. All right? Thank you, everyone. Thank you.