So, yeah, I've been excited to give this talk for a long time. This has been kind of my jam for a while: cognitive bias and how it impacts our judgments and choices. My name is Jason Hand. I'm at VictorOps; VictorOps does on-call and incident management. I know a lot of you have already stopped by the table and had some great conversations. Anything we talk about here today, maybe we can do some more open spaces on it, or feel free to come by the table and we'll talk more. That's me on Twitter; anything I say, for sure, share it out there with everybody else.

The thing I wanted to talk about today is a fairly complex topic, and it falls in line with a lot of what we do as operations people, as developers, as general DevOps people: we operate in complex systems. That's the main theme here. And at some point, something's going to break. Things just tend to break. At some point our servers light up. It might be a small incident; it might be some sort of SEV-1 outage. But the point is, sooner or later, something bad is going to happen.

What we've started to learn, or what we're beginning to understand, is that time to detect and time to acknowledge matter a lot, but time to repair, or time to recover, or time to resolve, whatever term you want to use, is probably more important than anything. And while a lot of companies and a lot of people still do root cause analysis, a lot of high-performing teams have actually stopped doing it. They've realized that our systems are far too complex to distill an incident down to one simple problem, one thing we can identify as the thing that took out our system. There are going to be a lot of contributing factors. If you do want to distill it down to one thing, it's just change, and we want that change; it's part of what we're doing all the time. So with all these contributing factors, we want to know more about what's going on. I could do another 30-minute talk on root cause analysis and why it's a complete waste of your time, so maybe we'll do an open space on that.

We're moving away from strictly predicting and preventing problems and more toward recovering from them: detecting them, acknowledging them, recovering from them, and then, probably most importantly, learning from them. It's really important in this day and age to have a mindset of continuous improvement. The byproducts of continuous improvement are things like continuous delivery, continuous integration, and of course becoming a learning organization, which is the big thing most companies really need to focus on lately.

Because of this, our goal is really mean time to repair rather than mean time between failures, which Ken's brought up a number of times in funny ways. But it's true. We want to be able to detect what's going on and recover from it fast. We've all heard "fail fast"; it's a common phrase that gets thrown around in our industry. And what I really think when I hear or read "fail fast" is that we want to recover fast. We want to understand what's going on and be able to do something about it quickly.
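To make those terms concrete, here's a minimal sketch of how you might compute time to detect, acknowledge, and repair from an incident's timestamps. The incident record and field names here are hypothetical; in practice these timestamps would come out of your monitoring and on-call tooling.

```python
from datetime import datetime

# Hypothetical incident record; in practice these timestamps come from
# your monitoring and on-call tooling.
incident = {
    "started": datetime(2016, 4, 17, 2, 3),        # fault actually begins
    "detected": datetime(2016, 4, 17, 2, 11),      # monitoring fires an alert
    "acknowledged": datetime(2016, 4, 17, 2, 14),  # a human takes ownership
    "resolved": datetime(2016, 4, 17, 3, 42),      # service restored
}

time_to_detect = incident["detected"] - incident["started"]
time_to_acknowledge = incident["acknowledged"] - incident["detected"]
time_to_repair = incident["resolved"] - incident["started"]

print(f"TTD {time_to_detect}, TTA {time_to_acknowledge}, TTR {time_to_repair}")
```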
But then most important is to conduct some sort of learning review or postmortem, some post-incident analysis, so that we can identify areas of improvement. We can look for those target conditions and make incremental changes. Like Courtney mentioned, boiling the ocean is not what we're out to do. We're just trying to find small incremental changes that improve our situation, our processes, and our systems overall moving forward.

So speed becomes a big thing we're focused on; we're trying to get this done really quickly. But my question today, and the point of this talk, is: what happens when you're focused on speed and you have all these other constraints, time, SLAs (don't get me started on SLAs), lost income, angry customers, people on Twitter? Suddenly there are other things we have to consider on top of the pressures of time and everything else associated with downtime. And we often get to where this is how a lot of us feel when there's an outage, which is kind of how I feel right now. Anybody watch the new Muppet Show, by the way? It's awesome. Really, really good. Anyway, back to the topic.

It turns out that under these constraints of speed and everything else, we begin to rationalize rather than seek out rationality in our decisions and choices. We take these very systematic mental shortcuts in the name of speed and efficiency when we're trying to solve problems. And much of this has to do with the way we think as we process these judgments, decisions, and choices.

A lot of what I'm talking about today comes from this book, Thinking, Fast and Slow. It's a really amazing book; many of you have probably read it, and it's shared quite a bit through our industry. It's very dense; it reads like a PhD-level textbook. I got through it basically only because I listened to the audiobook on my flights zigzagging across the country. It took about a month, but it's worth your time, for sure.

In that book, Daniel Kahneman identifies two key ways we think about everything we encounter. He calls them System 1 and System 2. Now, System 1 isn't referred to in this book as "shallow work," but I've read some other pieces that describe the same idea that way. So System 1, maybe also called shallow work, is effortless. These are things we just do naturally. They're instinctive; we don't really have to think about them. But a lot of systematic errors still seem to creep in under specific circumstances due to our own cognitive bias. System 1 acts without effort or control, and we'll see some examples of that a little later, but it's also very, very bad at statistical analysis and even seemingly basic logic. A lot of it is just automatic. It's not a technical example, but sometimes I go on these long road trips, a few hours pass by, and I stop to think about all the miles behind me and I don't remember any of them. I don't remember the turns. I don't remember hitting my turn signal or reaching for the dial. A lot of what we do is just automatic. We just do it.

And then we have System 2, and this is more of the deep work stuff, right?
This is when we really have to focus, and it has biases too. System 2's defining feature is that it's very effortful, and one of its main characteristics is that it's actually very lazy: it does only the minimum of what's required of it, and it relies a lot on System 1 for its inputs. But there are vital tasks that only System 2 can perform, tasks that require effort and acts of self-control, where we have to stop and think about what we're doing and how we're thinking, which the intuitions and impulses of System 1 can't manage at all. So we have many thoughts, calculations, and inputs all jockeying for our attention, and they all require more than just automatic, effortless responses.

Oh, I just realized I need my audio cable here. We'll need that in a little bit, not just yet, but I've got a couple of exercises. One of the things that made me a little nervous about giving this talk is that I've got some group participation, and I've never done a talk where I ask the audience to get involved, so bear with me and hopefully this goes smoothly. Some of these exercises you may have seen before; if you have, don't blurt out the answers and ruin it for anybody.

The first one: I'm sure a lot of you have seen something like this, but I'm going to give you a few seconds to read the next sentence and then we'll see how we do. Okay, I failed to say "count the Fs," but hopefully you saw that on the screen. So, how many did you count? How many of you counted three Fs? Okay, and how many of you got six? It seemed to be about 85 to 90 percent of you saw three, and I think I saw about six hands that saw six. Let's look at it again. The correct answer is six. The reason you're not seeing them is that you're not spotting the Fs in the word "of." Our brain does not process "of" as having an F.

Let's do another one. Okay, good start. Friday the 13th, good. Now this one's a little harder. I want you to take this number, 7093, put it in your head, and then we're going to add one to each of those digits, and then keep doing that, moving forward. So that first number was 7093, and now we're at 8104, okay? So you got the first one as a hint. Now I'm going to give you about three seconds per number and we'll just keep going and see how you do. The first one's 7093. Could you do that? All right, not too bad. Now, what if it was five digits, or six? Do you think you could have done it?

All right, what if we go a little faster? What if we pretend this is a production outage? First of all, what was the last number? I'm hearing a lot of numbers, and they're all incorrect. Okay, again: 0326. All right, ready? We're going to do this again, faster. And go. It's really hard, right? Was anybody able to actually keep up with the numbers? Did anyone spot any problems? A few people did. The last transition went from 4760 to 5971, and you probably didn't catch it. Most people don't.
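If you want to run that drill yourself, or inflict it on your team, here's a tiny sketch of the rule as I've described it: add one to each digit independently, wrapping 9 back around to 0. You can see it reproduces the sequence from the slides.

```python
def add_one_to_each_digit(number: str) -> str:
    """Add 1 to each digit independently, wrapping 9 back around to 0."""
    return "".join(str((int(d) + 1) % 10) for d in number)

n = "7093"
for _ in range(4):  # drill through a few rounds
    n = add_one_to_each_digit(n)
    print(n)        # 8104, 9215, 0326, 1437
```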
How many of you are responsible for being on call, carrying a pager? Does this feel familiar when you're dealing with something that's broken? For those of you who are on call, you may want to close your eyes; the next slide is a little bit disturbing. The most cognitively effortful forms of thinking are those that require you to think fast, and the additional pressures associated with an outage make it even more effortful. We end up feeling like this.

Okay, here's my portion of the show with a little bit of audio, so hopefully we've got the sound working back there, all right? How many passes does the team in white make? The answer is 13. But did you see the moonwalking bear? Did anybody? We were so focused on counting the passes by the white team that we didn't even notice the bear.

Okay, anybody familiar with the Stroop test? A couple of hands, okay. This next slide, I'm going to give you about six seconds to read through it, and we'll see how you do. Okay, with the brainpower in here, I figured we could power through that sentence pretty quickly, and I guarantee you probably read through it with no problem at all. Now let's try again, slightly faster. Exact same words this time, but with a little change. I'll give you about three seconds on this one. You probably didn't get through it; probably didn't get past the first line. And the reason is simply that the colors didn't match the words. Did anybody actually make it past the first line or so? Just a couple of people.

What this is is the efficiency-thoroughness trade-off, or ETTO (a term from Erik Hollnagel). Mistakes are fairly predictable under these circumstances. The idea behind the efficiency-thoroughness trade-off is that you really can't be both highly efficient and highly thorough. You can only be one or the other, and we try to find a good balance between the two. A lot of us feel like we can operate at a high level of efficiency and also be extremely thorough, and the fact is that you just can't. The reason is cognitive bias, along with a lot of other factors, but cognitive bias is the big one I'm talking about today.

So back to that book, Thinking, Fast and Slow, and I'll read you a quote from Kahneman: "There are distinctive patterns in the errors that all of us make. Systematic mistakes, known as cognitive biases, along with impressions and thoughts, form within our conscious experience. This occurs naturally, without us knowing they're even there or how they came about. The mental work that produces these impressions, intuitions and decisions takes place silently within our own mind." Now, there are a ton of cognitive biases, and I'm going to mention a few of them, but this is a whole area of research. I'm fascinated by it and hope to keep doing work on it, but let's definitely do an open space, because I feel like I'm going to have to blast through a lot of this stuff.

Let's do another quick example. Imagine a person; we'll call him James. James is very shy and withdrawn, a very helpful person, but has little interest in other people, maybe a bit introverted. Meek and tidy, he needs order and structure and has a passion for detail. Now, based on what we know about James, do you think it's more likely that James is a librarian or a farmer? Who thinks farmer? Okay, and who thinks librarian? So most of the room thinks librarian. It was pretty close, about a 50-50 split, but it edged toward librarian. Now, there are more than 20 male farmers for every male librarian in the world. This is an example of us using resemblance as a heuristic: the resemblance of James's personality to that of a stereotypical librarian is what pops into most of our heads, and we totally ignore the statistical reality of how many librarians and farmers there actually are. This is known as the representativeness heuristic. We judged the probability of James being a librarian over a farmer simply on resemblance and circumstantial evidence. Because there are so many more farmers, it's almost certain that someone matching this description is a farmer rather than a librarian, but we ignored the relevant statistical facts.
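To put some rough numbers on why ignoring the base rate is such a trap, here's a back-of-the-envelope Bayes calculation. The 20-to-1 ratio is from the talk; the "fits the description" percentages are made-up assumptions purely for illustration.

```python
# Base rate from the talk: roughly 20 male farmers for every male librarian.
farmers, librarians = 20.0, 1.0

# Assumed likelihoods (invented for illustration): say the meek, tidy,
# detail-oriented description fits 40% of librarians but only 5% of farmers.
p_desc_given_librarian = 0.40
p_desc_given_farmer = 0.05

# Bayes' rule: P(librarian | description)
numerator = librarians * p_desc_given_librarian
denominator = numerator + farmers * p_desc_given_farmer
print(f"P(librarian | description) = {numerator / denominator:.0%}")  # ~29%
```

Even with a description that fits librarians eight times better, the base rate still makes "farmer" the safer bet.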
A heuristic is really just a shortcut. It's something our brain does; we don't have much control over it, and it pops up when we're trying to solve problems, especially when we're trying to be quick and efficient about it. Then throw in being on call: you're trying to make a lot of quick decisions without certainty about what the results are going to be, or in some cases what's even going on, and you have the constraints of time on top of that.

I'm going to read you another quote, from a really great article I found on brainpickings.org, about associative coherence, which is also related to cognitive bias. Associative coherence is the notion that everything reinforces everything else: "Much like our attention, which sees only what it wants and expects to see, our associative memory looks to reinforce our existing patterns of association and deliberately discounts evidence that contradicts them. And therein lies the triumph and the tragedy of our intuitive mind."

I don't know if you can read this up here, but this is an example of our really poor correlation-versus-causation skills. I'm not sure if this is true or not, but the divorce rate in Maine somehow has a direct correlation to the per capita consumption of margarine. There's also a direct relation between the per capita consumption of mozzarella cheese and civil engineering doctorates awarded in the US. Seems legit. And the number of people who drowned by falling into a swimming pool tracks the number of films Nicolas Cage has appeared in. Obviously these are meant to be funny, but we do seek out correlation and causation when there's really nothing statistically or logically there to back it up. A lot of this is just our mind trying to deduce, looking for patterns, trying to find things that might be related to what's going on.

We talked about why we don't do root cause analysis: there are a lot of contributing factors and we want to understand all of them. We want to know the story. We don't want to distill it down to one single thing, because tomorrow our system is going to be different than it was today. Why would we fix that one thing when tomorrow that one thing probably isn't even relevant? So sometimes we can be really bad at correlation and causation, and a lot of it has to do with cognitive bias. In many cases we become very stubborn: "No, this is definitely what's going on," and we're extremely confident despite all the facts. It was a mental shortcut, a heuristic, and our reliance on them can cause predictable errors in our reasoning, decision-making, and predictions.
We can be very, very confident even though we are blatantly wrong. And when the constraints of time and pressure, especially time to repair, influence our cognitive efforts, systematic errors get introduced into those judgments and choices. We've all seen something like this, and of course we all respond to our pager and start dealing with it, but whether you know it or not, certain biases start to creep in.

Another thing I love talking about is the Cynefin framework. Is anybody familiar with this? It's awesome. You should definitely do some more research on it, and maybe we'll do an open space on it as well. What it essentially tells us is this: we know we work in complex systems, but it breaks things down into obvious, complicated, complex, and chaotic, and our systems, for the most part, live in the complex space. Some of them might be complicated, but really they're complex. They're always changing, they're very dynamic, and there are a lot of unknown unknowns. When things are going wrong, we must first probe, then sense, then respond. There's a lot of emergent behavior, and sometimes things move over into the chaotic realm. Once you're in the chaotic realm, the first thing to do is just act, so you can get back into the complex realm and back into your safe zone. When you look at the obvious and complicated domains, those aren't the systems we've built. For our systems, all I really have to say to prove my point is: users. Users alone make our systems complex. They go in and do things we didn't even know were possible. Anyway, I could go on forever about the Cynefin framework, so let's do an open space on that as well.

The point is, there are a lot of complex tasks we have to perform within our complex environments, and a lot of those steps can't be skipped. We actually have to work through them, especially when there's an outage or some sort of problem. But what we can do is practice those steps. If we drill, drill, drill, we can become fairly automatic at a lot of what we do, and some of the stuff that used to be effortful, a System 2 type of effort, becomes more of a System 1 effort, because it's just natural and intuitive for us.

Some examples of that: we're all familiar with Chaos Monkey and the whole Simian Army at Netflix, but game days, these self-inflicted outages and recovery efforts, are really just simulating an outage so we can practice our recovery: detecting it, acknowledging it, triaging it, investigating it, actually stepping through an outage in a somewhat safe environment so that we get better at dealing with this stuff.

There's a guy named Lindsay Holmwood out of Australia; he works at the Australian government's Digital Transformation Office and has given a couple of talks on cognitive bias as well (I've got a link at the very end of my slides). In one of his talks he went into great detail about how you really have to practice a lot. Otherwise, when you do get into a real-life situation, people tend to just deliberate: they don't know what to do, where to start, or who they're supposed to call, and that's obviously going to hurt your mean time to recovery.
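You don't need Netflix-grade tooling to start practicing, either. Here's a deliberately tiny game-day sketch: pick a random service in a sandbox environment and stop it, then let the team practice detecting, acknowledging, and recovering. The container names and the use of Docker here are assumptions for illustration only; never point something like this at production.

```python
import random
import subprocess

# Hypothetical sandbox services for the game day; never point this at prod.
candidates = ["web-1", "web-2", "worker-1", "cache-1"]

victim = random.choice(candidates)
print(f"Game day: stopping container '{victim}'. Start the clock on detection!")

# `docker stop` sends SIGTERM, then SIGKILL after a grace period.
subprocess.run(["docker", "stop", victim], check=True)
```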
Another area where we can counter these cognitive biases is runbooks. A runbook is nothing more than a document, maybe a wiki page or a knowledge base article, that tells me, or whoever's in charge of the recovery effort, step by step, what I'm supposed to do in this situation. Because the fact is, when things break, it's usually at the worst possible time. If it's the middle of the night, or you're in your car, on vacation, or with your kids, you may not have all the information you need loaded in your head to start looking at the problem. Having those runbooks available is going to help a lot.

We've already talked about postmortems. Postmortems are great because they help us learn. A postmortem isn't a place to sit around and debate what went wrong and whose fault it was; that's about the worst possible thing you could do. It really doesn't matter whose fault it is. We don't want to ask "why" as much as we want to ask "how," because asking why tends to lead toward who did what, and then we start blaming, and then those people become subject matter experts at not telling you shit. So in postmortems, avoid the whys and go for the hows. We just want to learn. We don't really care who hit the button that deleted everything; we want to know how it happened, so we can improve things and maybe prevent it, or at least lessen the impact.

So practice helps, and so does paying attention to your mean time to repair and mean time to acknowledge. This is a report that shows me the mean time to repair for our teams. I want to know what happened over there on April 17th: why did it take so long? There's a bit of a jump there, so I'm going to ask questions about it, because it's important to me that we keep getting better at recovering from our problems faster.

One of the main takeaways from this talk is that we really can't overcome cognitive bias. What's important is knowing that it exists. Most of the time it's easier to acknowledge it and see it in other people's behavior; sometimes we don't even notice it in ourselves. That's another good reason to do postmortems: you can spot other people falling into something like hindsight bias. Hindsight bias is thinking, despite all the evidence, "we should have known this was going to happen, and we should have done something about it beforehand." Don't get caught up in that, and spot it for other people, because it's hard for us to spot it in ourselves.

Normalcy bias is another one that happens all the time: just because your server hasn't had a meltdown doesn't mean it won't. But for some reason we get cocky and think it's not going to happen to us; it never has, so it never will. Confirmation bias is another one.
This is where we actively seek out information that reinforces what we already believe. I may have done that for this talk, actually. But it happens all the time: we believe something, and then we go find evidence that somehow backs it up. With confirmation bias, we ignore alternate explanations and interpret ambiguous information in favor of our own position.

Here's another little exercise: which of these lines is longer? This is the Müller-Lyer illusion. The two lines are exactly identical; it's just those two little fins on the ends that fool our minds. It's an optical illusion, and if you've seen it before you might catch it, but most people, when they first look at it, believe the line on the bottom is longer.

I'm going to skip ahead, because we've got one more exercise I want to show, one more quick video. It's similar to the one we looked at earlier, so hopefully you'll do a little better this time: count how many passes the players wearing white make. Same deal. The correct answer is 16 passes. Anybody get 16? Okay. Did you spot the gorilla? Anybody? For people who haven't seen or heard about a video like this before, about half miss the gorilla. If you knew about the gorilla, you probably saw it. But did you notice the curtain changing color, or the player on the black team leaving the game? Did anybody? Let's rewind and watch it again. Here comes the gorilla, there goes a player, and the curtain is changing from red to gold. That's called the Monkey Business Illusion, and obviously it was a bit of a setup.

I'm running short on time, so I'm going to list off a few more cognitive biases that are very common in our industry, especially if you're somebody who's on call. Confirmation bias and hindsight bias we've already talked about, and we touched on the availability heuristic as well. Automation bias is where we trust automation so much that erroneous automated information can end up overriding correct decisions. The halo effect is where, because someone has a lot of good information, or for a variety of other reasons, we assume they're better than others across the board. There's also the framing effect. I've got a link to the slides covering these, and I'm running a little short on time, so I'm going to skip ahead.

The main thing I wanted to share, the last takeaway here, is that there's a lot going on in our brains, and the brain produces behaviors we may not be aware of, and there are implications to that, especially in the context of being on call and dealing with outages. Hopefully, from the things you've seen and we've talked about today, you've got a little deeper understanding of yourself, and you can keep some of this in mind: make sure you're conducting good postmortems, get some runbooks in play, and actually pay attention to bias, especially with other teammates, when you're discussing issues. So with that, thank you for your time. You can find the slides at the URL right there.
A lot of this stuff I'd love to talk about more, either back at the VictorOps table or in open spaces on a couple of the topics. And that's all I've got. Thank you very much.