Welcome back to another OpenShift Commons briefing with the good folks from the GTO office. We have Andrew Clay Shafer here with us, and John Allspaw from Adaptive Capacity Labs, among his other incarnations. Today we're going to talk about learning from incidents, plural; I wish we could only have one. Zero accidents this week in my household, how about yours? I'm going to let Andrew and John introduce themselves, and we'll have a rolling conversation today, no slides. Andrew, take it away.

Yeah. I've talked with you all before, and I don't want to talk too much about myself, but I'll say a little in order to introduce John. The thoughts I have around DevOps and operations were definitely influenced by this man, John Allspaw, and by the way he got to be part of some very, we'll call them generative, projects. He gave a talk that essentially gave the DevOps movement its name: the famous Velocity conference talk where John Allspaw and Paul Hammond spoke about dev and ops cooperation at Flickr. That chained into a bunch of other things and led to a bunch of conversations about DevOps. He was a big part of the Velocity conference, he wrote some books, and now he's focused on and very passionate about this notion of learning from incidents and human factors. I'll let him introduce himself a bit more and then we'll chat about that. Thank you, sir.

Andrew, that's a great intro. I have to say that I've learned as much from you as you might have learned from me. At the highest level, what's on my mind and my colleagues' minds is introducing new ways of looking at how work gets done, and one of the most effective ways of looking at how work gets done is to look closely at incidents.

Not as a pitch, but explain what you do at Adaptive Capacity Labs.

Sure. Adaptive Capacity Labs is a small consulting group. We help organizations understand how they currently learn from incidents: who learns, where that learning travels or dissipates, and how to glean more and richer understandings of their incidents to help them do what they're already doing, but what tends to go unnoticed, which is preventing incidents. A great deal of this work means bringing a host of fairly particular techniques from research into human factors and cognitive work in other domains. None of those techniques and methods are new; they're just new to being applied in the software domain in the way that we apply them. From time to time an organization will experience a significant event, something that's really visible, so it's sometimes advantageous for them to hire us to do the analysis ourselves. A really bad oversimplification, but I'm going to say it anyway: you've heard of the NTSB in the US, whose role is accident investigation in aviation and other transportation fields. You can think of Adaptive Capacity Labs as helping build a mini NTSB inside your organization, a cadre of people with NTSB-like skills and expertise that they currently don't have.
Let's reify this a bit. When we say incident, what we're talking about is the website is down, or something along those lines.

I don't know, actually, as it turns out. Maybe what you've teed up for me here, in an incredibly veiled way, is that the definition of incident is not as crisp and standard and clear as we might think. From looking at real incidents: incidents don't always show up with a big label on their forehead that says, I'm an incident. Even working out whether a thing is an incident can be hard.

A few weeks ago, Cat Swetel was on another one of these sessions, and she brought up a slide with a picture of a factory floor sign: zero incidents in the past 365 days. The anecdote she told was that whenever she saw something like that, it panicked her a bit, because it meant they weren't watching for something, or they were missing something; there's really no way there wasn't something they could have learned from. So I think the definition of incident has lots of different semantic meanings in different settings. That's a key piece of the conversation.

Indeed. I don't know where you want to go with this, John, but keying off that: this notion of what's considered an incident is also, in some cases, a question of blame, or attribution, or causation. I know you have lots of thoughts on this, so maybe you could give us a little monologue about some of these things.

Yeah. Well, first, I actually like what Diane brought up, so I'll riff from that vantage point. Like I mentioned before, a lot of what we do is bring new perspectives to understanding what makes work hard, what makes people good at it, and what could potentially support or hinder their ability to do the work. The majority of the techniques and perspectives come from what are sometimes called safety-critical domains: power plants, medicine, the military, transportation, all that sort of stuff. We have to remember that even just declaring a thing that happened, an event, to be an incident, labeling it as an incident, is itself a categorization. The notion that there are really only two categories sounds cartoonish, or at least I hope it sounds cartoonish to a lot of the viewers here. But quite often, in the wake of an event, you'll hear: okay, was this the result of human error, or technical failure? It gets jammed into one of just those two categories. That frame, what makes something an incident and what doesn't, is a bit beyond the scope of this conversation, but as a bit of trivia: Heinrich, in the early 1900s, put together this notion that you could characterize events that way, even declaring a thing an incident or not, and human error versus technical failure. That was his contribution. What's not often brought up is that he worked at an insurance company. Having a perspective on a categorization is political as much as it is genuine curiosity.
What stems from this is exactly what you just mentioned, Andrew: blame. Blame certainly gets a lot of attention because it's palpable. Calling something human error, or making it about the individual attributions of a particular person, the root cause was Steven, or something along those lines, is really just a special version of that same categorization.

Exactly.

Once more: we have uncertainty, and with it a real discomfort, in the wake of an accident, an accident meaning a thing that carries some form of surprise and adverse effects. One, because it came out of nowhere; otherwise it wouldn't be a surprise. But two, to admit that those things are possible in the future, and the ever-present dread that they can't all be anticipated, means we have to put this fear somewhere, this general feeling of, oh my God, how can I feel good about the future, even if it's a lie? I'd rather feel good. So where can I place this tension? In the case of blame, I can put it...

We get uncomfortable with uncertainty.

Yes, exactly. And in particular we want to hold up a totem, some form of scapegoat: pile the sins of the village onto the back of the goat and send the goat out of town. We need to put it in a box. Notice I didn't say container. Sometimes that box is embodied in a person. Sometimes it's something really big and vague: it's the system, man. Or sometimes it's your cloud vendor. As long as there's a place to put the uncertainty. What underpins all of this is developing an understanding of the incident.

Aren't you missing a little piece, though? Or I'm sure you're not missing it, but there's always that phrase that failure is where the innovation comes from. When we put things in a box, or a container, whatever, that's also where you stop looking, in a sense.

I want to make one quick comment that I think might help the listeners, which is that John and I have spent hours and hours talking about some of these things over the last ten years. I think it's in our best interest to articulate that these quote-unquote systems are neither human nor technical; they're sociotechnical, both of those things together. And I'd add, and I think this is relevant to the OpenShift community: there is no organization on the planet running any of these systems that thinks the systems themselves are fully autonomous, or that their reliability does not depend on the actions of the human entities and agents who keep them reliable.

Yes, well said. I think we've probably spoken on the order of days, Andrew, on these topics over the years.

We're also getting old.

Yes, we are. A big part of taking on some of these perspectives is that it can be somewhat mind-flipping. Diane, you mentioned that failure is where innovation happens, which is undeniable.
One of the things that I've come to understand in a really deep way is something quite unintuitive: success, that is, understanding how people are just plain doing their work, can also be a significant source of innovation. You can think of many products in the world, very successful businesses, that turned what was otherwise a workaround in a previous product into a significant and really groundbreaking service; I think the CDN is a great example of that. But the difficulty, and this is the difficulty with the field of resilience engineering, is that I can't just say, all right everybody, at the end of the day let's get together and talk about all the ways the site could have gone down but didn't. There's not enough time; we'd be there until the next day. And this reflects the same thing in safety: the missing denominator. Cat's slide is a great example. In the world of safety you see those signs, and they really are in a number of places. Notice the denominator is missing, which, for one thing, takes for granted that all incidents are the same. It also doesn't count how many incidents were prevented; it only shows the ones that happened. Erik Hollnagel, a pioneer of resilience engineering, has said that when you start measuring things by what is not there, you run into some difficulties. You can certainly prevent a lot of shots on goal, but if you're not scoring yourself, it might not end the way you think. But again, for resilience engineering... Do I have a pithy definition? Okay, improvised: I would say that resilience engineering is the study of adaptive capacity, of investments in adaptive capacity playing out in real-world situations, grounded in concrete empirical evidence. And the study of resilience stands on understanding what resilience looks like to begin with. It's a field, it's a domain, it's a community, and it's at least twenty years old. But it's only maybe five years old in software and technology, which is just starting to bridge to and understand those ideas. Not very pithy.

Not very. One interesting side note is that there were things emerging in practice that were gravitating toward what you just described as resilience engineering, and they definitely predate the five years since you're giving it the label.

You're absolutely right. And that is the thing that was fascinating, the reason why, when I first became interested in this, did my master's degree, and kept reading, I contacted the heavies in that field: Richard Cook, Dave Woods, Sidney Dekker, Steve Shorrock, and others.

This probably should have come out in the introduction, but just for the listener, walk through how you got there. You ran these websites; give us a little bit of the arc of how you gravitated toward that field.

Sure, sure. I worked at a photo-sharing website called Flickr, as you mentioned. We were acquired by Yahoo, but for the most part we were our own standalone entity.
And we grew in ridiculous ways, cartoonishly stratospheric ways. We went from being something like the 25th most-trafficked property at Yahoo to the fifth most-trafficked, behind the front page and Mail, in about 18 months. The complexity of the back end of the website, all of the things that made things work, kind of exploded. At some point I had a team of up to six infrastructure engineers, and we had some big outages, some pretty significant outages, but I couldn't get over the fact that on paper we should have had way more. I couldn't understand what that was about. Having been part of responding to some of those incidents, after you work out the incident, and all incidents can be really harrowing, the aftermath is, wow, that was bananas, that was crazy; it's kind of crazy that we even worked out what was happening. So as a manager I was thinking to myself: what's going on here? Either I'm incredibly good at hiring, and the ability to do this work is innate, you're born with it, and I just happened to strike gold with the people on my team, or I'm an amazing manager. Both of those are completely unbelievable. Certainly the latter would have been existentially difficult to accept, because I would have had no idea what I did to make that happen. So I started to look into what underpins people's ability to solve a problem, and not just solve a problem, but solve a problem under time pressure, where any of the actions you take could very well make things worse and represent cautionary tales and existential business situations. That's what led me to human factors. What I came to understand about human factors is this: most of us understand ergonomics, and ergonomics is often seen as a specialized subset of the field (in Britain you'd say ergonomics is the field and human factors is the subset), but the fact of the matter is that wherever technology, work, and people come together, that's this field. What I realized was that something happened around the 1970s. Traditional human factors started to undergo an existential moment of, wait a minute, maybe we actually don't understand this stuff. Three Mile Island was the point where the whole planet that was doing human factors work went, holy crap, no, actually, you can't design an operations room without taking into account the cognitive work, not just the plain old can-you-see-the-dials sort of stuff. And cognitive systems engineering was born. I wouldn't call it a splinter; it's a field in and of itself. Don Norman, Dave Woods, these are folks who came almost entirely out of nuclear research and nuclear power plants, but then went on.
To this day, even though resilience engineering is a pretty broad field (there are sociologists, operations researchers, statisticians, lots of people), a core part of it at this juncture is cognitive systems engineering. It's not all of what represents resilience engineering, but it's certainly a core part of it, much like statistics is a part of computer science and mathematics; these things interrelate. That's a little bit of my background and how I got there, and I'm still learning. The final thing I'll say is that this way of looking at things is much more rewarding. The thing I'm excited about is that, much like continuous delivery and continuous deployment, the notion and all of the things we associate with it, both the things that enable it and the rationale for even thinking about it in those terms, there was nothing special about that 2008-2009 timeframe. All of those ingredients had already been set up; you could argue that Extreme Programming was pretty much the thing that tipped people down that road. It's one of those things where you look at it and say, oh yeah, it seems so obvious in hindsight: small and frequent changes, for this reasoning, and you need these things to do it. Pretty straightforward, but it is a perspective shift. My guess is that both of you were there to see that perspective shift, the light bulbs going on.

The perspective shift is not evenly distributed.

You're absolutely right. You're absolutely right.

So when we talk about resilience engineering and cognitive systems engineering: the work, how you applied that, maybe not at Yahoo but afterwards. Tease that out a little, because the thing that sprang into my mind is how we've almost tried to automate this in software with things like chaos engineering and Chaos Monkey, which doesn't take the human factor into consideration at all. It tries to simulate it, but it's just running test after test after test against your website. Can you tease that out a little more?

Yeah. Actually, I'll comment a little on what you just mentioned with respect to chaos engineering, as an example of how the application could look. Myself, Nora Jones, Casey Rosenthal, and others (as a matter of fact, there's a new book from O'Reilly on chaos engineering) have pointed out that one perspective is certainly the one you describe. Another perspective is that the creation of a chaos experiment, the process and practice, the dialogue that generates where and how and when an experiment ought to be performed, can be as valuable as, sometimes even more valuable than, actually running the experiment. In which case, this is a capture of cognitive work. As a matter of fact, let me just read this here, from an interview Nora gave: Jones also states that what happens before and after running a chaos experiment is as important as running the experiment itself.
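To make that point concrete, here is a minimal sketch, in Python, of what writing a chaos experiment down might look like. It is not taken from any particular chaos-engineering tool, and the field names, prompts, and fault-injection step are hypothetical; the point it illustrates is John's, that filling in the hypothesis, the steady-state check, and the abort conditions forces the team to articulate its mental model before anything is actually run.

```python
# A hypothetical sketch, not any specific chaos-engineering framework.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ChaosExperiment:
    hypothesis: str                    # what we believe the system will do
    steady_state: Callable[[], bool]   # how we know the system is healthy
    inject_fault: Callable[[], None]   # the (hypothetical) fault to introduce
    rollback: Callable[[], None]       # how we undo the fault
    abort_conditions: List[str] = field(default_factory=list)

    def run(self) -> bool:
        """Inject the fault only if steady state holds; report whether it still holds after."""
        if not self.steady_state():
            print("Steady state not met; aborting before injecting anything.")
            return False
        try:
            self.inject_fault()
            return self.steady_state()
        finally:
            self.rollback()


# The conversation that fills in these fields ("what do we expect?",
# "how would we know it went wrong?", "when do we stop?") is the captured
# cognitive work; the run() call itself is almost secondary.
experiment = ChaosExperiment(
    hypothesis="Checkout latency stays under 500 ms if one cache node is lost",
    steady_state=lambda: True,                              # placeholder health check
    inject_fault=lambda: print("simulate losing a cache node"),
    rollback=lambda: print("restore the cache node"),
    abort_conditions=["error rate above 2%", "on-call pages fire"],
)
experiment.run()
```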
So how does the application of cognitive systems engineering look? Well, the first real application was in my master's thesis, which was to figure out what rules of thumb, or heuristics, engineers use when trying to understand, resolve, and respond to outages, especially when signals, as we know, can be disparate, sometimes contradictory, sometimes not making much sense, and when, faced with an almost infinite number of places to look, you have to start looking somewhere; you have to look in some places rather than others. That is the study of cognitive work. My thesis, which you're welcome to download in case you're having difficulty sleeping, goes into detail there. If I were to give you a couple of threads to pull on: look into the methods, techniques, and approaches, an entire family of them, that make up what's known as cognitive task analysis (CTA), along with related approaches such as cognitive work analysis (CWA). All of the tips and tricks that go into those are the application of cognitive systems engineering. You can think of them as the tools to understand how people understand, how people wrestle, in teams and individually, with the problems they're facing and the problems they're anticipating, and what those problems mean, both in anticipation and in response. What comes out of that is a closer look, and what we always like to say is: look, the expertise is coming from inside the house. There's much more to understand about how people do their work than is represented in Jira.

I would add that there's a tendency in all of these practices, especially when you're outside the core conversations, to focus on the tools, because you see them as a concrete representation of what's happening. But in my mental model, and in the conversations I've had with some of the people you just mentioned, I feel like the core chaos engineering community and the stuff we're talking about with cognitive engineering and resilience are essentially inseparable.

Yeah, absolutely. And what's exciting about chaos engineering is that a lot of the proponents, even the earliest proponents, are seeing this connection, and they're seeing it in ways that are, for me, really satisfying. They're making new connections between resilience engineering and chaos engineering that I wouldn't even have seen. So that's really satisfying; I'm super happy about that.

There's something in the chat that reminds me of some of the stuff I've seen you talk about before that might be fun to articulate here, which is this notion of the line of representation and our models, and how the process of incidents, and analyzing them, helps us build clearer models.

Yeah, interesting point.
This notion of the line of representation is a bit of a mind-blower. This was worked out in the SNAFU Catchers consortium and is described in a lot more detail, more eloquently than I'll manage here, in the STELLA report. The frame goes like this: we have all of the technical stuff. We've got databases, we've got the thing that we build, the thing users use that generates revenue, and we've got the stuff we build to help us build that thing, and all of the things that intertwine with it, including dependencies. So we've got all this stuff that fits together: databases and code repositories and networks and firewalls and all of that. We manipulate that stuff, we do things with that stuff, via a representation of the stuff, not with the stuff itself. When you go to make a schema change, you don't go to the data center and do something physically to the database. Everything we know about that world is via these representations; they're not the things, they're representations of the things. A distributed tracing app is a representation; to the extent that it's useful, it is a representation, not the thing itself, not something you can hold. And what that means is that people's ability not only to make changes, but also to anticipate what the system might do in the future, and to understand where its behavior came from, comes from nowhere except their mental model of that world.

The work we do is both facilitated and limited by these mental models we've built up about what we're working on.

Yes, exactly. And what close study of incidents shows is that no one has an identical mental model of this below-the-line stuff. They may have models that are close; they may have more detail in some areas than others. What's happening is that teams are continually recalibrating these mental models: through discussions with others, through looking at dashboards, writing new code, seeing how it behaves. It's a constant recalibration. So we have overlapping mental models, but they're never complete; they're always faulty in some way. And the stuff works almost all of the time despite that. The reason it does is that only people can adapt and recalibrate those mental models. It's not the stuff below the line; there's no intelligence down there other than what has come from us. It's our ability to make sense of what's happening: what happened in the past, what's happening right now, what makes that matter, and what might matter in the future and be worth paying attention to. That's the notion of above the line and below the line.

To go back to something very early in this conversation, the blame game: I come from a perspective of open source community development and trying to shed sunlight, so when there is an incident...
One team has its mental model of how things are working. One of the things I try really hard to get people to do, and it is very hard, is to share their model. It's almost a cultural shift, because often it's internal: something went wrong with the product or a service, Flickr went down, somebody's service went down, and they're very reluctant to have an open dialogue with the user community about what went wrong, because maybe users will shift to another service provider, or something like that. So I'm wondering, from both of you, how you help companies and organizations understand that putting some sunlight on your mental models, exposing them, sharing them with people, and doing that effectively, allowing other opinions in, is worthwhile. Because, going back to where the innovation happens, those aha moments often come from an outside perspective.

I want to add, before John answers the question for real, that this occurs at several layers and levels. Internally, people talk about job security: people will protect the mental model they've constructed and not share it internally. That also happens between teams and between departments, and then, as you mentioned, externally. At the same time, the Velocity conference and the community around DevOpsDays over the last decade or so have essentially made public postmortems, public incident analysis, into an art form. I'll let John make his comments, but this isn't just between the organization and the outside world. We protect our mental models at every scale.

Yes. And to your observation that there can sometimes be reluctance to, I wouldn't say share mental models, because I might make a point about that, but even just to relay any sort of information about what was happening for them at the time...

Doing a public postmortem on a public service outage or something like that. Huge reluctance from engineering teams to do that.

Sure. Well, if they believed they would get something from it, they would do it. This applies internally, just like Andrew said, and externally. There are some peculiarities about public write-ups of incidents; remember, the purpose and importance of those is very different from an internal write-up, and mistaking the two as being similar is a mistake. The point you brought up is reluctance, and there's a reason people are reluctant. If they think they can get something out of it, if they think there's something positive and they feel supported in giving their story, great. If there's something potentially threatening for them, or for others, then they won't. And a somewhat nitpicky point on mental models: I can't ask you for your mental model; you can't give it to me. You can tell me a story, and that comes from a cognitive technique.
When you ask somebody about how they rationalized something (this is called reflexivity), they will give the answer they think the asker, the requester, wants. You have to build a constellation of data that supports this mental model calibration and recalibration, and that means a mixture of records of what people do, what people say, and what people do and say about what people did and said, including others. This is called process tracing, and it's the way you can make valid inferences about cognitive processes. Sorry to get really nerdy there for a second, but this is what makes doing this work difficult. People don't share things they think everybody already knows, or things they aren't even aware of themselves. A famous researcher, Michael Polanyi, said it best about tacit knowledge in the late 1960s: we can know more than we can tell. A significant part of studying cognitive work is exploring tacit knowledge, and there are ways you simply cannot do it; you have to learn, and practice, how to do it, otherwise the results aren't valid. And there's only one thing worse than a really poorly captured incident write-up, and that is an incident write-up that everyone, despite its contents, finds to be non-credible, because the authors and the methods by which it was formed are seen to have an agenda. Effective incident analysis requires the analyst to be a non-stakeholder. Full stop. There is no other alternative. You need to have no stake, no dog in that fight, no horse in that race, about what the analysis finds, other than to provide others a boundary object, a source of dialogue.

Isn't it exceedingly difficult to have no agenda?

That may be why Adaptive Capacity Labs is an expensive professional service. Maybe it is; if it were easy, we'd already be doing it. There's a reason why, and let me be super blunt here: the world of human factors, the origins of cognitive systems engineering, all the stuff we're talking about, and definitely cognitive task analysis, come out of research in the military, in DoD- and DOE-funded projects in the US and other parts of the world, because of consequences and time pressure. Many colleagues of mine have said this. It's jokingly said that you're doing this work either because somebody who was supposed to get killed didn't, or because somebody got killed who shouldn't have. Consequence and time pressure wipe away anything else that is immaterial; that's what makes incidents the Trojan horse. I think it's a myth to say that the focus of these techniques, of incident analysis, is to find what broke. It's not some sort of socialized debugging; it's to find out how stuff works at all. The incident is just a director of attention; the incident is just the filter. You can think of an incident as your system saying: hey everybody, listen, you really ought to come pay attention, something here doesn't work the way you think it does. It's incredibly efficient in that way. It's the opportunity.
It is the opportunity, exactly. And to that I would say: people who don't have the skills for how to do it are only going to get so much out of spending an hour in a conference room filling out a template. Sorry, slightly snarky.

Well, I was going to say, the other thing is that a good incident report or a good postmortem doesn't necessarily tell you what caused the incident. It gives other people information they can help you sift through, and maybe it sparks the conversation that gets you to that opportunity.

Yes, and in order to do that, it needs to be compelling for the broadest audience in the deepest ways possible. This is something we know about software engineers: they don't read anything they don't think they need to read. And when they think they need to read something, and they have an expectation that they're going to get something out of it, you're damn right they're going to read it. So doing that means capturing what makes incidents hard. Capturing why red herrings and wild goose chases happen, which is that following those leads has worked in the past. But you very rarely see the details of red herrings, and of what made a red herring so attractive to follow, in incident write-ups. Very, very rarely. That's an example of the messy details, and that's really important.

The other outcome of doing postmortems and incident reports is building trust. When you share that information, you're building trust with the other folks across silos internally, or with your end-user community: you're sharing this information as opposed to withholding it and not exposing the things that might have led up to it. And the hardest thing is to do it well.

You're right about trust, but that trust is proportional to the quality of the write-up and to what others find of interest in it. Which is why I'd say a very strong signal, not the signal, but a very strong signal, is how many people read it. I'm going to go out on a limb: I know that counting and doing statistics on how often somebody has visited a web page is a solved problem; I know of a company that has built its entire business on that. But the way to break trust is to make all your incident write-ups available when they're terrible. Doing good ones is a skill, and it shouldn't be done lightly. In a lot of organizations it's a mandated, perfunctory action, and that's the problem.

John, we're coming toward the top of the hour, and given that not everyone has John Allspaw on retainer, what kind of practical advice would you give to someone listening about where to start and where to explore? What can they do that might make meaningful changes to their own mental model, not necessarily about their systems, but about this type of work?

That's a great question. For the record, if anyone is interested in having me on retainer, certainly please reach out. There are two things I would suggest. The first is to understand that there's a growing community; it's not just Adaptive Capacity Labs.
There's a website called Learning From Incidents, and you'll see, reflected in a lot of blog posts, more and more people talking about these topics; I'm happy to tweet out much more. I'll say that the Learning From Incidents site and, in particular, Lorin Hochstein on GitHub have put together an absolutely stunning set of resources about resilience engineering and understanding cognitive work that you can look at. And then, pragmatically, practically, a couple of suggestions. The first is to make the effort to capture, from as many people as you can, what was difficult. Put a new section in your postmortem template, or wherever you want, and get people to write: what was hard, what was surprising, what was difficult. Not what they thought the team found difficult, not in an abstract way, but individual perspectives, individual perceptions. Lots of things are difficult; sometimes even understanding that the thing you're seeing is bad can be difficult. So gather those sorts of reflections. Every engineer has this feeling. When we've talked with organizations, I'll ask: have you ever been responding to an incident and about to run a command, and everybody thinks you should do it, all your colleagues say this looks like the best shot, okay, all right, I'm going to do it, and right before you hit enter you have the feeling that there's an equal chance this might make things worse? The answer is: on a regular basis. That's a palpable, extremely important experience that almost never finds its way into these narratives. So capture what makes work hard, what makes work harrowing. And there are surprises that are absolutely fundamental. There's this notion of a situational surprise: that's when you buy a lottery ticket and you win the lottery. And then there's fundamental surprise: that's when you don't buy a lottery ticket and you win the lottery. Fundamental surprises are what make Chernobyl; they make the BATS IPO, they make Knight Capital, they make Three Mile Island, they make accidentally sending a ballistic missile alert to the entire state of Hawaii. So capture that; that's my pragmatic advice. Capture that stuff, put it down, and people will read it, because they've been in that situation.

What do you think about that? I think I could sit and talk to you all day.

Always.

Well, we'll definitely have you back. There's a piece I'd also like to tease out, because, again, Andrew is maybe focused on organizational change and transformation and DevOps, and I'm trying to figure out how to apply this to some of the open source communities that we're helping support, because doing this in open, transparent processes, as opposed to an enterprise process, is very important. I'm sitting inside good old Red Hat, and this stuff happens all the time, and we do have a great engineering team: they've read all the books and they actually apply a lot of this stuff, which is great, it's been wonderful. But then we take it and we have to do it in the open.
And when I talk about sharing that, how we do this in an open, positive way and learn the practices in open source communities, that's something that, now that I've read the books and heard you speak and heard Andrew speak, everybody is trying to figure out: how to take this to the open source community work that we're doing.

Very good. That sounds like an amazing and excellent challenge, an excellent topic.

Cool. We are at the top of the hour, and we're going to hit a button soon and end this conversation, which is going to be a fundamental issue for all of us, because we'd love to keep going.

It doesn't have to end the conversation. You can reach out to us on LinkedIn or Twitter or what have you.

Yes, we'll just end it for the day. I'll try to find many of the references that you both spoke of and add them to a resources page for this conversation when we post it. We'll definitely have you back again, and there are lots of things to think about now, over the weekend and ongoing. Thank you very much for joining us today.

I'm very happy to talk with you, Diane, and I always love talking with you, Andrew.