This is the five dirty words of CI. My name is Paul Reed. This is the first time I've done an Ignite talk, so this is either gonna be awesome or horrible. Not sure. I'm a DevOps consultant. That's what people call me. I'm not sure how I feel about that, but that is what they call me. Oh crap, here we go. Okay.

CI: who here does continuous integration? Great. We're not gonna be talking about continuous integration. CI in this context is continuous improvement, specifically postmortems, operational postmortems. Hopefully they're blameless. We're gonna talk about the five dirty words that you might hear.

Root cause analysis. Who does this, right? This is the idea that if we look at a problem, an incident, we could find the linear cause, and if we found the root cause and removed it from the system, we would avoid the incident entirely. Now, when I talk to various teams, one of my favorite things to ask them is, "Oh, did you find the root cause of that incident?" They're like, "Yes, we did. Here are the eight root causes." And I'm like, no, it is not a root cause if you have eight of them. I think you misunderstood how that works. So what is a better option that we can use? Well, we can talk about proximate causes, right? There are multiple causes of failure in our systems that can cause an incident. But also remember that even if we solved all of them, it doesn't necessarily mean that we're not going to have the incident again. So be aware of that when you're thinking about that.

Five Whys. This is another sort of linear model of how we think about accidents, right? Who does this, right? The Five Whys, the Toyota thing: why, why, why, why, why, right? I always feel kind of like this when I watch people do the Five Whys, where it's just like, you know, the parrot: why, why? And after a while the team's like, no, I'm done with this question, right?
The other odd thing that I see is teams seldom get to five. They stop around two or three, which is odd to me, and then the questions that they ask are always kind of all over the map. So Five Whys: just no. No, no, no, no. Look into the Swiss cheese model or the systemic model for better examples of accident models that are less linear, models that actually model the systems that we're actually in.

Human error, right? How many of us have said, "Oh, that incident was human error," because someone typed something on a keyboard, right? Human error is an explanation that we use. It's a decision that we make, where we draw the line to stop looking at things. Now, you might have heard about a little service called S3 that went down a few months ago. If you read their retro, they don't use the term human error even once. They talk about the operational things that they learned about their scripts and their service, which made it a very fascinating incident. But even though an engineer typed a thing that was contributory to that, they didn't use the term human error. So stop using human error as an explanation for things. It's not actually a thing. Human error is not the cause of failure. It's actually an effect of failure.

You'll hear this a lot in retrospectives: "Why didn't you, why didn't you notice that monitor thing? Well, you should have done this," right? These are what we call counterfactual statements. The reason they are called counterfactual statements is because they talk about a reality that does not exist. It's almost like we're talking about these alternate realities of things that did not happen. It's like, well, I did not look at the monitoring, so I don't know why you're yelling at me about not having looked at it, right?
What we need to do is look at what actually did happen and try to figure out why it made sense to the person that did that, because if it made sense to them, it's probably going to make sense to somebody else. That's what we want to understand: why it made sense to the person doing the work in the system. Explore that, and don't talk about things that never actually happened.

Best practice. Who likes best practices? Who likes other people's best practices? Yeah, right. Okay, so one of the funny things is we always look at these DevOps unicorns. We look at the Netflixes and the Amazons of the world and see they're doing all these wonderful things, and it's like, that's because they didn't do this best practice thing, right? This is a quote from a conference I was at, where they were like, "Oh, management loves best practices." Yeah, like Netflix and Amazon did a ton of best practices, right? The problem with best practices: best practice is not applicable in the complex and complicated systems in which we work. You should be talking about good practice. Or, if you're going to use best practice, it's applicable in what's called the obvious or simple domain, so use it in the right domain. But you're probably talking about good practice, not best practice.

So, a couple takeaways. Continuous improvement is not linear, and it's continuous, so you don't just do it once and "we have now improved." You need to respect reality: we work in complex systems. And finally, you need to treat people like the professionals that they are.

Now, if you actually wanted the five dirty words of continuous integration, here they are. Broken builds: if you're doing CI, you should be fixing those builds, so there shouldn't be broken builds on your CI system all the time. Flappers, or tests that go red and green and red and green: if they do that, then remove them from your test suite, because they're not actually providing any value. Bob's Mac mini: apparently, I see these beautiful systems where they have AWS, configuration management, Jenkins, and then all their mobile stuff is done on Bob's Mac mini on his desk. Merge windows: that means you're not merging back to trunk often enough. And Jenkins build numbers: that means you're storing artifacts in Jenkins. Don't do that. That's all I got.
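
The "flappers" point, tests that go red and green and red and green, can be sketched in code. This is only a rough illustration, not something from the talk: the names `count_flips` and `find_flappers`, the `min_flips` threshold, and the sample `history` data are all hypothetical, and the history is assumed to be a chronological list of (test name, passed) pairs exported from whatever CI system is in use.

```python
from collections import defaultdict


def count_flips(results):
    """Count how often consecutive results differ (red to green or green to red)."""
    return sum(1 for prev, cur in zip(results, results[1:]) if prev != cur)


def find_flappers(history, min_flips=2):
    """Return the sorted names of tests whose outcome flipped at least min_flips times.

    history: chronological list of (test_name, passed) tuples (assumed format).
    min_flips: hypothetical threshold; tune it to your own tolerance.
    """
    by_test = defaultdict(list)
    for name, passed in history:
        by_test[name].append(passed)
    return sorted(
        name for name, results in by_test.items()
        if count_flips(results) >= min_flips
    )


# Made-up sample data: four CI runs of three tests.
history = [
    ("test_login", True), ("test_upload", True), ("test_checkout", False),
    ("test_login", False), ("test_upload", True), ("test_checkout", False),
    ("test_login", True), ("test_upload", True), ("test_checkout", False),
    ("test_login", False), ("test_upload", True), ("test_checkout", False),
]

print(find_flappers(history))  # ['test_login']
```

Note that `test_checkout` is not reported: it is consistently red, which makes it a broken build to fix rather than a flapper to remove.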