Hi, this is Swapnil Bhartiya, and welcome to TFiR: Let's Talk. Today we have with us, once again, Kolton Andrus, founder and CEO of Gremlin. Kolton, it's great to have you back on the show.

Thank you, pleasure to be here. Love chatting.

Talk a bit about the importance of reliability in terms of business continuity. What does it mean, not only for business leaders, but also for developer teams, DevOps teams, DevSecOps teams, SREs? I mean, there are so many labels there.

Yeah, coming from an engineering background, that's the perspective I'm most familiar with. In general, engineers want to write high-quality code. They want their code to work. If something fails, if something goes down, if the company they work for has a big outage, it's not a great thing. It's something engineers would prefer to avoid, prefer to prevent from happening. And there are two sides pulling here. Engineers, if given infinite time, would build very high-quality, highly scalable, highly extensible, highly reliable code. But that's not the world we live in. We always have to make a set of trade-offs. We always have to be thinking about how to use our resources efficiently, how to make our customers not just happy but excited, how to enable the teams around us. So there are often these competing concerns. You mentioned security, and I think security and reliability share a lot in common. When there's a big security incident, it becomes a huge priority for the news, for the media, for the company, for everyone inside. And that's really where security is given the power to go prevent such things from occurring. That can happen one of two ways. One is the draconian "thou shalt not": they add a bunch of process and tell a bunch of people what they can't do. That really slows things down, but it probably makes things more secure. The other approach, as you described, is integrating with the engineers, working with them to build security into their processes, enabling them to have the right information to make good decisions so that they are building that security in. I think the same thing applies to reliability. Sometimes when there's a big outage, a "thou shalt not" rule comes down: thou shalt not do such a thing in production. Code freezes are a great example of this. Oh, we're coming into the holiday season, Black Friday's coming up, nobody's allowed to push code for the rest of the year. That's one way to keep the system from changing. Is it the right thing? I don't think so. I think it's much better to work with the engineers, help them understand what the risks are, help them quantify and measure the risks, help them mitigate the risks so that everybody feels better.
Beyond security risks, can you talk a bit about the other risks that could lead to outages or downtime? Once again, we are looking at the modern workload, not the traditional data center world.

Yeah, so I think the joy of microservice architectures and building these complex distributed systems is the famous joke that you never know which other system can bring your system down. In my experience, most of what I see is a dependency having a problem. It could be internal, it could be an external third party, but another dependency has a problem. It gets much slower, it starts shedding load, it starts throwing errors, and we've never encountered that scenario as service owners. And so we're surprised. We didn't get alerts on it. We're not really sure what's happening. Things just aren't working correctly. We've got to drop into debug, diagnose, triage mode. That's exactly the kind of thing I've championed in the past: preparing for large events, preparing to help teams get better. Do you know all your dependencies? Do you know what happens if a dependency fails? If it just goes away tomorrow, what's your user experience? Can you keep operating? And then, what happens if it gets slow? If things start lagging behind, are you gonna start lagging behind with it, or are you gonna be able to cut that off in a way that means you can continue to serve healthy requests? So I think dependencies are a big one. We also all deal with multi-region, multi-cloud, multiple availability zones, and so there's always some hardware that can fail. You look at the law of large numbers: when we're dealing with data centers with hundreds of thousands of disks and hosts, failure goes from being very, very seldom to a daily occurrence. So some things will fail: some network switches will fail, some disks will fail, some hosts will fail. We really wanna turn those into benign events. We wanna be able to just handle them without them being customer impacting. And that can go all the way from a low-level event, an EC2 instance gets rebooted or replaced on you, up to a cloud provider having a region failure for a couple of hours. Do you need to be able to continue operating through that?

When it comes to reliability, as when we start talking about security, it's an organization-wide problem. But the problem is that when it becomes everybody's problem, it's nobody's problem. So when it comes to reliability, where does the buck stop? And does it take a trickle-up or a trickle-down approach? Should it go all the way from the CEO down, or should it be a developer movement?

I've debated this one a lot personally. Again, as an engineer, I think you should arm your engineers to do the right thing and, in general, get out of the way. And I think that means giving them the tools and the time and the resources to address important problems. But I'm CTO now; I'm not the one fixing the outages when my system has a problem. And we have outages from time to time. We were running a game day yesterday to test how we handled a set of failures, to reproduce a past failure and verify that we had mitigated it correctly. And one of the things I've learned is that I have to make it a priority for my organization. If I don't tell my team that it's important, if I don't tell them that it's a priority for me, if I don't have a way of measuring it, a way of tracking it, then it just becomes one of many things and it becomes best effort.
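To make the dependency discussion above concrete ("are you gonna be able to cut that off in a way that means you can continue to serve healthy requests?"), here is a minimal Python sketch of guarding a downstream call with a time budget and a fallback. The function names, fallback content, and the half-second budget are hypothetical illustrations, not anything Gremlin-specific.

```python
import concurrent.futures

# Shared worker pool: exiting a per-call pool's context manager would block
# on the slow call and defeat the timeout.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=8)

FALLBACK = ["popular-title-1", "popular-title-2"]  # safe default content


def fetch_recommendations(user_id: str) -> list[str]:
    """Hypothetical downstream call (an RPC or HTTP request in a real system)."""
    raise NotImplementedError  # stand-in for the real dependency


def recommendations_with_budget(user_id: str, timeout_s: float = 0.5) -> list[str]:
    """Call the dependency, but never wait longer than timeout_s.

    If the dependency gets slow or starts throwing errors, serve canned
    fallback content so this service keeps returning healthy responses
    instead of lagging behind along with it.
    """
    future = _POOL.submit(fetch_recommendations, user_id)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return FALLBACK  # dependency too slow: cut it off and fail fast
    except Exception:
        return FALLBACK  # dependency erroring: degrade gracefully
```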
But if I make it a point that every Monday, in my staff meeting, I pull up our reliability score, I pull up the reliability tests that are run against our services, and I look at them: did something go wrong? Is something failing? Why is our score lower than it should be? That gives my team the air cover and the motivation and the prioritization to invest some time to go correct those issues. And every time we do it, we find something else that would have been a huge problem down the road. So every time I do that, I feel good. I feel like I am finding and fixing problems that are ultimately saving the team time, just by following up, asking questions, and staying informed about the reliability of our system.

How do these teams get visibility when something goes wrong? At the same time, how do they avoid alert fatigue? And third, how are they empowered to take action on whatever incident happened?

When I first joined the Netflix platform team, we had 650 alerts for just our service. And one of my first jobs was: hey, I'm gonna go clean this up. I went through, and every time I came to an alert, I went to the team and said, hey, can I delete this? And everyone said, ooh, maybe not. I don't know, are you sure? And I said, has anyone used this? Has it fired recently? What is it actually monitoring? So I had to do a bunch of homework. I had to dive into those alerts. I had to go look at the metrics. I had to go look at the history. I had to ask myself if I thought it was sensible. And I think that's about half the battle; I was able to cull a lot of unneeded noise by doing that. But that only tells me how well the system will do in the happy case, when everything's well. For the other half, I really need to observe the system while things are going wrong. That's the way I know my alert is tuned to fire at the threshold where things are going bad and somebody needs to look at it, while avoiding the noise that happens underneath. That's where I need to dig in. Oh, we're talking to the recommendation service. We think, on average, these requests take 100 milliseconds. We have a timeout set to half a second. Oh no, everything went wrong; responses are taking two seconds each. Well, now our timeout trips. Is that the right behavior? Do we want to fail those fast, or do we want to wait? And that becomes a business discussion with that other team: how should the system operate? What's the right product experience? So I think that's the other half of the equation. We have to dig in, look at our monitoring and alerting, and understand it. But then we have to go look at it while the system's under duress. And I obviously think the best way to do that is controlled, thoughtful fault injection that lets you test specific scenarios. That's how you tune your circuit breakers and your timeouts and your fallbacks to make sure they work when you need them to. And this might be fun to discuss: I've been championing chaos engineering for quite some time, and I think one of the things we've learned is that it's necessary, but not sufficient. It helps us go solve a lot of problems, but it doesn't solve the organizational problem. And so I think we're touching on these elements. Hey, what are the things people do to make their system more reliable? We know what those are; we can walk through a lot of those. But how do we get the company to be more reliable?
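The numbers above (100 ms on average, a half-second timeout, two-second responses under duress) suggest the kind of controlled fault-injection test Andrus describes. Here is a hedged sketch reusing the hypothetical `recommendations_with_budget` wrapper from the earlier example, assumed to live in a module named `service`: inject two seconds of artificial latency and verify that the timeout actually trips and the fallback is served.

```python
import time
from unittest import mock

# Assumes the wrapper sketched earlier lives in a hypothetical module `service`.
from service import FALLBACK, recommendations_with_budget


def slow_dependency(user_id: str) -> list[str]:
    """Injected fault: the dependency now takes ~2 seconds per request."""
    time.sleep(2.0)
    return ["real-title"]


def test_timeout_trips_and_fallback_is_served():
    # Swap the real dependency call for the slow fake, then verify the
    # half-second budget actually cuts the call off.
    with mock.patch("service.fetch_recommendations", side_effect=slow_dependency):
        start = time.monotonic()
        result = recommendations_with_budget("user-123", timeout_s=0.5)
        elapsed = time.monotonic() - start

    assert result == FALLBACK  # degraded, but healthy, response
    assert elapsed < 1.0       # we did not wait the full two seconds
```

This is the in-miniature version of tuning circuit breakers, timeouts, and fallbacks under injected failure; a real chaos experiment would inject the latency at the network or infrastructure layer rather than with a mock.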
How do we get the company to care? And I was sharing a bit of my experience: I think it needs to come from the CTO, it needs to come from the top. People need to have that air cover so we can reward the right behavior. And I think you and I have talked about this before, but it's so important. Engineers are out there doing a lot of good work, and if it goes unrecognized, unrewarded, unappreciated, people stop doing that work and start focusing on other things. So we can have an organization where we deal with fires and we reward the firefighters, and that's who gets the attention. Or we can have an organization, as you said in the beginning, that is boring, where we've rewarded the boring behavior: you've done all your tests, you've covered all your test cases, you've mitigated all your risks, and you never cause an outage. And for me, that's really the change that needs to happen culturally for us to see the real improvement that we need in reliability. We've got to stop thinking of it as the ops guys' problem, the admins' problem, just something we deal with and move on from. We've got to treat it as quality software, part of a good customer experience, and invest in it. If we do that, we actually spend less time on all of this, our customers have a better experience, and we're talking about this less. We have a new problem, which is like, hey, things are going too well and we've fallen out of practice.

Do you also see that there should be a company-wide culture to encourage these kinds of practices, and to reward them? Once again, we need a whole cultural-change approach toward reliability as well. I have a two-part question. Number one: what are you seeing your customers, clients, or organizations doing to reward these practices? Second: what can they do to reward these teams?

Yeah, so what I'm seeing a lot is not quite rewarding; it's approaching it from a compliance point of view. We have a lot of financial services industry and publicly traded companies, and they have commitments about their reliability and their business continuity that they need to demonstrate. And so one of the things they've done is say: this is something every team, or every team above a certain threshold, needs to participate in. Through that, it becomes a form of coverage. They're getting a sense of test coverage, because all the teams involved in this compliance work will have to go do it. And it's some basic stuff: can you lose a host? Can you lose an availability zone? But a way of proving that is something we haven't really had. So this has been an interesting development for me in this space: you can go do the testing and you can go make things better, but what's your proof that you fixed something? What's your proof that the system's better? And I think that's what's really lacking in a lot of public S-1 filings. Companies say, hey, we've got this business continuity plan. But have you ever run it? Have you ever exercised it? Have you ever actually enacted it? Because a plan that you've never run is wishful thinking in a lot of cases, and you're gonna find all the hairy details. If you wait until there's some large event, some big, big catastrophe, and you need to be able to serve your customers out of a completely different data center or region or zone, and you've never gone through that exercise, it's gonna take you days and it's gonna be very painful.
So I think the businesses have a really vested interest in just validating those plans. They're doing most of the work; capture the outcomes, measure them, and report on them. That gives us a way to look at this coverage, a way to look at what's happening. But as we discussed, I think it's gotta come down from the top that quality and reliability are core pillars. We need our software to work, we need our software to work fast, and we need our software to work at all times. That is going to become the standard over the next 20 or 30 years. We're still in a very move-fast, break-things, innovation phase, but I've already watched the public's appetite and willingness to deal with failure drop dramatically over the last 10 years. In 10 or 20 years, the public is gonna look at this a lot like: I turn on my water faucet, the water works. I flip my light switch, the power comes on. And when I open my app and book a ticket, it works. And when it doesn't, the companies that aren't able to provide that are gonna be at a severe disadvantage.

Talk a bit about the role that Gremlin can play, or is playing now. When I talk about role, of course, you folks can bring a horse to water, but you cannot make it drink. But sometimes tools do become catalysts of cultural change as well, because they trigger those kinds of collaborations between teams. So talk a bit about the role Gremlin is playing here, both in enabling teams with the right tools and, at the same time, bringing about the cultural changes needed.

Being in the market for several years now, we've had the opportunity to listen to our customers and hear their pain. There was some pain around how they were doing chaos engineering: how to make it safer, how to make it easier. But a lot of the pain was: how do they drive that organizational change? And what we decided was really needed, what was missing, is a way to measure test coverage, a way to measure risks in the system, a way to measure the quality of the software when it comes to reliability, and then the ability to bubble that up and give it to management. So we went out and built a reliability score, and we built a default reliability test suite. If you've never done this before, these are the seven or eight tests you should go run; if you do that and your system passes, you're doing pretty well, you're in a good spot. And we built some dashboards and reports so that leaders could see that across their teams, across their services: who has taken those steps, who has the most risk, who has run the most tests, who has the best score, who has the worst score, who needs a little bit of help. So I think that's all been germane. One of the things that we've heard time and time again from our customers is: that's great, you came up with what you think is a nice base test case. But in our company, we have our own standard of what good looks like. We have an architecture team that has thought deeply about what our tier-one critical services have to be able to withstand. And so one of the things I'm excited we're rolling out right now is the ability for people to build these test suites themselves and then roll them out across teams and across the organization. They can have a test suite for their SEV-1 services, their tier-one services, that sets a really high bar. And they can have another test suite for a tier-two service, where the cost of failure is a little lower, and so the bar is a little bit lower.
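Gremlin's actual test-suite format isn't shown here, but as a hypothetical illustration of the tiered idea just described, tiered suites might look something like the following, where the tier-one bar is higher because the cost of failure is higher. All scenario names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReliabilityTest:
    scenario: str   # the failure to inject
    required: bool  # must pass for the service to meet its tier's bar


# Hypothetical suites: tier one demands more because its failures cost more.
TIER_1_SUITE = [
    ReliabilityTest("lose-a-host", required=True),
    ReliabilityTest("lose-an-availability-zone", required=True),
    ReliabilityTest("dependency-latency-2s", required=True),
    ReliabilityTest("dependency-unavailable", required=True),
    ReliabilityTest("evacuate-a-region", required=True),
]

TIER_2_SUITE = [
    ReliabilityTest("lose-a-host", required=True),
    ReliabilityTest("dependency-latency-2s", required=True),
    ReliabilityTest("lose-an-availability-zone", required=False),  # advisory only
]


def reliability_score(results: dict[str, bool], suite: list[ReliabilityTest]) -> float:
    """Fraction of required scenarios that passed: a simple rollup a leader
    could pull up in a Monday staff meeting and track week over week."""
    required = [t for t in suite if t.required]
    return sum(results.get(t.scenario, False) for t in required) / len(required)
```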
But again, this gives leadership, it gives the CTO, the VP of engineering, the VP of customer success, the people that need this information, a view into: where is my risk? Where is the system doing well? Are we in a good position? And I think that's what's really germane to being able to go and reward the teams. What I foresee is this: we still need to have an incident and outage meeting, a postmortem where we talk through the failures and what happened. But we also need to have a forward-looking meeting. What is our risk? What is our test coverage? Have we done the right things? And yes, in that meeting we need to pay attention to the folks that need help, but that's also the meeting where we want to reward the people who never have an outage. Who didn't have a single outage last year? Who always met their SLAs? Who just delivered an amazing customer experience? So in my opinion, for engineering leaders out there who are thinking about an award to give at a hack day or an offsite: think about who your most reliable team is. Think about who your quiet, unsung hero is that keeps your system up and running and doesn't ask for a whole lot. And go give them some more. Go invest in that area of the business that's working well, and you'll get your dividends as a result.

Kolton, thank you so much for taking time out today to talk about this topic. I would love to chat with you again soon.

Thank you. Always, always enjoy the discussion. Great topics, great host. Appreciate you having me.