The mic is on. My name is Kelsey Peterson, and today I'm going to be talking about simulating incidents in production.

So it's 3 a.m. I had started working at Stitch Fix a few months earlier, it was my first on-call rotation, and of course I get paged. I roll out of bed, I open up my computer, and I'm seeing errors that the application won't load. I need to fix this issue as quickly as possible, because the users of our software use it at all hours of the day. But at first glance it was hard for me to find metrics on how many users the issue was actually affecting. After 10 minutes of digging through logs and graphs, I see on a custom dashboard that we're experiencing issues because a dependent service is down. But it's a service my team doesn't own. So I'm left wondering: what can I even do about it, and how is it affecting our users? Like I said, I had little help from our dashboards, I didn't know how to immediately solve the issue, and I couldn't punt it until the morning.

I think this is a common problem for engineers. We're expected to support applications that we don't fully understand and don't feel fully prepared to support. We rarely get time to practice and hone our response tactics, since the issues that arise on call are not always the same, and we only respond to them in real time. And the ultimate goal while we're on support is to solve issues as efficiently as possible, not necessarily to learn from them. That often leaves us feeling confused, stressed out, and possibly even incompetent at our jobs, which obviously isn't great.

Our jobs, I think, consist of two main duties. We're expected to build new, shiny features to satisfy the business and bring in revenue for the company. And we're also expected to support existing features. But because we feel neither ready nor practiced at solving these incidents efficiently, we're actually not that good at the second part of our job, being on support. So today I'm going to be talking about how practicing incidents in production can help us do a better job at both building new features and supporting existing ones.

A little bit about me: like I said, my name is Kelsey Peterson, and I work at Stitch Fix. Just out of curiosity, how many of you here in the audience have heard of Stitch Fix? Man, okay, that's amazing. It's cool over time to see more and more people in the audience be familiar with Stitch Fix. For those of you who didn't raise your hand, Stitch Fix is a personalized styling service for both men and women. How it works is you fill out a profile online and give us your style, fit, and price preferences. You're matched with a stylist, and the stylist is the one who actually selects the items of clothing. They show up on your doorstep, you try things on, you keep what you like, and you return what you don't.

I work on our styling engineering team. We build and support the applications our stylists use to select the items of clothing that are sent to our clients. It's really important for these applications to have high uptime, not only because the business needs them to run properly and keep the lights on, but also because our stylists are paid hourly and depend on them to do their job.
This talk has two main themes. First, we're going to talk about injecting failure into our systems, which we'll do by simulating incidents in production. And second, what's the result? How do we build more resilient systems by running these simulations within our software?

Injecting failure into a system and practicing the response isn't a new concept. A lot of other professions train this way. One example up here is firefighters, which is sad but very relevant to us being here in LA this week and seeing the great work all the firefighters are doing. They go through months and months of training where they're expected to practice responding to different types of incidents before they're even put on the job. Another example is doctors. Doctors go through med school and residency and practice procedures hundreds of times before they actually work on patients. We as engineers don't necessarily deal with life-or-death situations, but we do perform incident response on an almost daily basis while supporting our applications. So today we're going to talk about making incident response practice a priority within your team, to enable us to build more resilient systems.

Taking a step back, I think there are a few reasons why we don't do this already. First is learnings: it can be unclear what we're actually getting out of running these simulations in production. Second is time: there are always competing interests. You want to be building new features, you want to be satisfying the business, and you may be getting pressure from your PM or from other managers. But today we're going to see that there's a huge benefit to running these types of simulations within your system. And third, I think there's a perception that running simulations in production is really complex and may take a ton of time to implement. Netflix pioneered the idea of chaos engineering, injecting failure into your system, with Chaos Monkey, which is perceived as being really DevOps-y and maybe outside of a lot of our wheelhouses.

Just out of curiosity, how many of you have heard of the concept of chaos engineering? Cool, awesome. Like I said, it was pioneered at Netflix probably five, six, maybe seven years ago by Bruce Wong, who led the team at Netflix and is now a director here at Stitch Fix. So I'm going to dive a little bit into how chaos engineering works and then talk about how we do it at Stitch Fix.

There are a few key parts to chaos engineering. First is the scenario. Scenarios are like playbooks: what are you actually going to simulate in production, and what are you trying to have fail? Scenarios vary widely and are very application-, team-, and company-specific, but I'm going to give you a few examples to start out with in a few minutes. Second is the team. A huge part of chaos engineering is gathering expectations and discussing them with your team, so when you're simulating these incidents within your organization, you really want everyone on your team to participate. And third is the game day.
The game day is when you actually run the scenario with your team. It's usually a specified period of time; you're all on a video conference and you're actually running the scenario.

So like I said, today we're going to be talking about chaos engineering at Stitch Fix. This is a project I led with the mentorship of Bruce Wong, who pioneered chaos engineering at Netflix, and a handful of other people who helped bring the project to life. This is really the first instance of chaos engineering at Stitch Fix, and hopefully over the next year other teams besides styling engineering are going to adopt it as well. By the end of this, we're going to have walked through nine steps for running a simulation in production, and I hope you'll walk away able to run a game day on your own team.

First, we want to define what type of failure we want to simulate. This is one of the hardest parts: where do I start? My suggestion is to keep it as simple as possible. Think about the frequency and the impact of the different things that can go wrong within your application. Here are a few ideas on where to start. First, most applications can experience some sort of database failure; at Stitch Fix we often have issues with connections to the database. Second is a flaky container: if one of your containers is down, how does that affect your users? Third is external services. We at Stitch Fix, like a lot of companies, use third-party providers; we use Braintree to handle payments and Zendesk for customer support. What happens if one of those external systems goes down? And fourth is internal services. At Stitch Fix we're moving to a microservice-based architecture, so a lot of our applications are now dependent on internal services being up.

Today we're going to talk about internal services, because it's an issue that hits close to home. It's not specific to styling engineering; it's applicable to every engineering team at Stitch Fix, so the implementation we worked on can be easily replicated on other teams as well. The scenario we're defining here is that an internal service returns a 500, and that's what I'm going to dive into next.

Second, we want to implement code that's going to enable our game day. Like I mentioned earlier, there's a misconception that chaos engineering only happens at the server level, but there's a light-touch way to implement it here with Ruby. This is obviously application-, team-, and company-specific, but I'm going to dive into how we implemented it at Stitch Fix. We chose to inject failure at the middleware layer through Faraday middleware. Faraday is an HTTP client library that lets us hook into the request/response cycle and alter the response we get back. At Stitch Fix we wrote a custom middleware class that alters the response we get back when an app requests data from an internal service.

So how did this look in code? First, we start by creating a new Faraday connection object. This object takes an options hash, where we pass in things like the URL and request options. We're not doing much yet; we're just creating a basic connection object.
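The slides with the actual code aren't reproduced in this transcript, but a minimal sketch of that basic connection object might look something like the following; the service URL and timeout values are placeholders, not Stitch Fix's real configuration.

```ruby
require 'faraday'

# A bare Faraday connection object: the URL points at a hypothetical
# internal service, and the request options just set timeouts.
connection = Faraday.new(
  url: 'https://inventory-service.example.internal',
  request: { timeout: 5, open_timeout: 2 }
) do |faraday|
  faraday.adapter Faraday.default_adapter
end
```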
Next, we wanted to write our own custom middleware class. Here we're calling the class ResponseModifier. Faraday middleware are classes that implement the call instance method, which is why you see app.call in there. Then, within the on_complete block, we're able to alter the status code on the response env. This is really the heart of the technical implementation of how we run this game day in production: we take the classic all-good status code of 200 and override it with a 500. Below that, you can see we register the middleware as response_modifier as well. We added this class to a gem used across Stitch Fix to make service requests from applications, so it's universally accessible. That's another thing I'd point out as you and your team think about implementing a game day: think about how not only your team can use it, but everyone else can too.

Once we've created the middleware, we actually want to add it to our connection object. By adding response_modifier there, every request that goes through will have its status code changed from 200 to 500. But that leads me to my next point: we probably don't want to run this against all users in production. At this point at Stitch Fix, that would bring our company down, and this is our first run at it; we obviously don't want to have a huge negative impact on our business. So we used feature flags. We have two different feature flag tables at Stitch Fix, and we created a new feature flag called run game day, and we selected a handful of people to be part of the game day. That's how we run it in production without actually bringing everything down. To implement that, we add a new config option, response_modifier, and we only pass in true if you're part of the game day. We had a dozen or so people on this feature flag as we ran it, but not everyone was affected. And then, in the other part of the Faraday connection object, we only modify the status code if the response_modifier config is true.
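Again, the original slides aren't in the transcript, so this is only a sketch of what the middleware and the flag-gated connection could look like. The ResponseModifier class follows the Faraday conventions described above, while FeatureFlag, its user list, and the service URL are hypothetical stand-ins for Stitch Fix's internal feature-flag tables and services.

```ruby
require 'faraday'

# Custom Faraday middleware: rewrite a successful response so it looks
# like the downstream service failed.
class ResponseModifier
  def initialize(app)
    @app = app
  end

  # Faraday middleware are classes that implement the #call instance method.
  def call(env)
    @app.call(env).on_complete do |response_env|
      # Override the classic all-good 200 with a 500.
      response_env[:status] = 500 if response_env[:status] == 200
    end
  end
end

# Register the middleware under the :response_modifier key.
Faraday::Response.register_middleware(response_modifier: ResponseModifier)

# Stand-in for the internal feature-flag tables described in the talk.
module FeatureFlag
  GAME_DAY_USERS = %w[stylist-1 stylist-2].freeze

  def self.enabled?(flag, user)
    flag == :run_game_day && GAME_DAY_USERS.include?(user)
  end
end

current_user = 'stylist-1' # whoever is making the request (hypothetical)

# Only wire the failure-injecting middleware in when the current user is
# on the game day feature flag.
connection = Faraday.new(url: 'https://inventory-service.example.internal') do |faraday|
  faraday.response :response_modifier if FeatureFlag.enabled?(:run_game_day, current_user)
  faraday.adapter Faraday.default_adapter
end
```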
Third, after we've decided on our scenario and implemented it technically, we want to gather expectations. It's really important to do this before the game day even starts, so we're not biased by the actual outcome. The way we did this on our team was a Slack poll, which was really simple to send out. One really interesting takeaway is that not everyone was on the same page about what we thought was going to happen; before the game day even started, we were already uncovering some really interesting clues. The majority of the team thought the page would render but the data just wouldn't show, and there was one outlier, Erin, who thought the page wasn't going to load at all.

So now the game day is about to start, and it's really important to huddle together as a team. I work on a remote team, so we did this over video chat, but if everyone on your team is in the same place, you might want to get a war room or get everyone into the same conference room. You can also see here that I'm sharing my screen. We've started a document, and we're diving into metrics before the game day even starts.

Next, we want to talk through expectations. Again, this is before the game day starts. This is where we can debrief on why people thought the app was going to react in different ways, and we had a really fruitful discussion about why certain people thought just the data wasn't going to load and why some people thought the app wasn't going to load at all. Here are a few examples of where to start when thinking about expectations. How is the app going to respond? Which pages are going to load? What is the user going to see, and what is the user experience: are they going to see error codes, spinners? What alerts are you going to get, whether that's Bugsnag, Pingdom, or however you manage alerting within your system? Where will you find those alerts: will you get paged, a Slack notification, an email? What will the dashboards show? What will the metrics show? What docs do you have to resolve the incident? And how will the data store be impacted? These are just a few examples of where to start, but like I said, they led to a really fruitful discussion about what we expected to happen before we even ran the game day. The other key part of this doc is that it also had the steps for how to start the game day and how to revert it. So as we ran the game day, this doc was our source of truth, one could say.

Now we're at step six, and we're finally at the point where we can run the game day. We've determined the scenario, implemented the code, gathered expectations, huddled as a team, and talked through the expectations, and now we're actually at game time. One interesting thing to note: I think some people might want to jump straight into the meeting and run the game day immediately, but when we ran this simulation we were already at minute 40 of our time slot. We had talked for 40 minutes through all those bullet points, through all the expectations, through the graphs, and we had gathered all this really relevant information that's going to help us moving forward. But now we're at the heart of it: we want to run the game day, and it's game time.

The way we implemented this is with a rake task where we could run start, which essentially just allocates users to the run game day feature flag (a sketch of both tasks follows below). Those users are added, and the game day is now live. So what happens? Who was right in the poll? Is there just no data, or does the app explode? To the surprise of everyone except Erin on our team, the app exploded. It was very shocking to all of us, and we got this, which was surprising. We're already like, wow, we're learning all these cool things, and we're really glad this is behind a feature flag and isn't actually happening live in production, where we'd have to solve it right now as a real incident. So we see this message, "We're sorry, but something went wrong," a pretty standard Rails error page. We start diving into the metrics, and we see in Bugsnag that we're getting errors. We're like, yes, it's working. You can see the little blip right here; that's our game day. But we don't get paged, which is really interesting, and we start diving into that. I think that's actually because there weren't enough people in the scenario to push the error frequency high enough to trigger a page, but it's something we want to dive into a little more in the future.
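The talk doesn't show the rake tasks themselves, so here's a minimal sketch of what a start/end pair could look like; the game_day namespace, the FeatureFlag helpers, and the participant list are all assumed stand-ins for the internal feature-flag tables, not actual Stitch Fix code.

```ruby
# lib/tasks/game_day.rake

# Hypothetical list of the dozen or so people opted into the simulation.
PARTICIPANTS = %w[stylist-1 stylist-2 stylist-3].freeze

namespace :game_day do
  desc 'Allocate the selected users to the run_game_day feature flag'
  task start: :environment do
    FeatureFlag.enable(:run_game_day, users: PARTICIPANTS) # stand-in helper
    puts "Game day is live for #{PARTICIPANTS.size} users"
  end

  desc 'Pull the users back out of the flag and end the simulation'
  task end: :environment do
    FeatureFlag.disable(:run_game_day) # stand-in helper
    puts 'Game day is over'
  end
end
```

With tasks like these, you'd kick things off with `rake game_day:start` and revert with `rake game_day:end`.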
So now that we've seen the app crashing for those users, we see that Bugsnag is alerting us properly and that PagerDuty isn't reacting the way we wanted it to, and we consider the game day over for now. We've already learned a ton: the app is not resilient enough to handle a service returning a 500 at this point. So we run our next command, rake game day end, and the game day is over. The users are pulled out of the feature flag, and for now it's turned off.

For the remaining 10 minutes of our game day huddle, we revisit our expectations. Going back to the poll, Erin was the only one who correctly guessed that nothing in the fix would load. I think it's really interesting to point out that Erin is actually the most junior engineer on our team. My manager and a handful of principal engineers thought the app was going to react in a different way, so it was really cool to see that Erin was the only one who got it spot on. One way to potentially improve this in the future is to run a Google Form or survey instead, so the votes are private and people aren't influenced by each other.

Going back to the doc I pointed out earlier, we have the list of expectations and the set of instructions, and then there's the third part: the learnings. What did we actually learn from the game day? We spent the last few minutes of our war room huddle talking about how our expectations differed from reality. What metrics did we wish we saw but didn't? What runbooks did we wish we had but didn't? This is really the meat of what we gain from these simulations, and it created a to-do list of things we want to focus on as a team to make our systems more robust.

The next takeaway was to write and edit runbooks. This was part of the learning section at the bottom of the doc, but one really practical and easy-to-visualize way to think about the runbooks we want to create is as decision trees. What happened? Did the app load? Can we view the client data? Can we view the inventory? (Client data and inventory are two essential parts of our system.) Can we view the details and use other functionality, like putting items in the cart? From this decision tree we can then pull out different dashboards and sets of instructions, so we have more information at our fingertips while we're actually reacting to these incidents on call.

The next part is updating our dashboards, which ties into the decision tree: if something goes wrong, how do we know we have metrics that give us visibility into what's actually happening? We use Datadog for our metrics, and we found out that a lot of people didn't even know how to find our dashboards in Datadog, which was really interesting. We learned that you can create dashboard lists, and that we actually have one for our styling team, but a lot of people didn't know how to access it. So one big takeaway for us is that, first of all, we need to make our metrics more discoverable, making sure people know where to find this information. And second, within the dashboards themselves, are they actually showing us the information we want to see? Another part of this implementation that I didn't cover in the code walkthrough is that Datadog allows us to tag metrics. We enabled tagging based on the game day, so we could filter down to only the metrics where game day equals true.
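The transcript doesn't show how that tagging was wired up. As one possible sketch, assuming the dogstatsd-ruby client (the talk doesn't name the client), a metric emitted during the simulation could carry a game_day tag like this; the metric name and the other tag are hypothetical.

```ruby
require 'datadog/statsd'

# dogstatsd-ruby client pointed at the local DogStatsD agent.
statsd = Datadog::Statsd.new('localhost', 8125)

# Tag the metric so a Datadog dashboard can be filtered down to
# game-day traffic only (game_day:true).
statsd.increment(
  'styling.internal_service.request_failed',
  tags: ['game_day:true', 'service:inventory']
)
```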
So just to summarize what we've talked about today: before the game day, we want to define the simulation, implement some code to enable it, and gather expectations from the team, maybe in a poll or a Google Form. During the actual simulation, we want to huddle as a team, talk through the expectations, and then actually run it. And finally, after we've run the game day and seen the outcome within our application, we want to revisit our expectations, write and potentially edit runbooks, and update our dashboards.

But if we go back to one of my first slides: during my first on-call rotation, like I said, I felt stressed, I felt confused, and I felt a little incompetent. I didn't feel resilient to on-call issues myself, and neither did the rest of my team. Today I've talked a lot about resilient systems in terms of technology, in terms of code, in terms of the happy path versus the sad path. But I think there's a broader benefit to running these simulations on our teams: not only do they improve the way our code reacts, they improve the way we as humans react to these on-call incidents. Through game days and practicing incident response, we learn more about our systems and we learn more about ourselves, building both applications and people that are more resilient to outages. And I think that's the ultimate, and also very attainable, goal: to build strong technical and human systems by practicing incident response. Thank you.