I hope you enjoyed the cheesecake and that it didn't put you to sleep. I'm a back-end engineer at Heroku, and I'm talking about how you can add knobs, buttons, and switches to your application so you can alter its behavior when things go wrong. We've all seen applications that keel right over when a single unimportant service is down. Let's not have that be you. This is an airplane flight deck, and I have fond memories of Captain Kirk yelling every week to divert power to the shields. This talk is about what kinds of levers you should have for operating your application when the going gets tough. I want you to feel like, when you're on call, you have that level of control over your application. So this talk is about application resilience, but only one part of the topic. This is what I call the just right talk, for the, well, not-just-right fires. It is not about the baby fires. Those are your casual, everyday failures. No single action that you take on behalf of a customer has a 100% chance of succeeding. Maybe they provided bad data. Maybe there's some conflicting state, whether that's between you and another service or between two other dependent services. Maybe that customer has found a particular race condition, or you've hit a network blip. Whatever the reason, that request, and many others like it, may not succeed. But those are not what I'm talking about today. This talk assumes that you already have functionality for retrying requests and for unwinding multi-step actions when you hit a snag six steps in. I've talked about those strategies at a previous RailsConf, and I want to highlight them because they will probably give you more bang for your buck, depending on where you're at. I'm also not talking about disaster recovery scenarios, those business-ending, terrible, horrific, catastrophic events. Something like: I am sorry, your database is gone, all backups have been lost, and aliens have abducted all of US East. Good luck.
So while this is the just right talk, it may be more useful for you at this moment to work on failures that are happening quietly right now, or to plan for the ones that you hope will never happen but might end your business. And while this is the just right talk, the entire talk may not be just right for you. I've been lucky enough to work at companies that care deeply about providing a great, reliable, and resilient customer experience, but how we provided those services to customers reflects what we value. When you have to make a difficult choice about what you do under bad circumstances, that choice is very particular to the size of your application, your customer base, and your product. So you may end up asking your product people, even your business owners: what do you want me to do in this situation? So what am I talking about? I'm talking about strategies that can help you shed load, fail gracefully, and protect struggling services, and we'll talk about seven tools that will help you do that. I'll go into some implementation details for each, and then I'll give you some buyer-beware warnings at the end. So let's jump right in. The first tool I wanna talk about is maintenance mode. Going into maintenance is your hard nope, your fail whale. It should have a clear, consistent message with a link to your status page. And most importantly, it should be really easy to switch on. At Heroku, we have this implemented as an environment variable. The key thing here is that it's one button you can press, not a series of levers and dials. You should not have to follow a very long playbook in order to get this working for you. The next one I wanna talk about is read-only mode. Most pieces of software exist to effect some sort of change in another system. I'm guessing for most of us, since it is RailsConf, the work our application performs is to alter a relational database. But it could be any number of things.
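As a rough sketch of that one-button maintenance mode: a tiny Rack-style middleware that short-circuits every request when a single environment variable is set. The variable name, message, and status URL here are illustrative, not Heroku's actual implementation.

```ruby
# Hypothetical maintenance-mode switch: one environment variable,
# checked on every request. All names here are illustrative only.
class MaintenanceMode
  STATUS_PAGE = "https://status.example.com" # stand-in URL

  def self.enabled?
    ENV["MAINTENANCE_MODE"] == "on"
  end

  # Rack middleware interface: wrap the app, answer 503 with a clear,
  # consistent message while the switch is flipped, otherwise pass
  # the request through untouched.
  def initialize(app)
    @app = app
  end

  def call(env)
    return @app.call(env) unless self.class.enabled?

    body = "We're down for maintenance. Status: #{STATUS_PAGE}"
    [503, { "Content-Type" => "text/plain" }, [body]]
  end
end
```

Flipping it on is then a single command (on Heroku, something like `heroku config:set MAINTENANCE_MODE=on`) rather than a playbook of steps.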
Think about what your application does for users, whether that's storing data in a database, transforming files and uploading them into a file store, or, for us at Heroku, launching containers on EC2 instances. Once you have an idea of what your application is modifying, think about what you can do if you can't modify that. What questions can you still answer? Some of you may be operating a very narrowly scoped service and the answer may be nothing, and that's fine. This is not the tool for you. But if you're at the classic Rails blog size, maybe larger, this can be very useful. Most people probably just want to read your blog. They don't want to alter it. They're not publishing. For my current job, the primary application I work on has a variety of disparate services, so we need finer-grained tools. This is not quite the tool for us, but it's a good first step. Again, the way we would probably implement this is through an environment variable, mostly just for similarity with the maintenance mode. But consider what tool you want to use, and use it consistently. Next, feature flags. Feature flags can be used for more than just new features. They can allow you to provide a controlled experience when part of your app isn't working. So imagine: what if billing, or selling new things, was behind a feature flag for you? There are different levels of feature flags that we find useful. First is the individual, per-user feature flag. This probably isn't very helpful for you during an incident. Hopefully your incidents aren't called for just one user. Then there's the global, application-wide level. As I mentioned, what if billing globally was a feature? For us that might also be freezing modifications to all the containers we're running for customers. But what we really find useful is the group level. At Heroku, we run users' applications on our platform, so for us, the most relevant groups are usually groups of applications running in a particular region.
They might also be groups of applications written in Go or Ruby. You'll want to think about what groupings are meaningful for your business, because it really ends up being a combination of what you want to control and who your users really are. The way we implement this is with a class that can answer these questions about current application state. This could be a normal Active Record model talking to a relational database. For us, that is currently what this particular class does: it talks to our database. But that's not necessarily the right choice for you. This model could be backed by Redis. It could be talking to an in-memory cache. Of course, an in-memory cache would mean that each web process has its own application state, which might be more complicated than you want. One of the most interesting options I could think of was curling a file named billing-enabled in a particular S3 bucket, if that is what you need to do to make sure this check doesn't fail when the thing whose failure you're trying to handle also fails. Sorry, too many failures in that sentence. But you would want to choose something that can answer the question "am I down?" even when the thing you're trying to control is also down. For groups: the previous example we looked at was billing enabled, and this one would be looking at the setting for billing for our U.S. customers. For groups, I really recommend having one switch per entire group. It may seem silly to have these strings, just tons of them, and this may not be your experience, but at 2 a.m. we find that strings really are easier to copy and paste than trying to instantiate an application setting model for billing, saying that it's for the U.S. group, and then toggling the enabled flag. Just a string works a little bit better for us.
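A minimal sketch of that string-keyed check, with a class-level hash standing in for the database-backed model (we back ours with Postgres via Active Record; the class and key names here are made up for illustration):

```ruby
# Illustrative application-setting switch keyed by plain strings.
# In production this lookup would hit a real store (database, Redis,
# even a file in S3); an in-memory hash keeps the sketch self-contained.
class AppSetting
  @flags = Hash.new(true) # every switch defaults to enabled

  class << self
    def enabled?(key)
      @flags[key]
    end

    def disable(key)
      @flags[key] = false
    end

    def enable(key)
      @flags[key] = true
    end
  end
end

# At 2 a.m., toggling billing for one region is a single
# copy-pasteable string, not a chain of model lookups:
#   AppSetting.disable("billing.us")
#   AppSetting.enabled?("billing.us") # => false
```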
A plain string also gives us more confidence that when we ask what the current state of the application is, we know exactly what we're getting. Next I want to talk about rate limits. Rate limits protect you from disrespectful and malicious traffic, but they can also help you shed load. If you need to drop half of your traffic to stay up, you should drop half your traffic. Your customers, that respectful traffic, may have to try two or three times to get a particular request through, but if they keep trying, they'll be able to do what they need to do. We see this strategy from AWS all the time. When we understand that they are rejecting a fair number of requests because they are under some sort of load, we start behaving in a way that is helpful to them and to us. We stop sending excess traffic and we start repeating our single most important request to them. Eventually that most important request gets accepted, then our second most important request gets through, and during that period we won't have been sending them tons and tons of traffic. Rate limits can also help you protect access to your application from other parts of the business that rely on you. Oftentimes the single application that a user sees is actually a mesh of different services all acting together to create a single user experience. And while you absolutely can make that internal system function when other parts of it are down, it can be easier to protect that preferred traffic, in addition to the strategies here, which will help you stay up even if parts do go down. So if you can prefer your internal traffic, it helps you continue to present that unified front to customers and keeps you looking up for longer. We implement rate limits as a combination of two different kinds of levers: a single default, and many modifiers for user accounts.
We find that this gives us the flexibility to provide certain users the rate limits that they need, while at the same time retaining a single control for how much traffic we are able to handle at any one point. Where we start is, again, an application setting. This is a global default rate limit. Here we're saying it's 100 requests per minute. Hopefully we can handle more than that, but let's just say that for easy math. And we have our customer here. Our customer starts with a modifier of one, and what this means is that to determine the customer's rate limit, we multiply the default of 100 by their modifier of one, which results in a rate limit, as you might expect, of 100 requests per minute. Now let's say this customer writes in and says, hey, I really have these legitimate reasons that I need twice as much traffic. We say great, we'll bump you up to a modifier of two, which means at the end of the day they get a rate limit of 200 requests per minute. Sometime later, we end up under a lot of load and we're not able to keep up, and we make the tough choice to say, hey, we need to cut our traffic. So we cut the default rate by half: it used to be 100, we're now at 50. What this means is that all of our accounts, including the preferred internal accounts, can be cut in half with one setting. Our customer here is back at 50% of her previous rate limit, 100 requests per minute, but that's still significantly above the default. So it allows us to rapidly cut incoming traffic without having to run a script over every single user to adjust their rate limit. I should mention that depending on your application, you may want to consider doing cost-based rate limiting. That may be a far better choice than request-based rate limiting.
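Sketched out, the default-times-modifier scheme looks something like this; the class and method names are my own invention, and in practice the default would live in application settings and the modifiers on per-account records:

```ruby
# One global default rate and a per-account multiplier. Halving the
# default halves every account at once, preferred accounts included.
class RateLimits
  attr_accessor :default_per_minute

  def initialize(default_per_minute)
    @default_per_minute = default_per_minute
    @modifiers = Hash.new(1) # every account starts at 1x the default
  end

  def set_modifier(account, modifier)
    @modifiers[account] = modifier
  end

  # Effective limit = global default * account modifier.
  def limit_for(account)
    (@default_per_minute * @modifiers[account]).to_i
  end
end

limits = RateLimits.new(100)     # global default: 100 requests/minute
limits.set_modifier("alice", 2)  # alice writes in with a legitimate need
limits.limit_for("alice")        # => 200
limits.default_per_minute = 50   # incident: cut all traffic in half
limits.limit_for("alice")        # => 100, still above the default of 50
```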
In cost-based rate limiting, you charge a user a number of tokens depending on the length of their request, so that they can't call your really slow endpoints as frequently as your blazing-fast endpoints. This is helpful if you're doing request-based rate limiting and you drop your users to maybe 50% of normal traffic, but they're still hitting that one horribly unperformant reporting endpoint because it's the end of the month and everybody needs their reports. You could still be under excess load, so you might want to consider cost-based limits if you have a lot of reporting endpoints that really tax your application at particular times. Finally, it may seem counterintuitive, but the more complex your rate limiting algorithm, the worse off you will be under denial-of-service attacks. The more computation time it takes for you to say that you can't process a request, the worse off you are when you're dealing with a flood of requests. This is no reason not to implement complex rate limiting if you need it, but it is a reason to make sure you have other layers in place to handle distributed denial-of-service attacks, and honestly, even the denial-of-service attacks that happen by mistake, when someone deploys a bug and you're getting hit over and over again. Next, stopping non-critical work. Let's say you're hitting limits on your database, maxing out your compute, or hitting the limits of some other dependent service. You should be able to stop any reports, any cleaners, any jobs that are making this worse and that don't have to happen in the next hour, or maybe the next four hours. You should be able to just turn them off. So how do we do this? Like the application setting, we have a report setting as a model here. Similarly, it takes a string, and we make sure that every report, every job, checks that it is enabled before it runs. So let's look at a quick code example.
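Here's roughly where that code example ends up, as a minimal sketch; in the talk the enabled check goes to our database, so the in-memory ReportSetting store here is a stand-in, and the class names follow the talk's example:

```ruby
# Stand-in for the database-backed setting: reports are enabled unless
# explicitly switched off, keyed by a plain string.
class ReportSetting
  @disabled = {}

  class << self
    def enabled?(name)
      !@disabled[name]
    end

    def disable(name)
      @disabled[name] = true
    end

    def enable(name)
      @disabled.delete(name)
    end
  end
end

# Parent class: every report or job checks its switch before doing work.
class Report
  def enabled?
    ReportSetting.enabled?(self.class.name)
  end

  def run
    return unless enabled?
    build
  end
end

# A child class is only responsible for building itself.
class MonthlyUserReport < Report
  def build
    "monthly user report" # placeholder for something very intensive
  end
end
```

During an incident, `ReportSetting.disable("MonthlyUserReport")` is the one change that stops the job at its next run.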
So let's say we have a monthly user report that responds to a run method, and it's gonna do something. Who knows what it does, but it has a decent chance of being very intensive. Before we do any work, we check to make sure that we're enabled. For our monthly user report, we implement a method called enabled that checks the report setting to see if this particular report is enabled right now. But let's make this a little more general. Let's make MonthlyUserReport inherit from Report, and say that MonthlyUserReport is really just responsible for building itself. It doesn't know much else. It's not responsible for knowing where it's gonna run or whether it should run; it just builds its report. The parent class, Report, then gets the additional features: it knows how to respond to run, and it can figure out whether its child class is enabled. This is really useful for reports and jobs. Having this as the standard means that any time someone creates a new job, it can by default be enabled or disabled with one change to, in our case, the database, though Redis, that S3 bucket, whatever you wanna use, works too. Next, known unknowns. I am confident that all of you have never shipped non-performant code, ever. But I definitely have. You don't know how the SQL you're shipping will perform for your biggest customers, and you might wanna have it under control if it does go haywire. We've had plenty of new features go out that we think are fine, we've done as much testing as we think is reasonable, but there's still that hair standing up on the back of your neck. So if you're scared of it, put a flag around it. If it's a new feature, we'll put it behind a feature flag. That's pretty straightforward, and we covered it a little bit already. But if it's a refactor, we usually have anything scary go out within GitHub's Scientist gem.
Using Scientist allows us to gradually roll out changes and refactors, but it also allows us to disable the experimental code immediately if we see any problems. And the great thing is, because it's so fast to disable, we can do it even before we're 100% confident that this is what's causing issues. The beautiful thing about having so many things configurable is that if you have a little bit of doubt, you can just turn it off. We find that eliminating those rabbit holes, things that might take one person an hour during an incident to look into just to prove that that's not it, is really helpful. We all have biases: you know, "I know that person's code, that's not gonna be the problem," or "that has to be it." Well, maybe it's just that a change went out right at the time that you saw the issue. Being able to turn suspect things off is a really great tool for moving you closer to the real problem faster. Finally, I wanna talk about circuit breakers. Circuit breakers allow you to be nice to the services that you depend on. They allow you to be a good neighbor. They allow you to not break those services, and to not swamp them as they're just recovering. Circuit breakers typically are responsive shutoffs. Responsive shutoffs look at all the calls you're making to a particular service, and they can be configured to look at particular metrics, whether that's the number of timeouts over the course of five minutes, or maybe a 50% error rate over ten minutes. Whatever you've configured them to look for, responsive shutoffs can automatically kick in and back off any calls to those services. That gives the services you're depending on time to recover, but it also frees up your web processes to not spend time calling down to a service that is most likely failing. Responsive shutoffs work far faster than any monitoring service.
A monitoring service has to page your on-call person, get them awake, get them to their computer, have them look up the right playbook, and then take action. The hope is that by the time you page in the on-call person, the responsive shutoff has already kicked in and you're in a better failure mode. But you can also use circuit breakers as a hard, manual shutoff. This would help you specifically keep traffic away from a struggling service. In some cases, you might want to allow high latency. Say you have a 29-second request to a service every once in a while. You don't want that kind of request to trip the circuit breaker. But that means that if that service enters a high-latency state where it's taking 29 seconds to respond to every single one of your requests, you're probably grinding to a halt, since your web workers are going to be tied up trying to resolve those downstream calls on behalf of your customers and not servicing the massive backlog of requests coming in. In that case, while you wouldn't want the circuit breaker to automatically trip, you may want to manually turn the service off. The other use for these manual shutoffs is a misbehaving service. This is usually for internal services, where engineers can be a little more honest with each other. If you have an internal service that's responding 204 and you know it's just dropping requests on the floor, a 503 error can actually be better for your customers than allowing those two services to drift out of sync, or than telling your customers that something's gonna happen when it never does. These would work in a similar way to how our monthly user report worked.
In the same way that MonthlyUserReport inherited from a Report class, our billing service client would inherit from a Client class that sets up circuit breakers by default for any of its children and keeps track of those individual circuits, again backed by anything you want, whether that's in-memory state, a shared cache, or a data store. There are a number of good circuit breaker gems out there that you can just include that will have support for this, so I won't get into implementation too much; please go read their READMEs, they're lovely. With all these approaches, I would highly recommend writing tools to manage these circuit breakers that do not assume a developer typing into a production console, as I have here. A case in point: how many of your on-call engineers know and remember enough about electrical engineering at 2 a.m. to confidently say whether "open" means sending requests or not? If you watch me giving this talk at RubyConf Australia, these will be flipped, because I did not remember. In this case, a circuit being open means that communication to the service is closed. So I would highly recommend writing a tool that allows your on-call engineers to see whether a service is off and to turn it off or on, or some other vocabulary that is universal and hard to misconstrue, hopefully not at 2 a.m., but, if needed, at 2 a.m. All right, so I wanna talk a little bit about implementation. With all these buttons and switches, you really want to consider carefully how you form them and where you store their state. You have a number of options, some of which I've listed here: a relational database, a data caching layer, environment variables. You can even have them in your code as a last resort. If you think a deploy is a reasonable way to control for failure, then absolutely, have a place in your code with a comment that says, hey, change this line, and then push it out to production as quickly as you can.
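To make the vocabulary point concrete, here is a sketch of a per-service switch that never says "open" or "closed", only off and on. The class name and the automatic failure threshold are my own illustrations; a real implementation would lean on one of those circuit breaker gems.

```ruby
# Hypothetical per-service switch with unambiguous vocabulary:
# off? means "not sending requests", no electrical-engineering
# knowledge required at 2 a.m.
class ServiceSwitch
  def initialize(name, failure_threshold: 5)
    @name = name
    @failure_threshold = failure_threshold
    @failures = 0
    @manually_off = false
  end

  # Manual shutoff for a struggling or misbehaving service.
  def turn_off
    @manually_off = true
  end

  def turn_on
    @manually_off = false
    @failures = 0
  end

  def off?
    @manually_off || @failures >= @failure_threshold
  end

  # Wrap calls to the downstream service; trip automatically after
  # too many failures (a crude "responsive shutoff").
  def call
    raise "#{@name} is off" if off?
    begin
      yield
    rescue
      @failures += 1
      raise
    end
  end
end
```

An on-call tool built on this can simply print "billing: off" and offer turn_on / turn_off, which is much harder to misread than open versus closed.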
For us, we're a deployment platform running on a deployment platform, so usually that option is not available to us. But it gets at this point: consider whether flipping a switch would require access to a component that could be down in exactly the way you'd want to use the switch to control. Does it require access to a running production server, and what happens if you can't communicate with that server? How might you change the behavior of your application if you can't deploy changes? If you have immutable infrastructure, that might mean environment variables are totally out of the question for handling certain failure cases. One of the reasons we rely so heavily on databases for storing our application state is that we have high confidence that our wonderful Postgres team, thank you, Gabe, will be able to get us access to the database in order to manually run SQL statements to flip certain bits. In many cases, instead of the lovely Ruby code, our failure states would end up being us running SQL to flip a switch, and our still-running but badly behaving application would discover those changes. A final note is to really consider how much work it would take to figure out whether a switch is flipped. Because in general, the fancier and more sophisticated your switch is, the more likely it is to become part of the problem, or to confuse your engineers such that it is eventually the entire problem. And with that, I promised you some buyer bewares, some caveats, so here they are. First is visibility. Remember this picture? We've built a lot of knobs and switches in this talk, but you haven't actually seen a dashboard. That's because you'll need to build one, whether it has a lot of pretty graphics or just command-line output.
Having something that can pull from the different places where you're storing the state of your buttons and switches and combine it into one comprehensible place is really important for incident operations. Clearly understandable is not a bar that we meet at the moment, but we can discover the state of every single switch, even if it's way too much output. We're working on that. Next: does it actually work? How many of you have tested your smoke detectors in the last month? Excellent, all right, congrats. You may end up being surprised at your dependencies, and more interestingly, you may be surprised at the dependencies that your dependencies have. Especially if you're working with vendors, they might be running on the same infrastructure that you are. So if it's a critical switch, perform game days. You really don't have the confidence to know that it will work until you've actually tried it. Of course, with that mention of vendors: you can try to turn off certain things, but if you don't have complete confidence about what other people's work is built on, it can be really hard to tease out what exactly you need to turn off in order to simulate a complete outage of a particular component. This leads me to my next and final point. We're really trading knowledge for control here. The more configurable you make your application at runtime, the less confident you can be that it will work in predictable ways. Have you tested for this user when flagged into three things, flagged out of two, and with a service shut off? I'm guessing not, and if you did test that, I'm questioning the size of your test suite. And more than just unit tests: keeping production, staging, and development environments in the same state has been a problem for many of the teams I've worked on, and I don't know of a good solution. And yet, while you are trading knowledge for control, I'd still take this deal any day.
I'd rather have control over my app to mitigate issues than to know confidently the exact and particular way that my app is down and have no way to do anything about it. So that's it, thank you. I hope this has given you some ideas about ways you can make your application a little bit more resilient to the fires you inevitably will see. I do work at Heroku, and we have two other lovely speakers. Stella starts tomorrow morning, right after the keynote, followed by Gabe. So if you wanna learn about using Kafka in Rails, or how Postgres 10 is gonna make your life awesome, please check us out there, and obviously come by our booth, which opens tomorrow as well. So thank you, and I am happy to take questions for seven-ish minutes, or to have y'all disperse. Yes, yep, yep, yep. So the question is, when do we start thinking about adding a new knob or switch? Usually after an incident. Some of them are longer-term, more thoughtful things, but at this point most of the new ones come from something going wrong that we didn't have the ability to control, and we don't wanna have that happen again. Yep, how do we train new developers? I think that ties into how we onboard people into on-call. We rely on a couple of things. First of all, shadowing. We really do try to get people comfortable with the idea of being on call by having them shadow on-call engineers during the day. They're not getting any pages, but they're in there, they're seeing the person's screen. We do have documentation, but honestly, if you're in a really tired state, the likelihood that you are going to think, oh, let me read through these 50 pages of documentation, is next to nothing. So we really want to get people to the point that they know what they're searching for through our docs, and, hopefully with a playbook, it's a little more directed; you know which switches you're looking for.
But again, if you're in a kind of information-discovery phase during the incident process, we're probably four or five hours in. So yeah: lots of shadowing, encouragement to read the docs, and then a strong reliance on telling people they really should just page someone. As the secondary for a new person going on primary, I am more than happy to be woken up. It just needs to happen every once in a while, and I want them to feel supported. So making sure that they are totally okay regardless of the hour, and that I am relatively chipper when I am in fact paged in, is really important to us. So yeah, no magical system to it, just making sure people feel confident and aware of things. Where do we store state? Okay, as I mentioned, we primarily store state in Postgres, and also in Redis, because again, we have confidence in our data team: their infrastructure is separate enough from ours that if something catastrophic has happened to us, most likely they're gonna be able to get us in. All right, well, I see people queuing at the back for the next talk, so thank you very much everyone. Yeah.