Right now we're going to listen to "Circuit Breakers" by Scott Triglia.

Well, thanks everybody for coming out. Let's talk about circuit breakers today. Before I get started, I want to give a quick shout-out to Yelp, my company, who was kind enough to send me here. If you don't know, Yelp is a website where you can discover excellent local businesses wherever you happen to be. We've got a booth outside, and we're happy to talk to you more about what we do. We're a pretty big website as well: we have 90 million monthly users, over 100 million reviews (which a lot of our code deals with), and we've launched in 32 countries, including much of Europe. As for myself, I work on revenue infrastructure, the kind of code that helps you purchase products on Yelp, whether you're a consumer like you or me or a business owner. And if you're inclined toward Twitter, you can find me there as Scott Triglia.

All right, with all that out of the way, let's talk about some circuit breakers. In the old days, we had a very nice model of the world: pretty much any code we wanted to call, whether simple or complex, was executing in the same Python process we were calling it from. So if I want to do something nice like display a business's name in a nicely formatted way, I call a function to do it, and I can be pretty sure that function is going to execute, and do so cleanly.

Unfortunately, that assumption that code is running locally, executing in-process, is less and less true these days. A lot of different trends are causing this: things like Docker, and the microservices or service-oriented-architecture movements in general, are introducing more and more network boundaries into our code. The function that used to execute locally in our interpreter may now be on some remote server running elsewhere; we don't even know where it is. The unfortunate reality is that this can introduce a lot of problems that didn't used to exist. Things we assumed would always work suddenly have strange failure modes: maybe I get a 500 from some HTTP API, maybe I just see a lot of slowness with no explanation of what's going on. There are a lot of ways this can go wrong.

And that is essentially where circuit breakers fit in. Circuit breakers are components we build in between ourselves and our remote dependencies, and they're there to do two things: detect when the system is unhealthy, when we can't communicate with that external API or whatever it is, and then do something about it, often something like blocking requests or turning off features.

Wherever you hear circuit breakers mentioned, you're likely to see references to Michael Nygard's book Release It!. It has a lot of ideas in it, a lot of ways to build good software, and two pages out of an otherwise fairly large book are dedicated to this idea of circuit breakers. So we have three goals today. First, I'm going to introduce Nygard's circuit breaker: what it does, what the basic idea is, and why we should care at all, what problem it solves for us.
Second, we're going to walk through a variety of ways we can take that basic circuit breaker and expand it to solve different problems. We'll motivate this by talking through a few scenarios, getting an idea of the ways the initial circuit breaker is insufficient and what we can do about it.

To do all of this, I first need to introduce you to my favorite restaurant back home in San Francisco, Jim's Diner. Jim's is a very old-school American diner experience: vinyl booths, you sit down at tables, and waiters come to take your order and carry it to a back kitchen. It's that process of ordering food that we're going to use as a stand-in for a lot of different things you might want to do with remote services. Let's briefly walk through the ordering experience, in case it doesn't translate across countries very well. You start with the menu: you sit down at your table and are given a list of items you might order. Step two, you decide what you want to eat and tell the waiter, who writes it down on a ticket. They take that ticket to the kitchen and say, please make this order. Step three is, hopefully, the kitchen processing your order nice and efficiently. And step four is, of course, the customer getting their food back and enjoying it.

The fundamental rule of any of these modern network-oriented architectures is that you have to accept that your system is going to fail. It may not be your code in particular, but because you depend on these things, many of which are significantly out of your control, something is going to fail. It's not a question of whether you can avoid failure entirely; it's a question of what you're going to do when it occurs.

So let's look back at that example of eating at Jim's and see how failure might factor in. We've ordered our food; the waiter has taken it down on a ticket and placed it in front of the cooks. And something unfortunate might occur, something we don't totally understand, maybe something we aren't even aware of: the order that was supposed to be processed in a timely manner by the kitchen is instead entirely forgotten. Now, in a real restaurant, you'd hope the reaction would be something reasonable, like a waiter coming to tell you that the kitchen is far behind and your order needs more time. But in the code we actually write in practice, we give very little thought to this failure mode; we might not think of it at all. And the real result can be really negative, in fact completely preposterous: the customer just sits there waiting forever, and if we ever said that out loud, we'd say, wow, that's a really bad situation. Essentially, our goal today is to make sure we're aware of this problem and have built systems that can handle it automatically.

The first way we're going to talk through that is Nygard's basic circuit breaker. Here's a slightly idealized schematic of Jim's restaurant; this is approximately true, and we'll see why it's designed this way in a moment. On the left, we have a number of sections.
One section contains several diners, but most importantly, a section is served by a single waiter. The waiter comes to the individual diners, collects the orders we talked about earlier, carries them across to the right side of the diagram, and drops them off at the kitchen. There are a bunch of cooks in the kitchen, and as they become free, they pick up the next order, cook it, and hand it back to the waiters to bring back to the customers.

I said earlier that this is a general stand-in for a number of things, so I want to briefly convince you that this is a generic interaction model. Here's what it might look like if, instead of customers at a restaurant, we were talking about back-end services and some HTTP-based API. Likewise, if you're dealing with task queues or any sort of slow back-end process with workers, you can fit that into this model as well.

With all that said, we're going to focus on the dining use case, and what we're essentially going to do is build circuit breakers into this model. The ones we'll start with sit on each waiter individually, so every waiter has their own circuit breaker paying attention to the success or failure of their orders. In Nygard's basic model, there are three essential states. Nygard's traditional names for them are closed, open, and half-open; I don't know about you all, but I find those horrifically confusing, and I'm not going to use them. If you're a purist, stick with those; I find healthy, unhealthy, and recovering a little more intuitive. Healthy is the good state, as you might imagine: if requests are flowing through successfully, the circuit breaker recognizes that things are going well and declares itself healthy. Unhealthy is the exact opposite: if no requests are succeeding whatsoever, we declare the system unhealthy and hopefully take some corrective action. And recovering is an intermediate state where we aren't quite sure whether we're good or bad and want to decide between the two.

A very basic circuit breaker in Python could look something like this. Each waiter starts by asking: do I believe the system is healthy right now? In the good case, where it is healthy, they just send the request on to the kitchen, take your order, and pay attention to whether it succeeds or not. In the bad case, we block the request up front: I read out my order as a customer, say I'd really like bacon and eggs, and the waiter says, sorry, the kitchen is unhealthy, I can't accept your order. And we might be in that middle case, recovering, trying to decide what state the system is in. There, the basic Nygard circuit breaker takes this approach: we start by waiting a certain number of seconds, maybe one, maybe five, to give the back-end system some time to recover. Then the very next request that comes through, we pass to the backend: if it succeeds, we declare ourselves healthy; if it fails, we declare ourselves unhealthy. And we repeat this cycle until we become healthy again.
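The slide code isn't captured in the transcript, but a minimal sketch of a breaker like the one just described might look as follows. The class name, failure threshold, and recovery wait are my own stand-ins for illustration, not the talk's code:

```python
import time


class CircuitBreaker:
    """A minimal three-state breaker: healthy, unhealthy, recovering."""

    def __init__(self, failure_threshold=5, recovery_wait=5.0):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.recovery_wait = recovery_wait          # seconds to wait before a trial request
        self.failure_count = 0
        self.state = "healthy"
        self.tripped_at = None

    def call(self, request_fn, *args, **kwargs):
        if self.state == "unhealthy":
            if time.time() - self.tripped_at < self.recovery_wait:
                # Still in the waiting period: block the request up front.
                raise RuntimeError("circuit breaker open: request blocked")
            # Waiting period is over; the very next request is the trial.
            self.state = "recovering"

        try:
            result = request_fn(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        else:
            # Any success (including the recovery trial) marks us healthy.
            self.state = "healthy"
            self.failure_count = 0
            return result

    def _record_failure(self):
        self.failure_count += 1
        if self.state == "recovering" or self.failure_count >= self.failure_threshold:
            self.state = "unhealthy"
            self.tripped_at = time.time()
```

Each waiter would hold one of these and route every order through something like `breaker.call(place_order, ticket)`, catching the blocked-request error to apologize to the diner.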
So that's the basic circuit breaker. It's not terribly complicated, but let's explain why it actually does us any good in that form. Before we had a circuit breaker, one effect of the kitchen slowing down was that diners would start to wait a very long time for their food. We saw our skeleton friend earlier who had waited an extraordinarily long time, and the way most of us write this code in practice, there may not be any timeout of any sort on the client side, so they just wait forever. Meanwhile the kitchen, the backend, is also accumulating a growing backlog: maybe they normally have five orders in process, and as they get slower and slower they have 10, then 30, then 100, and they're hopelessly behind. And every new diner entering our restaurant is actively making the situation worse. Maybe in normal operation we expect roughly 10 people dining at the same time, but if everything starts slowing down this way, we get 20 people in the restaurant at once, then 100, compounding all the same problems.

A circuit breaker does make this better. For starters, we actually get fewer frustrated users. That might be a little counterintuitive, given that we're blocking their requests immediately, but it turns out that in most use cases that's preferable to waiting who knows how long and then being told the same information. We also reduce load on the backend, so the kitchen, rather than having those orders pile up, gets a bit of a breather and can recover from whatever the problem was. And maybe most importantly, we have a place in code where we can define a failure mode for our system. We can say: if the system is unhealthy in some way, I know exactly what's going to happen, and it's in this part of the code. It turns out that's really valuable as we start to expand and make this more complicated.

So that is the basic circuit breaker. I want to point out a few assumptions we're making, and we'll use those to launch into some more detailed discussion. The first thing we're assuming is that all those waiters have independent circuit breakers, and that introduces the interesting problem that one waiter can believe the system is unhealthy while another waiter is completely clueless and happily sending orders through to the kitchen. We might think to ourselves, that's a little strange, and we'll talk about what we can do there. In addition, in the basic circuit breaker there is exactly one thing we do when the system is unhealthy, which is to stop future requests; no other reaction is taken. We'll talk a bit about what else we can do there. We also made the decision that the circuit breaker component itself, the waiters in our case, is the only thing that can decide whether the system is healthy or unhealthy. That's a little limiting: you might imagine that, say, a cook in the kitchen has a pretty good perspective on this, so we'll talk about pulling that information in.
And last but not least, the whole recovery system in the basic circuit breaker is very rigidly defined, and it's very focused on the success or failure of one particular order. So we'll talk about expanding that a little bit.

Let's start with the first question: should all these waiters have independent circuit breakers, or should there be a little more communication between them? We said one of the weird side effects of independence is that they can all disagree. If the waiter for section one is well aware that the system is unhealthy, they won't be sending any more requests to the kitchen, and that's a good choice for all the reasons we talked about previously. But the waiter for section two may not have had any orders recently, and the next five people who come through section two are going to get very slow, very unsuccessful results. The obvious solution is to have them communicate somehow, and one easy way to do that is to take whatever state they were maintaining, maybe the number of successful orders and the number of failed orders, and put it into a shared data store. This data store can be any number of things: MySQL, or something more key-value oriented like Cassandra or MongoDB. The goal is for them all to communicate, come to some roughly shared consensus on the state of the system, and pull from that rather than from their own individual opinions.

The new behavior we've introduced is that all of these clients, these waiters, are now communicating where before they were completely independent, and that has a lot of side effects that may or may not be intentional. An obvious good thing is that we propagate failures through the system a lot faster: rather than every section independently having to rediscover that the kitchen is unhealthy, they teach each other about it as soon as any of them knows. On the downside, this pulls in a lot of complicated questions around building a distributed data store in the first place. This is a short talk, and I have nowhere near enough time to dig into all the complexities here, but a lot of the issues around the CAP theorem, and deciding whether you want to be highly consistent or not, get pulled in as soon as you decide to share a central data store like this.
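As a concrete illustration, here is one way that shared state could look. I'm using Redis purely as a stand-in for whichever store you pick (the talk mentions MySQL, Cassandra, and MongoDB as options), and the key names, window, and threshold are invented for the sketch:

```python
import time

import redis  # illustrative choice; MySQL, Cassandra, MongoDB work too


class SharedCircuitState:
    """Success/failure counts kept in a shared store so that every
    waiter (process) sees the same picture of backend health."""

    def __init__(self, client, name, window=60, max_failure_rate=0.5):
        self.client = client                    # e.g. redis.Redis(host=...)
        self.name = name
        self.window = window                    # seconds per counter bucket
        self.max_failure_rate = max_failure_rate

    def _bucket(self, outcome):
        # One counter per time window, so old results age out naturally.
        return "cb:%s:%s:%d" % (self.name, outcome, int(time.time() // self.window))

    def record(self, success):
        key = self._bucket("ok" if success else "fail")
        self.client.incr(key)
        self.client.expire(key, self.window * 2)

    def is_healthy(self):
        ok = int(self.client.get(self._bucket("ok")) or 0)
        fail = int(self.client.get(self._bucket("fail")) or 0)
        total = ok + fail
        return total == 0 or fail / total < self.max_failure_rate
```

Every waiter records outcomes with `record()` and consults `is_healthy()` before sending an order, so the first section to see failures trips the breaker for all of them.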
All right, let's talk about the second question we proposed: given that the system becomes unhealthy, what do we do about it? Recall that in the basic circuit breaker, the only action taken is to block future requests. You might imagine that this feels a bit substandard: if I'm a customer, I come in, I sit down, I look at a menu, I take my time, I decide what I want, and then at the very last minute I'm told, oh, I'm sorry, we're totally unavailable right now. It's functional, but it's pretty frustrating for all your users. So what we're going to do is take that method we discussed for deciding whether the system is healthy, whether that comes from one waiter or from a central database of some sort, and make it public, an API for whatever other code wants to consult it. Then, instead of the only action being to block orders right as they're made, we can make some improvements, like not letting anybody else sit down at a table once the system is unhealthy. We can imagine building a lot of other features along the same vein.

So the new behavior is that we've made the information about health public, and we can build any number of features on top of it. Maybe that looks like shutting off access to a feature when the system becomes unhealthy. Maybe it even looks like automatic monitoring, where some manager is notified immediately when the system becomes unhealthy. The advantage is obviously flexibility: we can build any feature we'd like on top of this. The downside is we have to ask ourselves some hard questions about the consistency of that information, which gets back to the questions from the previous section.

One other question: we said before that waiters are in a slightly awkward position to decide whether everything is healthy, because all they can see is the requests they make and whether or not responses come back. You might imagine that something on the back end, the cooks for instance, has a really good perspective here. So we can have them implement some relatively simple function and say, you know what, we're in the best position to tell whether the system is unhealthy: if we have way too much work to do, we know we're hopelessly behind. We can have them determine this and send a signal to the circuit breaker, even though they don't own it. The behavior we've introduced, if we allow this, is that basically anything in our broader system is allowed to communicate information, an opinion, to the circuit breaker. And we have to come up with some way to combine all of that: if the waiters disagree with the kitchen about whether the system is healthy, we need to decide who wins that disagreement. On the upside, we've introduced a whole world of new signals; this is a very wide-ranging and powerful tool to have. On the downside, we've created a tool that's really powerful and complicated. I've personally written some circuit breaker logic that was really, really hard to untangle after the fact, and I've found myself in production situations where something is on fire and I have no idea why. I can't really recommend that approach. Keeping this simple enough that you can understand it, while complex enough that it actually does the job, is tricky.
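Here's a small sketch of what those two ideas, a public health API plus outside opinions, might look like together, assuming the `CircuitBreaker` from the earlier sketch. The any-vote-wins policy is just one possible way to combine disagreeing signals, and all names here are invented:

```python
class PublicHealth:
    """Health decision exposed as an API that any feature can consult,
    combining the breaker's own view with externally reported signals."""

    def __init__(self, breaker):
        self.breaker = breaker
        self.external_reports = {}  # reporter name -> claims_healthy

    def report(self, reporter, healthy):
        # e.g. the kitchen calling health.report("kitchen", False)
        # when its backlog grows hopelessly long.
        self.external_reports[reporter] = healthy

    def is_healthy(self):
        # Policy choice: any single "unhealthy" opinion wins. Deciding
        # who wins a disagreement is exactly the hard part.
        if not all(self.external_reports.values()):
            return False
        return self.breaker.state != "unhealthy"


def seat_new_diner(health):
    # A feature built on top of the public health information:
    # refuse to seat anyone while the system is unhealthy.
    if not health.is_healthy():
        return "Sorry, we can't seat anyone right now."
    return "Right this way!"
```

Monitoring fits the same shape: a periodic job that calls `is_healthy()` and pages the manager on a transition to unhealthy.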
And the fourth question we're going to ask ourselves: what are some alternatives for recovery? In the original formulation, we wait a specific timeout, maybe five seconds, and then issue a single request; if it works, great, we're healthy, and if it fails, we're unhealthy. So we're trusting that single trial request to tell us the right information.

One alternative is something called dark-launched requests. Essentially, we take live user requests and block them just as the normal circuit breaker would, but instead of forgetting about the request entirely, we actually pass it through to the back end. We run it through just like a normal request, and we pay attention to whether it succeeds or fails just like a normal request. You can imagine that this is really, really nice when it works: we have live user traffic, with all the peculiarities and features that might have, and it tells us exactly when our system is alive or in trouble. But on the downside, it's thoroughly incompatible with certain systems. If I'm running a search engine, I don't mind doing this: I'm happy to tell a user their search isn't getting processed but then actually do it. On the flip side, I work with credit card processing, and you would hope that if I told you your order hadn't been completed, I would not be silently charging your credit card in the background. There are plenty of cases with those kinds of side effects where you really don't want to tell users things are one way when they're not.

One alternative for those situations is what's called synthetic requests. As before with dark launching, we block the real user requests, but instead of secretly processing them in the background, we process something fake that we've made up ourselves. In the case of the diner, you might imagine that every, let's say, five minutes, you ask the kitchen to make you a salad. This is a fake order; no one's going to eat the salad, and it might be a little wasteful. But if those salads are getting made successfully, the kitchen is probably doing okay. The obvious downside is that there's no guarantee those synthetic requests represent real user traffic or patterns or load or demands on your system. If we're asking the kitchen to make salads every five minutes, and the grill is completely destroyed so they can't make a single hamburger, those salad requests are going to work just fine. Everything reports that your kitchen is doing very well, and the moment you send a real user request and try to make a hamburger, it breaks again. So synthetic requests are easy to do, but maybe not accurate.

These new behaviors give us the opportunity for traffic to determine health. They let us remove those recovery timeouts and not worry about tuning or creating them. But on the downside, we've talked through a couple of reasons why they're not appropriate for every use case, and we need to think a little before applying them.
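A minimal sketch of the synthetic-request idea, assuming the `CircuitBreaker` from the earlier sketch; the probe interval and the side-effect-free "salad" request are stand-ins you'd replace with something safe for your own backend:

```python
import threading
import time


def start_synthetic_probe(breaker, make_salad, interval=300):
    """Drive recovery with periodic fake requests instead of a fixed
    timeout: a success flips the breaker back to healthy, a failure
    keeps it unhealthy."""
    def probe_loop():
        while True:
            time.sleep(interval)
            try:
                breaker.call(make_salad)
            except Exception:
                pass  # blocked or failed: the breaker stays unhealthy
    thread = threading.Thread(target=probe_loop, daemon=True)
    thread.start()
    return thread
```

A dark-launch variant would instead pass real (side-effect-free) user requests through the same `breaker.call` path while still returning the blocked response to the user.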
All right, let's wrap up. If you take only two lessons away from this talk, hopefully the first is that if you have networks in your system, if you're doing microservices or Docker or any number of other technologies, you need to plan for failure. And not only do you need to plan: you shouldn't be afraid of taking the basic solution, the basic circuit breaker, if you don't have time to build anything else. It's a limited approach, but it does work, and for the reasons we talked about at the beginning, it's a lot better than doing nothing at all. That said, if you do have some extra time to devote, there are a lot of ways you can customize it for your particular situation. We talked about how many circuit breakers you should have: do we want one per set of processes, do we want them to communicate, or should they be completely independent of each other? We asked what we should do in response to the system becoming unhealthy: do we just block requests, do we trigger other features, do we prevent users from ever even seeing a feature? We wondered out loud whether the circuit breakers themselves are always in the best position to decide if the system is healthy, and whether we might need to integrate multiple sources of information and make some final judgment. And finally, we talked about alternative ways to do recovery, all of them traffic-based and some more accurate than others.

And I should say there's a lot more here. We're a little time-constrained, but there are many, many ways you can customize circuit breakers. They're more a pattern of thinking than a specified implementation, and everything comes down to your use case. Depending on what you're looking for, a lot of different forms of this approach might be the right answer for you. Hopefully this has given you some tools and ways of thinking about it that you'll find useful going forward. And with that, I think we have a little time for questions. If I don't get time to talk to you now, I'm happy to chat online afterward, or I'll be out at the Yelp booth. Thanks.

Good talk, thank you. I have a question. I've seen a lot about circuit breakers, and I saw that most of the implementations are on the API gateway side, because that gives you a central place where requests come through. I just wanted to know if you did that, and how you did it.

So you said that most of the circuit breakers you saw were... sorry? In the API gateway. Ah, yes. You certainly can build them into any part of the system. It's very common for them to live either in the clients themselves or somewhere between clients and those APIs. I'd say in all the use cases I've personally had, they've lived pretty close to the clients, but there's nothing fundamental that requires them to be there. That's certainly one of those questions in the "much more" category: an interesting thing to think about is what you get by putting it in clients, on the server side, or somewhere in between in a smart proxy. Sounds very interesting; I'd be happy to talk with you about it afterward.

Hi. I think in a production system you would probably have 10 or 20 kitchens, so this would become even more complicated. How do you compare using this kind of HTTP request, or whatever communication layer you have, with circuit breakers, against having something like RabbitMQ or some queuing system to decouple one side from the other? In my experience you remove a great part of this complexity, but maybe other complexity arises.

Sure, sure. So: if we have a system that's not strictly HTTP-based, RabbitMQ or any number of task queues, is there any point in still doing this? I can say that for my own work, what we do is not HTTP-based, because the nature of dealing with money is that you would rather a network blip not forget about your order or your credit card purchase. We're mostly queue-based, so we do a lot of work with Amazon's SQS and other technologies in that vein, and even in those cases, you still want this behavior.
That solves some sources of unreliability, but if I write bad code in the back end that silently drops every request, we still want to identify that and take corrective action, and what we do is block future requests. Better transports are better, and they solve some problems, but circuit breakers are still relevant with them.

Great talk. Say I need my waiters to synchronize over storage: what I've done is introduce another moving part, because the circuit breaker's store is itself another thing that could fail. What's the best way to deal with that? Because if that fails, I cannot even determine whether anything is healthy.

Absolutely, and that's a great observation. When I originally talked about this with some colleagues, I took it as an assumption that you would have some central data store, because why would you want these to be independent? And a colleague of mine on the operational side basically said exactly what you're describing: if your goal is for this to be a bullet-proof failure mode, maybe for the traffic that serves the entire website, any coupling to another failure point is a bad thing, and any coupling across processes might even be a bad thing. So there can be cases where you would actually accept a slower response to failures in exchange for a simpler setup. That's a very good way of putting it. Okay, I see one more question over there.

Hi, thanks for your talk. One question: how do you reconcile this with load balancing? If you've got multiple kitchens, and when one kitchen goes wrong you want to load-balance to the others, how do you make the circuit breaker interact with that?

Yeah, for sure. I think in some sense they can be decoupled. If your circuit breaker is oriented on the client side of things, you may be completely unaware of load balancing; that's something secretly implemented on the back end. In cases where load balancers are doing their job, detecting unhealthy instances and removing them, to a client that looks just like everything is going well, so the circuit breaker wouldn't notice a problem, because there isn't a problem. But it's a good backup: there are going to be plenty of cases where maybe your load balancers are overeager and remove everything from the back end, and then you're going to be really excited to still have a circuit breaker in place.

Okay, I think that was it. Thank you very much. Thank you very much.