My name is Simon and I work for Shopify, on site reliability, performance, and infrastructure. What I want to talk about today is how we built a big application with many moving parts — parts that fail — and how the entire thing still manages to stay up most of the time. We've learned a lot from this, and I want to share some of those resources with you and equip you with some of the vocabulary to reason about these things.

Shopify is a company that helps people sell things. We make commerce easy, whether people sell online, in their brick-and-mortar stores, on Pinterest, on Facebook, or really anywhere. And we're becoming a big company: a lot of money goes through us, it's a large application, and we have a lot of developers pushing code and deploying many times every day. And as we all know: more money, more problems.

When we're building these large systems, distributed systems are really the default nowadays. We're all using the cloud more and more, which means we're relying on hardware and components — routing, the network — that we can't control at all. So we have a new reality where we have to build a system out of a lot of components we don't control, and these components can fail. This is getting even more applicable with things like Docker, microservices, and the new architectures on the rise, where more and more components are introduced. That means your job is now also to tame the relationships between all these different services, and what I want to talk about today is how you make that reliable.

This has been the biggest win for my team in the two years I've been here. We now have confidence in what happens when the different components fail or become slow, we have much more awareness of what's going on, and we're able to reason about the system. This means that we sleep a lot better and we're paged a lot less.

One of the biggest parts of my job is preparing for Black Friday and Cyber Monday. This is a crazy event for us, where we have a lot more traffic coming in, and some of our stores do flash sales on top of the hour that might double our traffic or more, with hundreds of thousands of customers coming in within the same minute. Every year around this time, my team starts talking about what we need to do to prepare for Black Friday and Cyber Monday. Last year, around this time, we were seeing a lot of embarrassing failures. Things were failing left and right, and sometimes the entire system went down even when something that seemed trivial in the big picture failed. We didn't have a great overview of the relationships between all these services, so we sat down as a team of five to seven people and thought long and hard about how to tackle this.

So that's what I'm going to talk about for the rest of this talk: resiliency. Resiliency is the practice of building a system out of many unreliable components that is reliable as a whole. If a single service fails or becomes slow, it can't be allowed to compromise the availability or performance of the entire system. You need loosely coupled components that can act on their own to have a reliable large infrastructure.

And if you don't do this, your uptime will suffer under the math of the microservice equation: if you have some number of services, your availability decreases exponentially as you add more of them. Even at something like four nines per service, 10 to 100 services very quickly drops you to days of downtime per year.
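As a back-of-the-envelope sketch of that math, assuming the worst case where the whole request fails if any one of the services involved fails:

```ruby
# Worst case: a request fails if any one of its dependencies fails,
# so overall availability is the product of each service's availability.
per_service = 0.9999                 # "four nines"
overall     = per_service ** 100     # ≈ 0.990 with 100 services
downtime    = (1 - overall) * 365    # ≈ 3.6 days of downtime per year
```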
Now, this is a very rough estimate and you're very likely not in exactly this situation — it's really the worst-case scenario. But as anyone who's ever seen an exponential graph knows, as you add services you very quickly decrease your overall availability, and your system is really only as strong as your weakest single point of failure. So if you're not aware of this and you're not thinking about the overall resiliency of the system as you add services, adding components leads to decreased availability.

Many of us might think: well, we have a big monolith, it's fine. We have one big application, it doesn't have any external dependencies, we're good. But really, that's lying to ourselves. Shopify is a big, proud monolith, you might say, but we still have tons of dependencies. We have big relational data stores, we have tons of key-value stores, we're talking to payment gateways, we're talking to APIs, we're sending emails, we're talking to CRM systems. There are easily tens, if not over a hundred, external dependencies that we don't always have control over, and each one of these can compromise the entire system, or the performance of the entire system, if you're not careful.

So now let's talk about fallbacks. When you're querying one of these services, there's some data you want out of it. For example, if you browse the Netflix page, you get an overview of all your titles, and each title has a star rating. Now, if the service serving the star ratings is down, you can do two things. You can fail the entire page, which is the default in a high-level language like Ruby, because it will raise an exception trying to connect. Or you can have a secondary, reasonable behavior, which is just falling back to something like five gray stars. That's much better, because you can still browse everything and you're not compromising the entire system just because that one service is down.

Let me take an example closer to home for us. Say we have a store that sells sneakers. When we're rendering the storefront, it's made up of a lot of different services: we might have a service that does search, one that stores the sessions, a data store for carts, a CDN dependency, and the MySQL shard where the shop's data is stored — and there might be many more dependencies than this. So imagine one of these dependencies fails. All of these are somewhat orthogonal to the page as a whole. For example, if we kill the store where the sessions are stored — unless sessions are stored in a cookie — by default in Rails you will just get an HTTP 500. It will raise an exception and the customer will see something like this. But that breaks the principle from before: one service has now made the entire application 500, even though it's not that important to the storefront. Customers could still browse and so on, even though the session storage is down. So the fallback might look something like this: if the session storage goes down, we just sign the user out. They can still browse the storefront.
They can still do checkouts, they can still add things to their cart — they can do anything except associate orders with their account. This is great. The customers are happy: they can still check out. The merchants are happy, because they barely even notice this downtime. The infrastructure team is happy, because when this failure happens and we get paged, we're less stressed out — we know the application is coping with it successfully.

And like this, you can go through every single one of the dependencies and make sure the page behaves well when each of them is down. For example, let's say the cart service is down. You could just make sure that you can't add anything to the cart — that's a great first revision. But you could also do something much more sophisticated, like adding the cart items to local storage in JavaScript and persisting them once the cart persistence layer is back. You can do very, very advanced fallbacks if you need to.

So the code might look something like this. We've all written code like this, where we get a user by fetching it out of the session layer. Again, if you're using cookies this doesn't really apply, but if you're storing sessions in something like Redis, it does. The problem is that if you can't access the session layer, this code will raise an exception and give you the 500 — the first case I showed before. The more resilient code looks like this: it rescues the error and returns nil instead. And the great thing is that the template already handles this — it checks whether there's a user, and if there's no user, it just renders the signed-out state. So everything else just worked. We had to add two lines of code to make this way more resilient. And you can do the same with a lot of other things: just return an empty data structure when the data store is down. This is much more reasonable.
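To make that concrete, here is a minimal sketch of the shape of that change, assuming a Redis-backed session store — the method and variable names here are illustrative, not Shopify's actual code:

```ruby
# Illustrative only -- assumes sessions live in Redis; names are made up.
def current_user
  user_id = session_redis.get("session:#{session_token}")
  User.find_by(id: user_id) if user_id
rescue Redis::BaseConnectionError
  # The session store is unreachable: treat the visitor as signed out instead
  # of raising and 500ing the page. The template already handles a nil user.
  nil
end
```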
So now — and this is the great thing about the Ruby community — the next question is: how do we test this? The first thing you might do is use a mock. You grab your Memcached driver, mock it out, raise some errors, and simulate a problem. But this very easily ends up masking as many bugs as it uncovers, and you're writing one of these for each client: one for MySQL, one for Redis, Memcached, Elasticsearch, whatever you have. It's really hard to do this at the right layer. We tried this to start with, and it easily became a hundred lines to closely mimic the real behavior of the driver. You can take the other extreme and do it in production: kill nodes, slow them down, do something like Chaos Monkey, which I'm going to talk a bit about later. But my inner Rubyist isn't really happy with that solution either. I want to be able to test this, I want it to be reproducible, and I want CI to run all these tests so that I'm not regressing the resiliency guarantees I had two weeks ago. There didn't really exist anything that did the middle ground, and the middle ground is to emulate this at the TCP layer. If we can slow down connections and simulate failures at the TCP level, we don't have to write a failing-client mock for every single driver, we're not testing in production, and we can run this proxy in all development environments and on CI. The problem was that this proxy didn't exist, so we had to build it.

So Shopify built a proxy called Toxiproxy. It's a proxy that lets you apply "toxics" to the different connections going through it. It works at the TCP level, so instead of connecting directly to MySQL, you connect to MySQL through this proxy, and then over HTTP you can tell it to simulate different failures: now it's slow, now connections won't close, and so on. We found a ton of bugs with this. We found bugs in Rails, we found bugs in different drivers, we found bugs in our own code, and this was way superior to using mocks. Every developer now runs this, and it runs on CI as part of every test pass. We even went as far as creating a page in our admin bar where our developers can tinker with all the different connections: slow them down, kill them, and apply all these toxics to the proxies.

So now we can go back to what we talked about before, the resiliency of the session storage. We write a test that tells Toxiproxy to take the session storage down within a block — it sends an HTTP call to Toxiproxy telling it to kill that connection — and then you run your code: you get the front page, you assert that it's a success, and maybe you check for a flash message that says the session storage is currently down. This works great.
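Roughly, such a test might look like this with the Toxiproxy Ruby client — the proxy name and the assertions here are assumptions, not our actual test suite:

```ruby
require "toxiproxy"

class StorefrontResiliencyTest < ActionDispatch::IntegrationTest
  test "storefront still renders when the session store is down" do
    # :redis_sessions is illustrative; it has to match a proxy configured in
    # Toxiproxy that fronts the session store.
    Toxiproxy[:redis_sessions].down do
      get "/"
      assert_response :success
      # Optionally assert on a flash/notice saying sign-in is temporarily unavailable.
    end
  end
end
```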
But now that we've talked about fallbacks — the plan B for each of these pieces of functionality — and we've talked about tests, how do we know whether we're resilient as a whole? There are many, many dependencies between the different sections of your application and all your components, so how do we get an overview? What we did was come up with a resiliency matrix. Along one axis we have the different sections of our application — checkout, the storefront, the administration panel — and down the left we list the components: MySQL, our caching tier, our search tier, the message queue, external gateways, and so on. Red indicates that when the component on the left is down, the section on the other axis is unavailable or degraded. We started filling this out with Toxiproxy and we were shocked at what it looked like. We hadn't really paid much attention to this — we had just grown this app over ten years — and it looked like somewhat of a nightmare. So we sat down and wrote a test for each one of these cells. Some of them were still 500s, some were successes, and then the mission of the team became to flip all of these 500s to successes.

As we went through this, the problems weren't only in our code. We found a couple of bugs in Rails, and this is one of them: there's a middleware responsible for the query cache, and when it runs it grabs a connection — and if there's no connection, it tries to establish one. This means that even a page that doesn't access your database still fails, because the middleware doesn't know whether the page is going to access the database later on. So if you have an app that requires Active Record and a page that doesn't touch MySQL, then when MySQL is down you can't access that page. This is still a bug on master; we monkey-patched it, and I'm a terrible, terrible person for not submitting it upstream yet. But there are a couple of things that need to be addressed in Rails first — for example, there needs to be a generic "can't connect" exception shared by all the drivers before this can be fixed at the right layer. The problem is with the line of code that tries to establish a connection if one isn't already there.

So let's take another example. We have a product, the product has tags, and for whatever reason these tags are stored in a Redis instance. Looking at the session code from before, you can start to connect the dots on how to make this resilient: just as before, we rescue the Redis error and return an empty array. The template copes with this just fine — it iterates over an empty array, doesn't show tags, customers are happy. But as your code base grows, this starts to gather all over, and very quickly you leak the abstraction of the resiliency layer all over the code, sprinkling these rescues all around. So what we started doing instead was building decorators: for the data structures we were storing in secondary data stores, we provided resilient abstractions around them. We only needed a handful of these, which means that 80% of the time our developers don't have to care about the resiliency layer — it's already done for them — and only when you introduce a new dependency do you have to add another one of these.
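As an illustration of what one of those decorators might look like — a minimal sketch assuming tags live in a Redis set; this is not Shopify's actual abstraction:

```ruby
# Illustrative only: a Redis-backed collection that degrades instead of raising.
class ResilientRedisSet
  def initialize(redis, key)
    @redis = redis
    @key   = key
  end

  # Reads fall back to an empty collection when Redis is unreachable,
  # so templates can keep iterating as if there were simply no tags.
  def members
    @redis.smembers(@key)
  rescue Redis::BaseConnectionError
    []
  end

  # Writes report whether they actually happened instead of blowing up.
  def add(value)
    @redis.sadd(@key, value)
    true
  rescue Redis::BaseConnectionError
    false
  end
end

# e.g. product tags could be exposed as
#   ResilientRedisSet.new(redis, "product:#{product.id}:tags").members
```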
So that's the first part: fallbacks, and building resilient code. But really, the majority of my team's time wasn't spent on fallbacks or on writing tests — it was spent dealing with slow components. This was a lot, lot harder than anything else I've talked about here, because partitions — network partitions, slowness, latency — take very interesting forms. Sometimes the entire thing is slow because there's an I/O problem on the data node; other times only some of your machines are experiencing failure — maybe Amazon is having a problem with one of its core routers and you happen to have a couple of services behind it; you don't know.

We can illustrate this with a very, very simple thing called Little's law, from queueing theory, which Nadia talked about earlier. Really, all it says is that for a fixed amount of capacity, if your response time goes up, your throughput goes down: the number of requests in flight equals throughput times response time, so a worker that normally serves a 20-millisecond request fifty times a second serves a 500-millisecond request only twice a second. We can illustrate this with an example. We have a web server — in this case a Unicorn worker — and it can only serve one request at a time. Now one of our data stores is slow: it's not responding, but it's not rejecting connections right away either. Someone hits /impacted and waits out the half-second timeout to the data store. While we're waiting for that timeout, another request comes in for /ok, which takes 20 milliseconds — the response time we've done all our capacity planning around. That person is now just waiting behind the impacted one, and then there's another one, and another one, and very, very quickly all of your workers start to back up with requests, both slow and fast. So again we're breaking the principle from the start: we're allowing one slow component to drag down all the other components and all the other interactions, even though we have a timeout in place.

And we can illustrate another problem with timeouts with this example. We have a simple architecture: a load balancer distributing load between a couple of services, which talk to some data stores in the back end and an external API. They're very fast and everything is great. But now consider a data store that we access extremely frequently — this could be something like Redis, where a call takes about 0.2 to 2 milliseconds. That seems innocent enough, but it also means you can't set a timeout of 2 milliseconds — that's crazy low; it doesn't even account for things like network-level retransmits, which take hundreds of milliseconds — so your timeout will be far above the normal latency. And a tenfold increase in latency on a data store that's accessed from everywhere results in the throughput dropping by an order of magnitude; this follows from the Little's law I showed before.

So timeouts are really not good enough once you're at scale. Your response time suffers, you suffer under Little's law, and setting the timeouts right is extremely problematic: they're either too high, and you get the Unicorn problem with all the backed-up requests, or they're way too low and don't account for the outliers — you might have customers with a lot more data than your other customers whose requests take a couple of milliseconds more, and you don't want to completely cut them off just to solve this problem. So timeouts are not good enough; we need ways to fail faster than simple timeouts.

There's a great book called Release It! by Michael Nygard, and it describes, among other things, some heuristics for failing fast and building really, really good software. I definitely recommend reading it if you find this interesting. It describes at least two heuristics for failing faster than timeouts: circuit breakers and bulkheads. These are two different patterns that complement each other pretty well.

A circuit breaker is essentially a mechanism built on the heuristic that if the last few requests to a data store or service have failed, it's very likely that requests in the near future will also fail, at least for some amount of time. So every time you make a call to a data store or another service, you first ask: is the circuit open? If the circuit is open, you instantly raise an exception. If not, you go ahead with the request to the back end. Did that request fail? If it failed, you mark a failure, and if enough failures have happened within some sliding window, you open the circuit and all future requests fail fast. If the request didn't fail, you mark a success, and if enough successes have happened within the time frame, the circuit closes again; then you finish the call and return to the client. The way the circuit closes again is that after some timeout — say 20 seconds — you allow a single request to go through. If that request succeeds, you take that as a sign that things look good now, and you close the circuit again. This is a really simple heuristic that helps a lot when your components are already failing, because you give them a little bit of room to recover — if you ship an unindexed query to production, for example, your databases are extremely overloaded, and maybe your DBAs can't even SSH in or recover the problem in some other way. So this is great: it fails fast after several timeouts, and the heuristic is pretty easy to understand in most cases. The problem, though, is that if your timeout is high — say something like 20 or 30 seconds — this doesn't help you much at the start, because it takes something like three or four requests, adding up to about a minute, for all these circuits to trigger, and during that time you're completely unavailable. So you have one or two minutes where you're down if your timeouts are high enough, or even more.
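To make the mechanism concrete, here is a toy, single-process sketch of that heuristic — the thresholds and names are arbitrary, and this is not how Semian or Hystrix are actually implemented:

```ruby
class CircuitOpenError < StandardError; end

# Toy circuit breaker: open after too many recent failures, fail fast while
# open, and let a single probe request through after a cool-off period.
class CircuitBreaker
  def initialize(error_threshold: 3, error_window: 10, half_open_after: 20)
    @error_threshold = error_threshold  # failures inside the window that open the circuit
    @error_window    = error_window     # sliding window, in seconds
    @half_open_after = half_open_after  # seconds to wait before allowing a probe
    @failures        = []
    @opened_at       = nil
  end

  def acquire
    raise CircuitOpenError if open?
    result = yield
    @failures.clear        # a success closes the circuit again
    @opened_at = nil
    result
  rescue CircuitOpenError
    raise
  rescue => error
    record_failure
    raise error
  end

  private

  def open?
    @opened_at && (Time.now - @opened_at) < @half_open_after
  end

  def record_failure
    now = Time.now
    @failures << now
    @failures.reject! { |t| now - t > @error_window }
    # Open the circuit on too many recent failures, or re-open it if a
    # half-open probe request just failed.
    @opened_at = now if @opened_at || @failures.size >= @error_threshold
  end
end

# Usage sketch:
#   REDIS_BREAKER = CircuitBreaker.new
#   REDIS_BREAKER.acquire { redis.get(key) }   # raises CircuitOpenError while open
```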
So we need a better way to reason about how many requests are hitting a data store or a service at once, and this is where bulkheads come in. With bulkheads, every time you make a request to a service or a back-end data store, you have to acquire a ticket. A request comes into one of our application workers, it grabs a ticket, and it queries the data store. Now, at this point the data store has become slow and isn't responding, so the client is waiting for its timeout. The next request comes in, another ticket is taken, and that request hangs as well. But when the third request comes into the third worker, it's rejected right away at the ticket layer — it can't acquire a ticket, so it fails immediately — and that frees the worker up to do useful work: it can serve requests that talk to the other back ends. This also solves the problem from before with the Unicorn and all those requests queueing up behind a slow dependency, because you're freeing the worker up to work on useful things, and you can reason about how many workers are querying a given data store at once.

You can really think of bulkheads as being like a thread pool, and many of you might already be using a thread pool without having realized it has these benefits. The point is that it ensures you know exactly how many things are querying a back-end service at once, and it fails faster than circuit breakers when the timeouts are high, because in the case I showed before, without the tickets, every worker would have to trip its own circuit breaker — or you'd need a global circuit breaker, which has a lot of problems of its own, like synchronizing state, which you then have to be resilient to as well. So these three patterns — timeouts, circuit breakers, and bulkheads — all complement each other really well; they stack on top of each other and make your application even more resilient to these failures.

Now, you might ask yourself: this sounds pretty advanced, do I really need it? And you probably don't. If you haven't seen these problems yet, you might not need this — but now you're equipped to recognize them in production, and you have the solution, and it's really, really easy to use. These problems didn't surface for us for many years, but if you are seeing them in production, I would definitely look into it. We implemented a library for this, which we called Semian. Netflix implemented their own library, which they call Hystrix, and they've written some extensive documentation on resiliency. Twitter has a library as well. If you have a lot of services, I definitely recommend looking into these libraries — they have great READMEs and great wikis — and they're all toolkits for failing faster than timeouts.

Netflix also built this thing called the Simian Army, which is a collection of monkey scripts that kill your servers in production, slow them down artificially, and in some cases kill entire regions at once. This is really useful for testing in production whether all these resiliency patterns you've applied actually work, whether your fallbacks work, and so on. And in the end, you start climbing this resiliency ladder of maturity.
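Here is a toy, single-process sketch of the ticket idea. A real implementation such as Semian shares tickets across worker processes, so treat this only as an illustration of the shape of the pattern:

```ruby
# Toy bulkhead: at most `tickets` callers may talk to a backend at once;
# everyone else fails immediately instead of queueing behind a slow dependency.
class Bulkhead
  class NoTicketError < StandardError; end

  def initialize(tickets)
    @tickets = tickets
    @in_use  = 0
    @lock    = Mutex.new
  end

  def acquire
    raise NoTicketError unless try_take
    begin
      yield
    ensure
      release
    end
  end

  private

  def try_take
    @lock.synchronize do
      return false if @in_use >= @tickets
      @in_use += 1
      true
    end
  end

  def release
    @lock.synchronize { @in_use -= 1 }
  end
end

# Usage sketch:
#   CART_BULKHEAD = Bulkhead.new(3)
#   CART_BULKHEAD.acquire { cart_store.get(cart_id) }  # raises NoTicketError when saturated
```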
As your application grows, you start somewhere at the bottom, where you haven't really done anything, and you slowly mature your application along with the product. You start adding Toxiproxy tests, you build out a matrix of the different dependencies, you provide application-specific fallbacks like the session one I showed before, or carts, and you start looking at the resiliency patterns for slow back ends. You might start doing practice days where you kill nodes yourself and check how the entire system copes. I know that Google has something like an RPG game where they sit down with a game master who says: imagine these services failed or became slow — what would happen? And they sit in a room and argue about it; there are a lot of ways to make this fun. Once you're confident enough with that, you start adopting some of these scripts in production and let them kill nodes and slow them down artificially, even at 3 a.m., when you're really confident. And at some point you get to the point where you can kill entire data centers and you're fine.

So, final remarks. I would recommend that anyone sit down and draw a resiliency matrix for their application. It doesn't take more than an afternoon, it's really, really simple to write tests for with Toxiproxy, and it gives you a lot more confidence in what your application and all its dependencies and failure patterns are. Not everyone needs circuit breakers, but I certainly think everyone can benefit from looking at fallbacks and from drawing out their resiliency matrix. And be really careful when you're introducing new dependencies and new services, because very, very easily you end up building the same monolith you had before, just with a ton of services and really rusty, bad pipes in between. Be careful when you're introducing new services, because it might actually decrease your availability in the long run. We wrote a lot more documentation on what we've learned about resiliency in the README of Semian, which is the library for failing fast in Ruby, we've written a lot of documentation for Toxiproxy, and I wrote a blog post as well that gives an overview like this talk. Thank you.