Hi, everyone. Thanks for coming. I'm not going to talk about cryptography today. The talk I want to give today is called How to Stay Alive Even When Others Go Down, and I want to talk about how to write, and more importantly test, resilient applications. Originally the title was Resiliency and Ruby, but after I made the start I realized I didn't really say that much specific to Ruby, so I kept it a little bit more general.

So the goal of this talk is that you should know exactly how your application behaves. If something goes down, you should be able to know exactly what will happen, without freaking out, without any unknowns. And the main takeaway here is that you want to gain this confidence as early as possible, during testing, and not just in production, because then it's usually too late.

For background, I work at Shopify on one of the infrastructure teams we have. I'm not going to talk too much about anything specific to Shopify; I'm just going to use it for a couple of examples to illustrate what I'm talking about. This talk is not about Shopify specifically.

So when I say How to Stay Alive Even When Others Go Down, let's talk about who the others are. What do I mean when I say others? This can really be anything that's not the application that you're looking at right now. If you have a monolithic application, it's every other application besides that. If you have a service-oriented architecture, you're going to look at every single service, and from that point of view, others just means all the other services you have. It could be a third-party service, maybe S3 or some API you're integrating with, or maybe it's a part of your core infrastructure that's not actually the application. Any of those could be the others here.

At Shopify, we still mostly run a monolithic Rails application. It's a huge Rails application; there's a couple of smaller things around, but it's mostly one application. And if you're in the same situation, you might say this talk doesn't really apply to me, why do I even care? So for the purpose of this talk, I want to redefine what most people think of as a service. When I say service, I just mean any kind of logical unit of your application. That's not necessarily a different process; it might be part of the same application. Maybe it's just an abstraction, maybe it's a class, and maybe this class has a backend that's some data store, or maybe it's doing some network calls, anything like that. It doesn't necessarily have to be a separate service in the microservice-architecture sense.

A couple of services, in my definition, that you might have without realizing it: the database, a caching server, maybe a session store or an authentication system, some place where you store your assets, maybe a search system, so maybe that's Elasticsearch, or maybe a background queue system. All these kinds of things, everything that's not part of the core application, I would consider a service.

So the goal of this talk is to cover a couple of design patterns and best practices and things that we learned at Shopify that made a huge difference to the stability of our platform. And I hope that some of those patterns are going to be useful to you as well.
They were super important for us, because about a year ago we had a time where we had a lot of outages, and we had a lot of pain from just bad design and from things being coupled together that shouldn't be coupled together. Then we had this huge cleanup initiative, and we stumbled over those patterns and tried them all out. So today I want to talk about what they mean and how you can use them yourself. The one in green here, timeouts, is going to be the easiest one for you. If you don't remember any of the other stuff I'm talking about, this is the one you will remember, because it's super easy to do and it's a super easy win. And testing is in red, because that's the one I think is the most valuable on a long-term basis, I would say.

But to begin, let's start with timeouts. Most of you probably know what I mean when I say timeouts, but to understand why this is important, let's look at a couple of performance metrics that are usually super important for a web application. If you do run a web application, you've heard the term capacity. At Shopify, we have this website where every developer can deploy code to the production servers, and before you can do it, there are those two graphs here, and there's a little checkbox that says: please look at the graphs before you actually press the deploy button, and if anything changes in those two graphs after the deploy, then you should say something. Those two graphs are, not the most important, but the two metrics that you look at first when you ship new code, and they are response time and throughput. Response time just means how long it takes to handle one request, and throughput means how many requests we can handle in a given time.

Many of you are probably running an architecture that's similar to the NGINX and Unicorn setup. To understand why timeouts are important, maybe this is not obvious to everyone, but look at this bunch of workers at the bottom. How Unicorn works is, when you start the process, it forks a couple of workers, and the important thing is that the number of workers is not a function of the system load or the number of requests; it stays fixed the whole time. So if you experience higher load, there won't be any new workers, the number of workers is always the same, which means that when all of those workers are occupied, you can't serve any requests. You're gonna start queuing them, and eventually you will be unable to handle more requests.

So if we look at this picture again: if you work in operations, or if you have been around during an outage, you might have seen a picture that looks more like this. And this illustrates very nicely that those two graphs have a very important relationship to each other. You can see on the right that the requests took longer to handle, and that led to what you see on the left, which is that the throughput goes down. Maybe that's not obvious to everyone, but the takeaway is that those two are related, and making requests slower, for whatever reason, will degrade the throughput you can handle. When we talk about capacity, it's hard to define exactly what capacity means, but for now let's just say it's the maximum throughput that you can handle given a reasonable response time.

What happened in this picture here is that somebody shipped a maintenance task that did some data migration on one of our Redis servers, and they were accidentally using a Redis command that has very high complexity.
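To make that worker/throughput relationship concrete before the Redis story continues, here's some back-of-the-envelope math (the numbers are invented for illustration):

```ruby
# Throughput is roughly bounded by workers / response_time.
workers = 50

healthy_response_time = 0.1            # 100 ms per request
puts workers / healthy_response_time   # => 500.0 requests/second

# If every request instead hangs until a 30 second timeout,
# those same 50 workers barely serve anything:
degraded_response_time = 30.0
puts workers / degraded_response_time  # => ~1.67 requests/second
```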
So if there's a lot of keys in your Redis instance, there's a couple of commands that you should never run; KEYS is one of them, and there's a few others. What happened here is that the Redis server became unresponsive, and every web request that talked to this Redis instance would hang until it ran into a timeout after something like 30 seconds, and this leads to degraded capacity.

Even though this is somewhat specific to the Unicorn worker model that I talked about, in some sense it also applies to any other web server, or to, for example, a background queue system as well. In the end, you are always limited by the number of workers you have or the number of system resources you have. Even if you do something multi-threaded, you still have this upper bound.

So this is why it's super important to have timeouts. The idea of timeouts is to fail fast, and failing can mean different things. The best thing that can happen to you in an outage is that the connection is refused immediately and nothing works. The worst thing that can happen is that it kind of works, but after 30 seconds it breaks. So maybe it times out after five seconds, or maybe you get a connection but after 30 seconds you realize there's no data coming in, or maybe the server returns an error, something like that. There's different ways a service can fail, and connection refused is the best one you can get, because the other ones can do a lot of damage.

If we look at a Ruby application, this is a bunch of Ruby gems that we use at Shopify, and most of them don't even have a reasonable default timeout. So if you don't specifically set those, then you might have the same problem. For example, the Unicorn web server by default allows a request to take up to 60 seconds, which means that for 60 seconds this worker is blocked and can't handle any new requests. Net::HTTP as well, 60 seconds. The MySQL and the Redis gems, for example, don't have a default timeout at all, which on a Linux system usually means something like the kernel is gonna kill the connection after two minutes. For something like Redis, that's even worse, because Redis is a single-threaded server, and if there is a request that takes two minutes, then no other request can be handled.

So what you wanna do to apply this first pattern is instrument all of your code, all of your clients, for example with StatsD, or maybe you're using New Relic, but you wanna get some insight into what your baseline is, and then go as low with the timeout as you can afford to. If your baseline is like five seconds, then there's no reason why you are allowing 60 seconds, because those outliers can totally ruin your usability and your capacity. Sometimes there's a legitimate reason why something takes a long time, so maybe there's an HTTP request that actually does usually take 60 seconds, but if you have that, it's probably a good idea to move it into a background job and not do it inline.
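Setting these timeouts explicitly is usually a one-liner per client. A sketch, assuming recent-ish versions of the net/http and redis gems (the hostnames and values are made up; pick values based on your own instrumented baseline):

```ruby
require "net/http"
require "redis"

# Net::HTTP: don't rely on the 60 second defaults.
http = Net::HTTP.new("api.example.com", 443)
http.use_ssl      = true
http.open_timeout = 1   # seconds to wait for the TCP connection
http.read_timeout = 2   # seconds to wait for a response

# redis-rb: without these you're at the mercy of the kernel's TCP
# defaults, which can be on the order of minutes.
redis = Redis.new(
  host:            "cache.example.com",
  connect_timeout: 0.2,
  timeout:         0.5
)
```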
Background processing has the same kind of problem, by the way: if you're using something like Resque or Sidekiq or whatever, you have the same constraint that you have a fixed number of workers, and if those are occupied, you're not doing any work. But that's usually not as bad for capacity, because it's not something that the browser or the customer sees. If the background system is slowed down, that's not as big a deal as the web servers being slowed down.

A couple of real-world examples that we saw: the slow Redis command that I showed, a MySQL query that's missing an index, somebody doing an expensive join on a big table, something like that. Or, something that we see a lot, and for us this is mostly a problem with JVM applications: we use Elasticsearch, and it has this weird behavior where if you give Elasticsearch too much RAM, or if the size of the heap memory is too big, it's gonna do a lot of garbage collection, and often this garbage collection cycle stops the world. While the garbage collection is happening, it doesn't handle any requests, and if you don't have enough nodes in your cluster and they're all doing garbage collection at the same time, this can make the service unresponsive. So this is not only about services failing: you should think of a service being slow as a kind of failure mode, because if Elasticsearch or any service just crashes and doesn't accept any connections, that's not as bad as it being super slow.

Something else that we saw is problems with the network. Maybe, for some reason, there's some packet loss or a lot of TCP retransmits. At the beginning, we had a couple of Hadoop data nodes on the same switch as our production servers, and they would saturate the network link and cause a lot of packet loss, this kind of stuff. This is all stuff that has caused a service disruption at Shopify that could have been prevented if we had been using stricter timeouts.

Okay, so the next pattern I wanna talk about is something called a circuit breaker. The main idea of a circuit breaker is: if you're talking to an external service and you're getting failures, then if you talk to it again right away after a failure, it's very likely that it's gonna fail again. So the idea is to give the service a little bit of time to heal, to recover, by basically leaving it alone for a little while. The way this is implemented is you keep a count of how many errors you're seeing in a certain time window, and if this error count reaches a threshold, you're gonna break the circuit. The naming comes from an electrical circuit, where you have fuses that break the circuit to prevent the wires from overheating and burning down the house. So that's the analogy. And you would have a separate circuit state for each service that you're using, so that the errors of one service don't break the circuit of another service; each of them only has its own state.

So just briefly, what does this look like? If you wanna implement this yourself, it's pretty easy, and you should try it, just as a learning experience. There's basically three states: closed, which is the default, everything-is-fine state; open; and half-open.
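As a rough sketch of those three states, here's a toy, single-threaded circuit breaker (not Shopify's actual implementation; a real one also needs thread safety, a sliding error window, and often several successes before fully closing). The transitions are explained in the next paragraph:

```ruby
class CircuitBreaker
  class CircuitOpenError < StandardError; end

  def initialize(error_threshold: 3, recovery_time: 10)
    @error_threshold = error_threshold
    @recovery_time   = recovery_time
    @state           = :closed
    @errors          = 0
    @opened_at       = nil
  end

  def request
    if @state == :open
      if Time.now - @opened_at >= @recovery_time
        @state = :half_open          # quiet long enough: allow one probe through
      else
        raise CircuitOpenError       # fail fast, don't touch the service at all
      end
    end

    begin
      result = yield                 # the actual call to the service
    rescue => e
      record_error
      raise e
    end

    @errors = 0                      # success: reset and close the circuit
    @state  = :closed
    result
  end

  private

  def record_error
    @errors += 1
    if @state == :half_open || @errors >= @error_threshold
      @state     = :open             # a single error in half-open reopens it
      @opened_at = Time.now
    end
  end
end
```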
Closed basically means you're just doing your requests; if everything works, that's fine, and you're gonna stay in the closed state. If you see errors, you're gonna transition to the open state, and in this state, if the code tries to make another call, you immediately raise an exception. You don't even try to talk to the actual service. This whole pattern is something that you would usually implement as a patch to the driver; for example, if it's a database, you would implement this as a patch to the database driver. So to the application itself, this is invisible: it doesn't actually know that it's not talking to the service when it's getting this error. And then you keep a timer, basically, that says: if I've been in this open state for a couple of seconds, or whatever the value is, then I'm gonna transition into this half-open state. And half-open basically just means it behaves like closed, but if there's a single error, we're gonna go back to open immediately, so it doesn't have to go through the whole error threshold again.

As I said, as an implementation tip, implement this at the driver level. When we started experimenting with this, we sprinkled this circuit breaker code all over the code base, which worked as well, but it was very hard to maintain, and if you have a lot of developers working on the same code base, this can be a bit of a problem. You don't want all of your developers to have to know about this; it's better to push it down the stack and make it a little bit invisible. One of the benefits you get from that is that if you happen to forget a certain code path, it's still gonna benefit from the circuit breaker, just because it's happening at such a low level. And if you have, for example, two different services that both use Redis as a backend, then, because you implement this at the driver level, they share the same state, which means that if one of the services is seeing errors, the open circuit protects the other service as well.

So, as I said, you would basically define a new type of exception; we call it a circuit open exception. The application code itself doesn't actually know anything about circuit breakers, which means you don't have to change all the other code. I'm gonna come back to this later with an example to make it more clear.

Another pattern that's closely related to circuit breakers is the idea of failing gracefully. When we had this series of outages a while ago, we tried very hard to fix all the problems and fix all the root causes and make sure that we don't go down. But eventually you have to realize that there's always something that's gonna happen. You will go down, and you should spend an equal amount of time preparing for that and making sure you know what will happen when you go down, because you just can't fix all the problems. So the idea is to fail gracefully.

Something that I often heard when I started working with Rails is: don't be defensive, don't rescue all those exceptions; if this happens, we have bigger problems, just don't bother, just code for the happy path and don't rescue any errors. Which is a fair opinion in terms of developer productivity, but I would say in some cases it's bad advice. What I wanna illustrate is that sometimes it can make a lot of sense to be defensive, and to show what I mean, let's look at this example.
So this is a Shopify storefront. Shopify is an e-commerce platform as a service: we allow people to build their own online stores. And if you look at what kind of services are actually involved in this page, you have a couple of things. We have a couple of Redis instances here; we have carts, we have sessions, we have inventory, we have caching. If you build all of this in a naive way, then an error in every single one of those components would break the entire page and just serve an error to the browser. But if you actually think about what happens here, there's often a lot of reasonable fallbacks that you can do instead of serving an error.

Look at the cart service at the top: if I have something in my cart, it's gonna tell me how much it's worth. So if the cart service happens to be down, instead of breaking the page, I could just not show this box at all, or just pretend it's $0. That's a little bit inconvenient if people can't put stuff in their cart, but it's better than breaking the page. And if you have this fallback, then at least people can still look at the products and browse around. So the idea is to degrade the user experience, which is always better than breaking it completely.

And you might say that for some services you have, there's no reasonable fallback, it doesn't make sense to do anything in the error case, you just have to make sure that it doesn't error. But if you think about it for a little bit, you will actually be surprised by how many compromises you can make that will keep you alive. A couple of examples that we ran through: we use Elasticsearch for product searches on the storefront, so if the search system happens to be down, we just return an empty result set, which is not the best user experience, but it's better than breaking the entire storefront. If the session store is down, you can still do a guest checkout, which is also less convenient, but still better than being down. Or maybe you have a recommendation system: if the personalized recommendation service is down, instead of showing personalized recommendations, maybe you could still show generic recommendations.

Something that we use for a couple of things is distributed locks. Think of it as a service that makes sure that no two processes run the same code path at once. If this lock is unreachable for some reason, instead of breaking the code path that is trying to acquire the lock, you can just pretend that somebody else has it. And there's lots of other examples. Maybe you do A/B testing; that's something that we did. A/B testing is basically a data analytics thing where you divide, in our case, shops into two groups, a test group and a control group. Previously, if we couldn't determine which group a shop was in because the service that tells you was down, we would break the entire page. But it actually makes more sense to just assume one of the groups. So there's all these kinds of graceful fallbacks that you can do that are a degraded user experience, but much better than errors.

And same as with circuit breakers: if you implement this, often you would implement those two things together, because they're very related.
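To make one of those fallbacks concrete, here's a hypothetical sketch of the search example (the class and error names are invented, not Shopify's code):

```ruby
class SearchUnavailableError < StandardError; end

class ProductSearch
  def initialize(client)
    @client = client   # e.g. an Elasticsearch client behind timeouts and a breaker
  end

  # Returns matching products, or an empty result set if the search
  # backend is down -- the storefront renders either way.
  def search(query)
    @client.search(query)
  rescue SearchUnavailableError
    []
  end
end
```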
So same here: try to push it as deep down the stack as you can. That's sometimes tricky, because the fallback for a certain service that makes sense in one area of your application might not make sense in another area, so you might want different fallbacks for the same service. But in our experience, if you write very good abstractions, this is not really an issue, and it's a nightmare to maintain this at a higher level.

Something that you need to watch out for here is monitoring and alerting, because if you have those fallbacks in place everywhere, you might not know when you have a problem. A lot of people probably use something like Airbrake or some kind of error reporting system, where on each exception that happens in your code you notify a chat channel or send an email or something. If you are rescuing all those errors and you have fallbacks and something actually goes down, you're not seeing any errors. So even though you have good fallbacks for everything, you still need good monitoring and good alerting to know that something is wrong, even though you're not serving errors to the customer. That's very important.

So there's one piece of Ruby code that I have; everything else is not really specific to Ruby. But this illustrates my point about a couple of things (there's a sketch of it below). We have this data store, and as I said, we implement this circuit breaker, and now we have a circuit open error, which is a subclass of whatever the client's base error is. Then you have a service; for example, we have a shopping cart here. And if there is an error with the data store, the shopping cart rescues the base error, which also catches the circuit breaker error. The important point I wanna make here is that it's very important to have good abstractions: as you can see here, the shopping cart itself doesn't know anything about circuit breakers, that's an implementation detail of the client, and the Rails controller doesn't know anything about the data store, that's also an implementation detail. It's very valuable to not couple those things together and to come up with proper abstractions. It makes it a lot easier to apply all those resiliency patterns.

The next one I wanna talk about is called bulkheading. The name comes from a boat, where you have those chambers in the hull to make sure that if one of them fills up with water, it doesn't spill over into the other ones and sink the whole boat. It isolates the failure of one area and makes sure it doesn't lead to cascading failures in other areas. So the main idea is to isolate failures. This is a little bit more of an operational concern, not so much a software design pattern, and it's a little bit vague, but there's a lot of examples of what you can do here.

Something that we got a lot of value out of is limiting concurrent access to a shared resource. Let's say you have a monolithic Rails application, and maybe you have one database server and 50 app servers, and you don't actually want to allow all those 50 app servers to talk to the database at the exact same time. In most Rails applications that's totally unnecessary anyway, because most of the time is usually not spent in the database. So what we did is we implemented a system based on semaphores, which is this classic concurrency data structure.
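Before going into the semaphores, here's a rough reconstruction of the layering described above (the names are approximate, not the actual Shopify code):

```ruby
class DataStore
  class BaseError        < StandardError; end
  class CircuitOpenError < BaseError; end   # raised by the driver-level breaker

  def fetch(key)
    # ... talks to the backend; timeouts and the circuit breaker live
    # down here, invisible to callers ...
  end
end

# The service rescues the client's *base* error, so circuit-open errors
# are handled for free -- the cart knows nothing about circuit breakers.
class ShoppingCart
  def initialize(store)
    @store = store
  end

  def items
    @store.fetch("cart")
  rescue DataStore::BaseError
    []   # graceful fallback: pretend the cart is empty
  end
end

# And the controller only knows about ShoppingCart, not the data store.
```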
The semaphore system is basically like a ticketing system. You would say: we have 50 app servers, but we only have 20 database tickets, and if you want to talk to the database, you go get a ticket, talk to the database, and when you're done, you bring back the ticket. And if there's no tickets available, you have to wait for someone else to bring back their ticket. You're doing this all at a very low level, so the application code doesn't see any of it. What this does is: if there happens to be a super slow query that's blocking the database, and all the requests are taking longer, and now there's 25 app servers stuck on the database waiting for it to return, it makes sure that the other 25 app servers are not hammering the database even more. So this is about making sure that a problem in one area, a slow query in this case, is not gonna make things even worse in another area. And the idea is similar to circuit breakers in the sense that you wanna give the service time to heal.

There's lots of other examples here. Something that we often see is people having different applications but only one database server, and this one machine runs several databases, one for each of those applications. The problem with that, of course, is that if one of the applications overloads this machine, that's gonna affect all the other ones. This is something you wanna avoid. The same is true for MySQL or Redis. Redis, for example, has this concept of a logical database, which is basically a different key space, but it's still the same process. And this is bad because, as I said earlier, Redis is a single-threaded architecture, so even though they're different logical databases, something that causes problems in one of those logical databases is not isolated from the other ones. You wanna make sure that those things don't impact each other; otherwise, a problem in one area of the site is gonna trickle into all those other areas that you didn't even know were related.

One interesting thing that we found out recently: there's one area of our site that we do load testing in, and if there are errors in that area, it doesn't actually affect any customers, but it does enqueue a lot of error reporting jobs. We were hitting this one area a lot, enqueuing a lot of error reporting jobs, and this actually overloaded our entire queue. So this is one example where problems in one area can kill another area, and you really wanna watch out for those kinds of things.

Now, this can all be very overwhelming, and you don't know where to start. When we first started looking into this, there were so many things that were eye-opening, and it seemed like everything was broken and we needed to apply those patterns everywhere, and Shopify is a pretty big code base, so we were kind of lost; we didn't know where to start. What we found valuable is to look at the hot paths: look at everything that has the most traffic, because that's also the area that can do the most damage if there's a problem, since that problem is gonna be hit by a lot of requests. And every application has this one area where, if it's down, you're gonna lose money by the minute, so that's also something you wanna focus on.

Something that's often forgotten is deploys. We had to spend quite some time to make sure that we can always deploy.
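Going back to the ticketing idea for a second, here's a toy, in-process sketch of it. Shopify's real implementation (Semian, linked at the end) uses SysV semaphores so the tickets are shared across worker processes; this version only works within one process:

```ruby
class ResourceBusyError < StandardError; end

class TicketPool
  def initialize(count)
    @tickets = Queue.new
    count.times { @tickets << :ticket }
  end

  # Run the block only if a ticket is free; otherwise fail fast instead
  # of piling more load onto an already-struggling backend.
  def with_ticket
    begin
      ticket = @tickets.pop(true)   # non-blocking pop; raises if empty
    rescue ThreadError
      raise ResourceBusyError, "all tickets in use"
    end
    begin
      yield
    ensure
      @tickets << ticket            # always return the ticket
    end
  end
end

# 20 tickets guarding a database shared by many more workers:
DB_TICKETS = TicketPool.new(20)
DB_TICKETS.with_ticket { puts "talking to the database" }
```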
Back to deploys: it's very bad if you ship some code, and this code breaks something, and you can't revert it because deploys aren't working anymore. So we spent quite some effort to make sure that we can deploy the main Shopify application even if all the other services, even if the database, are down. And that's not super easy. In a Rails application, it's very easy to break this, to accidentally add something, for example in a Rails initializer, that talks to the database. So you wanna make sure that nothing in your initialization process depends on any of the other services.

And one of the most important pieces of advice: don't try to do it all at once, but try to come up with a step-by-step plan for how to do it. Something that we found valuable for that is a thing we call a resiliency matrix. It's basically just a table: the columns are the areas of your site, and the rows are the services, or the data stores, or the dependencies that your site has. And in each cell, you just write down what you think happens to that area if that service is down. When we first did this, we were pretty surprised by, first of all, how much of it was red, and then also how much of it we didn't actually know the status of; we didn't even know how to find out. So I would say: just start with writing this matrix, with assumptions about what you think your current state is, then try to verify those assumptions, and then get this matrix as yellow, or ideally as green, as possible.

This brings us to the next part I wanna talk about, which is resiliency testing. Obviously, testing is a great tool for making sure that you're not breaking something that you already fixed in the past, but for us it was also incredibly valuable as an exploration tool. If you already have an application with good test coverage, you can use it to, in our case, generate a to-do list of things that need fixing. What I mean by that is: we never had this requirement for our code to be resilient, but by introducing something new to our test suite, we were able to generate a to-do list of things that need fixing, and this allowed us to approach the problem very iteratively, one test class at a time basically, and not get overwhelmed. We've actually used this idea of using your test coverage as an exploration tool a lot in other projects as well, and it's super valuable.

What I mean by this is: we have this module that we wrote. I'm not gonna go into too much detail about how the module works; if you're interested, you can talk to me later. But the basic thing it does is some Ruby metaprogramming to make sure that if there is code under test that talks to a service or a data store, in this case, for example, Redis, and that code violates those resiliency requirements that we have, it's gonna make the test fail by saying: hey, this code is not specifying a fallback. You're talking to the service without catching errors, without saying what should happen if the service is down. The way we approached this is we just picked a test class, the CartsControllerTest here, included this module, fixed all the tests of that class that were broken in this way, and then shipped this as one pull request, in one contained unit basically. And then we moved on and did the same with the next test class, and the next, and the next.
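As a hypothetical, much-simplified sketch of the idea (not Shopify's actual module, which uses more elaborate metaprogramming): force every Redis command during a test to raise, so any code path that talks to Redis without rescuing, i.e. without a fallback, fails the test:

```ruby
# Prepended module that makes every Redis command fail for the duration
# of the test process -- crude, but enough to surface missing fallbacks.
module RedisAlwaysDown
  def call(*)
    raise Redis::CannotConnectError, "redis is 'down' in this test"
  end
end
Redis.prepend(RedisAlwaysDown)

class CartsControllerTest < ActionController::TestCase
  test "cart page still renders when redis is down" do
    get :show
    assert_response :success   # fails on any code path without a fallback
  end
end
```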
This allowed us to get a lot of visibility, because you have this list of failing tests, and you also have a to-do list of things that you need to work on. Now, this is good in one way, because it tells you which code is not checking for errors, but something it doesn't do is actually test the error code path. So even if this test is passing, it doesn't mean that the error code path is working; it just means there is an error code path.

Something that you might do in your integration tests, and Sam was talking about mocking earlier: maybe you have an HTTP call, and maybe you're mocking it to return a 500 error, or mocking it to return a connection refused, to make sure that the code is handling this error. But we found that mocking those kinds of network calls can be a little bit tricky, because often you don't really know which level to mock at. If you do an HTTP call, do you wanna mock the Net::HTTP library, or do you wanna mock at the TCP level, and which one is more realistic?

So we came up with a tool that we call Toxiproxy, and it's very simple; you could implement it yourself if you don't wanna use ours. The idea is basically that you have this proxy process, a separate process running beside your test suite, and your application in your test environment is not actually connecting to the database, for example, or the Redis server or whatever service you have; it's connecting to this proxy, and the proxy connects to the actual backend. And then in your test suite, you can say Toxiproxy MySQL down, which means that the connection to the database is actually interrupted. This "down" basically sends a command to the proxy saying: hey, please shut down the MySQL connection. This gives you the most realistic test that you can do this way, much more realistic than mocking, because the connection is actually down now; you're not just pretending it's down.

This is what you wanna do to make sure that the error code path is working as well. We have this bunch of areas that are super important to us, so for example, we wanna be able to serve out of the cache even if the database is down, which seems kind of obvious, but again, in Rails it's very easy to accidentally break this kind of stuff by introducing new database calls that maybe you didn't really have visibility into. We just found that this kind of resiliency testing is very, very useful for this kind of stuff.

Okay, so once you've done all the testing and you've fixed all your code, there's a last step, and this is the step that we are still in a little bit: it's called Chaos Monkey. The idea is that you have this monkey running around in your data center, pulling out cables, shutting off servers, and you wanna be so confident in your application that it doesn't worry you at all.
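Going back to Toxiproxy for a second, usage in a test looks roughly like this (API as in the toxiproxy gem's README; it assumes a proxy named mysql is configured and the test database connection points at the proxy's listen port):

```ruby
require "toxiproxy"

class ProductsControllerTest < ActionController::TestCase
  test "products page renders from cache while mysql is down" do
    get :index   # warm the cache while the database is still reachable

    Toxiproxy[:mysql].down do
      # The TCP connection to MySQL is now genuinely severed,
      # not just mocked away.
      get :index
      assert_response :success
    end
  end
end
```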
The way you can approach this chaos monkey idea, and the way we did it, is that you first do it manually. You do all this resiliency testing, and once you have some level of confidence about an area of the application, so for example you're confident that you can lose one of your three database readers or something like that, you go in and actually kill one of those nodes in production, or add a firewall rule to partition it from the others, or fill up the disk, any kind of stuff that can actually happen in production. Even if you're not exactly sure what's gonna happen, it's still better to experience this kind of problem in a setting where everybody is already sitting at the same table, ready to start fixing and ready to unbreak it. That's much better than being woken up at five in the morning by it.

The last step, once you're super confident that this is not a problem, is to automate it. We're not quite there yet, but there's a couple of companies, Netflix for example, who actually do this in an automated way. They would go in and automatically, at a random time in the middle of the night, shut down half the servers in one of their regions or something like that. If you can get to that stage, that's pretty great.

Okay, so I skipped over a lot of the details, so I wanna give everyone here who thought this was an interesting talk a little bit of homework, just to get you started with this and to show you that it's not complicated and there's a lot of easy wins to get. For your homework: if you don't do anything else, at least add timeouts, because that's super simple, super easy, and it can already make a great difference. If you're interested in how the circuit breaker thing works, implement it yourself; it's not as complicated as it might seem, you can maybe do it in like 50 to 100 lines of Ruby code. If you're running an application, create that resiliency matrix, just by filling in the cells with what you assume the behavior is, and then write some tests to check whether it's actually behaving the way you think it is. And if you're still interested and wanna learn more about Ruby metaprogramming, or if you wanna take this "make your tests generate a to-do list for you" kind of approach, implement the module that I showed; if you want some pointers on how to do that, you can talk to me after the talk.

And if you're still interested after that, there's this great book called Release It!, which covers a lot of the things that I talked about today and a couple of others; there's a lot of insight in there. It's a little bit specific to the Java ecosystem, but there's still a lot of valuable stuff in it. The same goes for the Netflix tech blog; they have a great library (Hystrix), also for the JVM, but you can learn a lot from the ideas even if you write Ruby applications. Toxiproxy, if you wanna look at it, is the gem that we use for resiliency testing. And earlier I talked about this idea of concurrency control, to make sure that when some service is used by more than one application, or by more than one worker of the same application, not too many of them are talking to this service at once; our implementation of that idea is called Semian, if you wanna check it out.

And yeah, that's all I got. Thank you very much for your time.