So today I will be talking about Resilient by Design. The previous talk about microservices mentioned the CAP theorem and related ideas; think of this as the sequel to that talk, now that you know you want to build systems that don't go down. My name is Smit, I'm a Bundler core team member, and until very recently I maintained the dependency resolver. I also occasionally contribute to JRuby, and this is my handle on the internet. I work at Flipkart; many of you may not know it, but it's an e-commerce website in India. We operate at a lot of scale, we have a lot of scaling problems, and I'm very thankful to them for sponsoring my trip here. So let's just start. Why do we actually care about resilience? Companies have increasingly come to depend on software, and at this stage any downtime results directly in loss of business. It's bad for customers too, because customers are relying on that software being up. To give an example, Flipkart makes more than $1 billion in sales, so even a single minute of downtime results in a loss of around $2,000. And the interesting fact is that revenue is never evenly distributed like that. What happens is that roughly 20% of the time accounts for 80% of the total revenue, and it's during those peak times that your systems are most vulnerable. In those windows, going down for even a single minute could mean losing close to $8,000 in that one minute. So companies cannot afford any downtime on their systems. What are they going to do? They're going to rely on developers and support engineers; the famous on-call rotation exists for exactly this reason. It's up to the developer to respond to the page, whether it's late at night or any other time.
And it's up to them to make sure the systems stay up. The second reason is that even the simplest system today depends on other services; at the very least it will depend on a database on another server. And as the previous talk said, the network is not really reliable. So it's very important that the thought about resilience is put at the forefront. Otherwise, I don't think any of us here would like to handle on-calls over the weekend; that's pretty irritating, to be honest. So then the question becomes: how do we actually build a resilient system? In the 90s, testing was an implicit requirement. The expectation was just that your code should run and work; whether tests existed was implicit, and the same went for maintainable code. But those things are no longer implicit in the Ruby community: testing and maintainability get a lot of explicit focus; they are not an afterthought. The problem with resilience is that today it's still an implicit requirement. Management expects the system to be up all the time, and the developers think: OK, I wrote the system, I used this data store, it's going to stay up. But if no thought is put into resilience when designing the system, you'll be very lucky to find any of these bugs before production. Most bugs that deal with the resilience of a system show up in the production environment, when the systems are at peak load and utilization is high. That's when you see those bugs, and because you haven't thought about them, they are going to come and bite you.
There's just no way around it. The second thing is human bias. Humans inherently think only about the happy path, where everything is working: your caching servers are up, your database is there, the services you're talking to respond every time you make a request. That's why we fail to see the paths where those things are not actually up and working. So the only way to think differently is to think about resilience from the start. Whenever you're designing your system, you need to ask: if my caching servers go down, have I planned capacity for that, or are they highly available? All of that has to be put in from the start. There are things that can actually help you, and that's what my talk is about: resilient design patterns. However, I'd like to put up a disclaimer for this talk: these are not a silver bullet. It's not that if you use all of these patterns, your system is guaranteed to never go down; things are never as simple as that. A lot of it depends on the domain, on the system you're designing. For example, on the Flipkart website, our core job is to serve the product page so the customer can see whether an item is available, and once they click buy, to deliver whatever they ordered. That is our main thing. So if the recommendation system is facing an issue, we can load the page without it. If comments or reviews are not showing up, we can decide not to show them while those systems are down. Obviously, those trade-offs are different for every service. Netflix, for example: if their bookmarking service is down, they simply won't give you the option of resuming playback.
They'll just start from the beginning. They can do that because they know the main thing is for you to be able to watch videos. So the point is: it depends on your domain and on how you have designed your systems. And I think that's actually a good thing, because there really is no free lunch; if you're designing a system like this, you need to put thought into it and think through the failure cases. So, with that in mind, let's start with the patterns. I think this is the most important pattern in this talk, which is why I'm putting it first: if you take nothing else away from this talk, take this pattern. The biggest waste of resources is burning CPU cycles and clock time only to get results that you have to throw away. Failing fast is the best thing you can do when the services you're talking to are not responding, or you know the call is going to fail. The reasoning behind failing fast comes from a mathematical field called queuing theory, specifically Little's law, for those of you who know it. Say your system handles incoming messages from a queue. Little's law says that the average length of the queue equals the arrival rate of messages multiplied by the average time each message spends in the system (L = λW). If your response times go up, the time in the system goes up, and the size of the queue increases. So if you're talking to a service that's not responding, and you didn't even bother to change the default timeout of Net::HTTP, which is 60 seconds, it takes 60 seconds for each call to fail. Your response times will be very, very high, and that will directly inflate your queue size.
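To make Little's law concrete, here is a quick arithmetic sketch; the arrival rate is an invented number for illustration, not a real Flipkart figure:

```ruby
# Little's law: L = lambda * W
#   L      -> average number of requests in the system (queue length)
#   lambda -> arrival rate, in requests per second
#   W      -> average time each request spends in the system, in seconds
arrival_rate = 50.0 # requests/sec -- illustrative number

# Healthy case: the downstream call answers in about 200 ms.
in_flight_healthy = arrival_rate * 0.2   # => 10.0 requests in flight

# Unhealthy case: every call waits out Net::HTTP's default
# 60-second read timeout before failing.
in_flight_timeout = arrival_rate * 60.0  # => 3000.0 requests piled up
```

Same arrival rate, three hundred times the backlog: the only variable that changed is how long each request lingers before failing.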
The other thing that's highly dependent on your response times is the utilization of your system. If you look at this graph, utilization goes up as response time goes up. So if each request is taking 60 seconds, the utilization of your entire service will be very, very high, and at that point all you can do is add more servers and hope for the best. The cool thing is that you can also look at it the other way. Say you've optimized your code as best you can and brought the response times down to a certain level. If your utilization is still going above 80%, you can easily see it's going to have a very negative impact on the performance of your system, and you can do capacity planning based on that. Another thing that's very cool about this: say you're on an agile team, the utilization of your team is around 90%, and your manager comes along with an ad hoc task. Using queuing theory, you can show that the turnaround time for that particular task is going to be very, very high. So math is pretty cool; you cannot run away from the math. The only thing you can do is keep your response times as low as possible. Here is an example system I created just to illustrate. Say you have an ebook download service that guarantees an SLA of five minutes: if you buy an ebook, within five minutes you'll get an email with the download link. It's pretty basic stuff: a checkout service sends messages to the payment service through a message queue, because we don't want to lose those messages.
And the payment service talks to an external service to verify that the payment is authentic, and then processes it further. Now let's assume that the external service starts failing: it starts timing out, so intermediate calls to it fail. Each payment call to the external service now takes 60 seconds to fail. The rate of incoming messages is something you can't really control, so messages start piling up in the message queue. At this stage, even when the external service comes back up, you have a pile of old messages plus new incoming messages, because people are still placing orders. So you fail to meet the SLA for orders placed while the external service was down, and because of the backlog, you also fail to meet it for newly placed orders. Instead, let's embrace the fact that things are going to fail, and use a circuit breaker in between. It recognizes that calls to the external service are failing, and as a fallback, we store those messages and retry them later. Our response times stay flat, and the message queue stays empty, because we're not even bothering to call the external service. Now, when the external service comes back up, newly placed orders can meet the SLA: those customers still get their download links on time. The messages stored in the separate queue get retried later; those customers won't get their download links on time, but you can send them a special mail or give them a discount.
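The fallback just described — storing failed messages for a later retry while returning immediately — might be sketched like this. All names here are hypothetical, not Flipkart's actual code:

```ruby
# Sketch of the fail-fast fallback: if the external verification
# fails, park the message in a retry queue and return right away,
# instead of letting 60-second timeouts back up the main queue.
class PaymentProcessor
  def initialize(external_service, retry_queue)
    @external_service = external_service
    @retry_queue = retry_queue
  end

  def process(message)
    @external_service.verify(message) # assumed to have its own tight timeout
    :processed
  rescue StandardError
    @retry_queue << message # retried later, once the service recovers
    :deferred
  end
end

# A stub external service that is currently down:
down_service = Object.new
def down_service.verify(_message)
  raise "read timeout"
end

retry_queue = []
processor = PaymentProcessor.new(down_service, retry_queue)
processor.process(order_id: 42) # => :deferred; message parked for later
```

The caller gets an answer in microseconds either way, so the queue in front of this service never grows, no matter how long the external service stays down.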
But the main thing is that in this scenario, you are in control: you know exactly which messages failed, and you can design your system around that instead of being at the mercy of the external service. So yes, that is the most important idea in this talk. Now, how do we actually make use of it? The first way to achieve it is through bounding. If anywhere in your system you have unbounded access to resources, that is really, really terrible; you don't want anything like that. Bounding is a huge topic of its own, but I specifically want to cover three things. The first is timeouts. The default timeouts in your libraries are horrible. Net::HTTP, as I mentioned earlier, has a timeout of 60 seconds, so it takes 60 seconds for the read timeout to kick in and tell you that you can't reach the server you're trying to talk to. And the scary part is that some things don't have a timeout at all; they never time out. We had a piece of infrastructure at Flipkart that collected messages from local services and relayed them to the main messaging queue; it was the only path through which any service could talk to the outside world. This infra piece, written in Ruby, would hang every two or three weeks, and we couldn't figure out what was wrong. When we dug into it, we found that it sent metrics to StatsD over a UDP port, and there was one metric that nobody was reading at all. The UDP socket buffer on Linux is 128KB, and it was getting full. Once that buffer was full, the blocking write would hang, and the process would get stuck in that state.
The way we solved it was with a non-blocking write, which in Ruby you can do using write_nonblock. So yes, some systems don't even have a timeout, and you need to go through your application and ask: does every call here have a proper timeout? The great thing timeouts give you is fault isolation: if another service is not responding, the timeout shields your system. You can use timeouts in conjunction with a circuit breaker, which is the next pattern I'll talk about, or at the very least with retry logic. The second thing is limiting memory use. Whenever people use caching, something like Redis, they completely forget about limiting its memory use. The same goes for application servers: with Unicorn, you can put a watch on each of your workers and say that up to 85% memory use is OK, and as soon as it crosses 85%, notify the developers. Here's what happens when you don't: there was another case at Flipkart where, every two or three weeks, a system's memory usage would grow so much that it would start hitting swap, and the performance of that host would become really, really terrible. When we looked into it, we found that in one place it was parsing JSON with symbolized keys, and unfortunately one of the keys was unique every single time. For those who are new to Ruby: symbols were not garbage collected until quite recently, in Ruby 2.2, so every symbol you create stays around until you kill and restart the process. In that case, however, none of us had to get up late at night or early in the morning to fix the system.
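The write_nonblock fix for a StatsD-style sender might look something like this; the host, port, and metric name are illustrative:

```ruby
require "socket"

# A best-effort UDP metric send that drops the metric when the
# kernel send buffer is full (or nothing is listening), instead of
# blocking the whole process the way a plain write can.
def send_metric(sock, payload)
  sock.write_nonblock(payload)
rescue IO::WaitWritable, SystemCallError
  # Buffer full, or no listener on the other end: drop the metric.
  # Metrics are best-effort; the service itself must keep running.
  nil
end

sock = UDPSocket.new
sock.connect("127.0.0.1", 8125) # illustrative StatsD address
send_metric(sock, "jobs.processed:1|c")
```

The trade-off is explicit: under pressure you lose a metric sample, not the process that was supposed to relay every message to the outside world.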
What we had was a worker monitoring system: if a worker went above 90% memory, it would restart that worker, and things would keep working. That helped us out a lot. It's not an ideal solution, but what it gives you is time to actually debug the issue. Otherwise, any time the host starts hitting swap, it's going to impact the business, and that is something you cannot accept. The third point is limiting CPU. A lot of the time there are other processes running on your host that do auxiliary things, maybe provide health checks; they are not the primary thing running there. It's your service on that host that matters most. But sometimes the code in one of those daemons, or a library it uses, goes into some kind of infinite loop, or starts consuming more and more of the system's resources. You can easily limit that daemon using cgroups, and what that gives you is isolation: even if the daemon tries to use everything, it's confined to one core of your system, and your primary service won't go down because of it. Finally, every time you use a mutex lock or a buffer in your system, those are implicit queues, ones you have no control over. It's much better to have an explicit bounded queue, like a messaging queue in front of your service, that can apply backpressure when it's full. That gives you much more control than an implicit queue. The next pattern, I think, is one of the coolest patterns in existence: the circuit breaker pattern. Circuit breakers sit between the client and the server, or the supplier.
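The explicit bounded queue mentioned above ships with Ruby's standard library as SizedQueue: push blocks — applying backpressure to the producer — once the queue is full. A small sketch:

```ruby
# An explicit bounded queue: SizedQueue#push blocks once the queue
# holds `max` items, instead of letting it grow without limit.
queue = SizedQueue.new(2)

producer = Thread.new do
  5.times { |i| queue.push(i) } # blocks whenever the queue is full
  queue.push(:done)
end

consumed = []
until (item = queue.pop) == :done
  consumed << item
end
producer.join
consumed # => [0, 1, 2, 3, 4]
```

The producer here can never get more than two items ahead of the consumer; a runaway upstream stalls instead of silently eating all your memory.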
If everything is fine, the circuit breaker doesn't even come into play. But when you make a request and it starts timing out, when there's some connection problem between the client and the server, then after a certain threshold of errors the breaker concludes that the other service is in trouble, and it trips the circuit. From that point on, future calls are not even made to the server; they fail right then and there. Later, after a certain amount of time, the breaker makes a trial call to the other service to see if it's up. If it is, it closes the circuit and everything goes back to normal; if it's still timing out, the circuit stays open and you still don't need to make real calls to it. There are really good implementations of circuit breakers out there: Semian by Shopify is a pretty good implementation in Ruby, and if you use JRuby, you can make use of Hystrix, written by Netflix, a very well-written and battle-tested library. Going forward: bulkheads are a concept that comes from ships. Bulkheads are the watertight compartments in a ship, so even if the hull is partially damaged, the flooding won't sink the whole ship. The idea is that a single failure doesn't bring down the entire vessel, and that is something you can use in your services. Say your website and your logistics system both need product information: the website needs it to show products to the user, and logistics needs it to determine whether an item is dangerous, and whether it can be transported by air or by road, depending on the item's category.
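The circuit breaker state machine just described can be sketched in a few dozen lines of plain Ruby. This is purely illustrative — in production, reach for a battle-tested library like Semian or Hystrix rather than rolling your own:

```ruby
# Minimal circuit breaker sketch: trip after `threshold` consecutive
# failures, fail fast while open, allow a trial call after `cooldown`.
class CircuitBreaker
  CircuitOpenError = Class.new(StandardError)

  def initialize(threshold: 3, cooldown: 30)
    @threshold = threshold # consecutive failures before tripping
    @cooldown  = cooldown  # seconds to stay open before a trial call
    @failures  = 0
    @opened_at = nil
  end

  def call
    # While open and still cooling down, fail fast: don't even
    # attempt the real call.
    raise CircuitOpenError if @opened_at && (Time.now - @opened_at) < @cooldown

    begin
      result = yield  # the real call (or a trial call after cooldown)
      @failures = 0   # success closes the circuit again
      @opened_at = nil
      result
    rescue StandardError
      @failures += 1
      @opened_at = Time.now if @failures >= @threshold
      raise
    end
  end
end

breaker = CircuitBreaker.new(threshold: 2, cooldown: 60)
2.times do
  begin
    breaker.call { raise "timeout" } # two failures trip the circuit
  rescue StandardError
  end
end
# Further calls now raise CircuitOpenError immediately, without
# touching the downstream service at all.
```

Note how this combines with Little's law from earlier: once the circuit is open, W collapses from a 60-second timeout to a few microseconds, so the queue in front of you stays flat.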
Now, in this case, say the website is facing tremendous load: there's a product launch, and a lot of people are hitting it. That load on the website is going to hit the product service too, and eventually the website will bring down the product service under the load it's experiencing. At that point, logistics is impacted as well, even though it did nothing wrong, and once the logistics system is down, any systems that depend on it will also go down. This can trigger a cascading failure throughout the system, each dependent piece falling in turn. Using the bulkhead pattern, however, we can give the website and logistics each a dedicated set of product-service instances. Even if one consumer is causing a lot of problems, the other is shielded and won't be impacted. Note that bulkheads are very different from adding more capacity: adding capacity could still lead to the same cascading problems, whereas here we separate the servers so the two consumers can't impact each other. There are other levels where you can apply bulkheads too; as a concept it's very powerful. Say you're using circuit breakers with a separate thread pool for each service you call, and you notice one of those thread pools is completely saturated: there are no free threads. At that point, you can fail the call to that service right there and use the fallback instead. In that sense, one system cannot forcibly bring down everything else. Finally, the last thing I want to talk about is steady state. So say you use all of these patterns, your systems are staying up, and nothing can go wrong. Actually, that's not true.
If you have to fiddle with your systems manually — if human intervention, restarting things and so on, is needed to keep a system running for weeks — that in itself introduces a chance of introducing errors into the system. What you want is as little human effort as possible. There's a lot to say about that, like automating your deployments, but there are two specific points I want to make. First, have log rotation in place. The worst thing is having logs that are weeks old, and one day realizing your service is out of disk space; being unable to write logs can bring down your entire service on that host. So set up log rotation. It takes five minutes to do, and a lot of folks don't do it. The second is something that never makes it into the first draft of a system: an archiving strategy. The way archiving usually works is that someone writes a script, and a DBA or someone like that archives the data for you, and that is really terrible, because your archiving strategy is highly dependent on your domain. In Flipkart's case, if an order is delivered or customer-cancelled, those are terminal states: we know nothing more will happen to that order, so at that point we can archive any data associated with it, any unit, anything. That's something to think about while you're designing your system, because once your schema and everything around it are set, it's much harder to introduce an archiving strategy later on. Lastly, I want to end this talk with a quote from Michael Nygard. Michael Nygard wrote a book called Release It!, which is the Bible on building resilient systems.
He says that software design today only talks about what a system should do; it doesn't address what a system should not do. To build a resilient system, it's very important that we also think about what a system should not be doing. Putting it all together: we want to fail fast when we realize the call we're about to make is going to fail. We want to bound our resources and use timeouts; at the very least, discover what default timeouts the libraries you use ship with. Use circuit breakers at every integration point in your system: anywhere you make a call to a different service, put a circuit breaker there, so if that system goes down, you can use a fallback instead, whether that's a cached or stale value, or just failing fast. And finally, isolate your failures: use bulkheads, and make sure that if one service is behaving badly, the failure is contained to just that service and doesn't affect the others. So yeah, that's it. Thank you.