 First of all, as Tejesh mentioned, I work at Niranjo, and I am also part of the Bandar core team. Today I want to talk about how to build resilient systems. So why do we care about resilient systems in the first place? If you lose sleep over your system's downtime, or your customers and business do, then you care about resiliency, because downtime would probably be a disaster for them, and some developer would have to wake up at night to answer the on-call page. You want to build systems that still work, still perform, in the face of failures or very high traffic. That's why we care about resilient systems. So how do we go about building one? Let's move away from software for a bit and talk about cars. Cars are very resilient when an accident occurs: the airbag system kicks in and saves your life. Nuclear power reactors: if the power goes out, the plant doesn't just start leaking radiation everywhere; there are mechanisms built in to handle it. Do you think those engineers added that mechanism after the plant was entirely built and running, when somebody suddenly asked, "What if there's a power outage? Do we handle that case?" No, they thought about it from the start. Unfortunately, in software the way it usually works is that folks build the entire feature first. Everything is done, and then, if you're lucky, somewhere in the QA stage someone gets the idea: that cache server we rely on so heavily to meet our SLA (service level agreement) — what if it goes down? If you realize only then that you didn't handle that case, it's already late. Most of the time you realize all of this in production.
Then one fine day, when you actually need your caches to be working — when the load on your server is extremely high and you need it performing at its best — that's exactly when the cache server goes down. And when it goes down, some devs might get fired, and the rest who remain have to handle the carnage. So the only way to design a resilient system is from the start. You have to begin with resilience in mind. While designing, you have to think: what if the service I'm talking to goes down? What if my database does not respond? What if there's a bad query running on my database? You have to think about all of those things. I'm going to talk about patterns that can help you, but a lot of this is very domain-specific — things you have to plan out in advance so you can handle them in code. So, coming to the main crux of the talk: resilient design patterns. I won't claim that using these design patterns guarantees your machines will never go down, but they will improve your system's uptime considerably. Let me give an example of why you would want a design pattern over a one-time fix. We had a system in production using Unicorn as our web server. Over time, the Unicorn workers' memory usage would creep up, day by day, and within a week or two it would be so high that the machine would start swapping. At that point the only option was to restart the web server, which was very painful. While investigating why this was happening, we chased a lot of theories: maybe one of the native extensions in Ruby was leaking memory — we didn't find anything. Were we creating symbols anywhere? We checked; the only place was JSON parsing.
And that was the only place we were creating symbols. Generally, when you parse JSON, the keys repeat from response to response, so symbolizing them is fine. For those who are new to Ruby: until Ruby 2.2, which just came out this 25th of December, Ruby never garbage-collected symbols. If you create a symbol, it stays around until you kill the process — so you can't just keep creating symbols and expect things to work fine. What we eventually found was that in one place where we were parsing JSON, the keys were timestamps — unique every single time we got a response from that server — and every parse converted them to new symbols, steadily growing memory usage. It wasn't fast, but over one or two weeks the memory usage would swell. Now, to actually solve this problem you need knowledge of Ruby's symbols, and you need to be familiar with your code base: you need to know that you're parsing JSON here, and that a unique key arrives with every single response from this server. In a very large project you might never know that part of the code base. So in the meantime, to make sure no customer was actually impacted, we bounded the memory of each worker. We had Monit monitoring each worker, and if its memory usage crossed a certain threshold, we would kill that worker and restart it. That way the business was not impacted, and we had plenty of time to look for the root cause. And I think this is what these design patterns are all about: until you know exactly what the problem is, you don't lose sleep over it, and you have sufficient time to fix it properly.
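The worker-bounding we did with Monit can be sketched as a tiny policy function. This is a hypothetical watchdog, not our actual Monit configuration — the threshold and the way you measure resident memory (reading /proc, shelling out to ps, etc.) are assumptions:

```ruby
# Hypothetical memory watchdog: given each worker's resident memory,
# pick the ones past a threshold so a supervisor can restart them.
# (How you measure RSS — /proc/<pid>/status, ps, etc. — is
# deployment-specific and left out here.)
MEMORY_LIMIT_KB = 300_000  # assumed ~300 MB cap per worker

def workers_over_limit(usage_kb_by_pid, limit_kb = MEMORY_LIMIT_KB)
  usage_kb_by_pid.select { |_pid, kb| kb > limit_kb }.keys
end
```

A supervisor loop would call this periodically and kill the offenders, letting the master fork fresh workers — which is exactly the bounded behavior Monit gave us while we hunted the root cause.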
If we hadn't done that, somebody would have had to be online every time the memory started hitting the ceiling. So these are the patterns I'll be covering today. Let's begin with the very first one, bounding, and start with timeouts. Just raise your hands: how many of you know Net::HTTP's default timeout in Ruby? No one? It's 60 seconds. A lot of Ruby HTTP clients are built on top of Net::HTTP, which ships with MRI, so they default to 60 seconds too. What about the ActiveRecord connection pool — what's the timeout to get a connection from the pool? Does anyone know that number? No one this time either. It's five seconds. What you realize is that these default timeouts are very high, and having them on a production system under load is very bad — in the fail-fast pattern I'll explain why. But the sadder part is that a lot of things in your operating system have no timeout at all by default. We had one service whose job was to pick up all the messages from the other services and push them to our main queue. This service was very strange: after two or three weeks it would just hang and stop working, and there would be critical impact because messages were not being delivered to the other service. When we dug in — using GDB and printing out stack traces by trapping a signal — we found it was blocked on Socket#send. In Ruby, Socket#send is a blocking call. We were using a UDP socket, which is fire-and-forget, if you remember: when you send anything over UDP it doesn't wait for a response, it's instantaneous. But UDP has a buffer.
And when that buffer is full, if you don't have timeouts or you're not sending in a non-blocking way, that send will block permanently until someone actually goes in and restarts the thing. And because of this one bug — in something we used only to collect metrics and analyze them later — code that existed purely for reporting was blocking the main feature of the software, which was sending messages. So: use timeouts. Evaluate each and every timeout in your system. Depending on your service, maybe the acceptable timeout is 10 seconds, but it is never 60 seconds. The second point is memory bounding, which I've already talked about: monitor your worker processes, and if one crosses a certain threshold, take some defined action — maybe restart it, maybe free the memory. And finally, under bounding, I want to talk about bounding your buffers and queues. If the load on your system is very high, your UDP buffers, your TCP buffers, and — if you have mutex locks in your code base — the threads waiting on them all form queues. The sad part is you have zero control over those queues; you can't do anything when they form. The better approach is to keep those buffer sizes very restrictive and maintain your own bounded queue instead. Once you have your own bounded queue, you are in control of it. You can apply back pressure — tell the service sending all those messages, "Hey, I am completely loaded, please stop sending, or send at a slower pace" — or you can process the messages differently. But you can only do that if you know how many messages are in your queue.
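Here is a sketch of both bounding ideas in Ruby — explicit timeouts instead of the 60-second defaults, and a bounded queue you control. The host, limits, and back-pressure policy are assumptions for illustration:

```ruby
require "net/http"

# Tight, explicit timeouts instead of Net::HTTP's 60-second defaults.
http = Net::HTTP.new("example.com", 443)
http.use_ssl      = true
http.open_timeout = 2  # seconds to establish the connection
http.read_timeout = 2  # seconds to wait for a response

# A bounded queue we own: SizedQueue blocks producers when full; with a
# non-blocking push we can instead reject the message and apply back
# pressure, because we can see exactly how full the queue is.
QUEUE_LIMIT = 1_000
inbox = SizedQueue.new(QUEUE_LIMIT)

def enqueue_or_reject(queue, message)
  queue.push(message, true)  # non-blocking: raises ThreadError if full
  true
rescue ThreadError
  false  # queue full — tell the sender to slow down instead of blocking
end
```

Because the queue is ours, `queue.size` tells us the backlog at any moment — which is precisely what an OS buffer never tells you.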
If they're sitting in your UDP buffer or in a mutex's wait queue, you'll never figure out how many messages there are to process. The second pattern is circuit breakers, one of the patterns popularized by Michael Nygard in his book Release It!. They are amazing patterns, and it's really sad that a lot of folks don't know anything about them. Let me explain with the diagram. The client talks to the circuit breaker, which is just an intermediary: it forwards the request to the supplier, gets the response, and sends it back to the client. Everything works. But now say there's a connection problem, or the supplier is overwhelmed. Requests start timing out — you've set a timeout of, say, two seconds, because now you're following "use timeouts" — the supplier doesn't respond in two seconds, you time out, and you send that error back to the client. Any decent circuit breaker implementation has a fallback mechanism for this case: the fallback could serve static content, your cache, anything like that — or it might not be implemented at all, and you just propagate the error. Meanwhile, you'll have configured a failure threshold for that API, something like: if it fails more than 30% of the time in the last 10 minutes, the circuit trips and goes into the open state. In the open state you don't even bother calling the other server. This has two advantages. One, you're not wasting time timing out against a service that's failing more often than you can tolerate. Two, that other service might be overwhelmed, and by continuing to send requests you are essentially DDoSing it — both things you don't want.
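A bare-bones circuit breaker might look like the sketch below — a failure counter that trips open and serves the fallback without calling the supplier at all. The threshold and API here are invented for illustration; real implementations track failures over a time window and add metrics:

```ruby
# Minimal circuit breaker sketch: after `threshold` consecutive
# failures the circuit opens and the supplier is no longer called.
class CircuitBreaker
  def initialize(threshold: 5)
    @threshold = threshold
    @failures  = 0
  end

  def open?
    @failures >= @threshold
  end

  # Run the protected call; on failure, count it and serve the
  # fallback. Once open, serve the fallback without the call.
  def call(fallback:)
    return fallback.call if open?  # don't even bother the supplier
    result = yield
    @failures = 0                  # success resets the count
    result
  rescue StandardError
    @failures += 1
    fallback.call
  end
end
```

Usage would look like `breaker.call(fallback: -> { cached_page }) { fetch_from_supplier }` — once the threshold is breached, the supplier block is never invoked.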
And that's the bare minimum of a circuit breaker. After a certain time you can move it to a half-open state, where you try making a few calls to the service. If those calls succeed, you go back to the initial closed state where everything is fine; if things are still not okay, you remain open and keep not making calls. Circuit breakers aren't just for talking to other services, either. Elasticsearch makes use of circuit breakers: if you're executing really bad queries and they hit the limits you've configured, it opens the circuit for that query and won't even execute it. Every place in your code that tries to run that query gets an exception saying, look, this query is taking too much, we're not even going to execute it — and the other queries in your system can still go ahead. MySQL has nothing like that. In MySQL, if you have a lot of long-running queries, they steal resources from the other queries that should be executing, and — ask any DevOps or Linux admin — somebody ends up going into MySQL, looking at the process list, and killing those queries one by one so the others can start executing. So circuit breakers are even used inside databases. For implementations: if any of you are using JRuby, I highly recommend Hystrix by Netflix. It's a Java library, and with JRuby you can easily use it from your Ruby code. If you are on MRI, there is a circuit breaker gem, but frankly it's not as battle-tested as Netflix's Hystrix. So: fail fast. In both the previous patterns I've been saying that you should fail fast.
Why is failing fast so much better than failing slowly? I'll need a little math for this. This is Little's law, from queuing theory: L = λW — the length of a queue equals the arrival rate times the mean time each request spends in the system. So think about it: if you're not failing fast, and something is wrong with your server or the service you're talking to, your mean time in the system is going to shoot up, which means the length of your queue grows over time. I have a perfect example. Say you run a business offering eBooks for download. You buy the eBook and pay through your credit card; the payment service tries to approve the payment, and then you get the download link. Buy the book, pay, payment gets approved, download. Now say the payment approval talks to some credit card company's servers, and those are down. While they're down and you're not failing fast, a queue builds up at that point. Say it took them one hour to get the service back up. Once it's up, unless your processing is really, really fast compared to the arrival rate, the folks ordering right now will actually get their order after one hour, or whatever the delay in the system is. That's a terrible experience, because on any eBook site the promise is that the download link appears within a minute of purchase — and you're breaching that promise every single time. Those customers might never come back and buy from your site again. But let's say you believe in failing fast.
Every time you try contacting the credit card service it fails, and you have a circuit breaker implemented, so you go to the fallback. In the fallback, the only thing you do is push that order onto a separate retry queue — "I'll retry this later" — and clear it from the main queue. What this does is: the moment the service comes back up, the people placing orders right then get their downloads then and there, within the one minute you promised. And for the folks who didn't get theirs in one minute, you can send a mail — everything in the retry queue is an order where you breached your one-minute promise — saying that due to technical difficulties it's going to take some time, and then you process those messages too. Again, if you've noticed, across these three patterns what I'm trying to convey is control. Instead of letting your operating system or someone else have the control, it's better that you have it, because you understand your domain logic. If you run an eBook site, you want the control to decide whether to send apology mails or not. Having said that, there is one thing I really want to show you — one second. This is queuing theory, and because there was a capacity talk just before this one, I want to quickly show how you can use queuing theory to do amazing things. Say you have a single queue — a website all your requests arrive at — and C servers, say 20 of them. You're getting 100 customers per minute, 100 requests per minute, and each server can process around 15 requests per minute.
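Plugging those numbers into Little's law gives a quick sanity check. A sketch — the mean time in system W is an assumed value here, since in practice it depends on your service times:

```ruby
# Little's law: L = λ × W — the average number of requests in the
# system equals the arrival rate times the mean time each spends in it.
arrival_rate = 100.0  # λ: requests per minute (from the demo)
service_rate = 15.0   # μ: requests per minute one server can handle
servers      = 20     # c

# Utilization ρ = λ / (c·μ): how busy the servers are on average.
utilization = arrival_rate / (servers * service_rate)  # ≈ 0.33

# With an assumed mean time in system W, Little's law gives L:
mean_time_in_system = 0.5                                # W, minutes (assumed)
requests_in_system  = arrival_rate * mean_time_in_system # L: in-flight requests
```

The fail-fast connection is visible in the formula: if a dead dependency makes W shoot up while λ stays constant, L — your backlog — grows right along with it.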
The model tells you the average time each customer spends in the system, how many customers you serve in a minute, and your load average — which here is nearly zero, because capacity is much higher than the arrival rate. Let me reduce the number of servers... there: now the average wait for each customer to actually get your product is 37.0037 minutes, and it gives you an entire graph of how long customers have to wait. It can really help you with capacity planning. So, going back: there's something called a bulkhead. The bulkhead design pattern comes from ships. If you look below deck on any modern ship, it's completely compartmentalized. The reason is that if the ship meets with an accident and water flows in, the compartments stop the water from invading the other parts of the vessel — even if one part is flooded, it doesn't take out the entire boat. It's the same with services. Say you have two services, A and B, and they both talk to service C. Unfortunately, service C goes down. Both A and B, making requests to C, go down with it, and because A and B are down, services depending on them go down too. This is a cascading failure — the scenario you get when you have many microservices in a service-oriented architecture, and clearly something you don't want. But say you use the bulkhead pattern: service C gets some servers dedicated to service A and some dedicated to service B, and for good measure, so a single physical failure can't take both sets down, you put them in different AWS regions. Now even if the service-A portion of service C goes down, the rest stays up because of the redundancy you've introduced. However, bulkheads are not just for services — they can be applied to a single machine as well.
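On a single machine, one way to build a bulkhead is a fixed-size worker pool per downstream dependency, so a slow or dead dependency can only exhaust its own compartment. A sketch, with the pool sizes and rejection policy as assumptions:

```ruby
# Bulkhead sketch: a fixed-size thread pool with a bounded job queue.
# One pool per downstream dependency means a dead service can only
# tie up the threads in its own compartment.
class Bulkhead
  def initialize(size:, capacity: size * 10)
    @jobs    = SizedQueue.new(capacity)
    @threads = Array.new(size) do
      Thread.new do
        while (job = @jobs.pop)  # a nil job is the shutdown signal
          job.call
        end
      end
    end
  end

  # Reject work instead of blocking when this compartment is
  # saturated, so callers of other compartments are unaffected.
  def submit(&job)
    @jobs.push(job, true)
    true
  rescue ThreadError
    false
  end

  def shutdown
    @threads.each { @jobs.push(nil) }
    @threads.each(&:join)
  end
end
```

You would create one `Bulkhead` per dependency — say, one for payments and one for search — so threads stuck waiting on a dead service never starve the rest of the process.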
Say you have a machine with multiple threads running. You can pin one thread group to a core and other threads to other cores, so that even if the part of the service using those threads gets overloaded, it can only burn the cores it's pinned to. That way, even while some part of your machine is constantly busy, the other parts are still able to serve requests. That is the bulkhead pattern. And finally, the last pattern I want to talk about: specifications. Specifications are something a lot of folks have decided aren't important, but Leslie Lamport gave an amazing talk last year called Thinking for Programmers — you should definitely watch it if you haven't. What he says is: before you write any code, write the specification in terms of state. You have a function that takes some state A and transforms it to some state B. If you write this down before you start, you realize how sloppy your thinking is — because when a person starts coding, at most they think about the happy case where everything works fine; they never consider the cases where something might be wrong with the system. Once you write down the specification of what you want, you realize you haven't thought about a lot of cases in your system, and that's when you start thinking about the not-so-happy paths. There's a quote I want to share from Michael Nygard: software design today only talks about what systems should do, not what systems should not do.
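Writing the specification first — states, transitions, and the unhappy cases — can be as lightweight as this hypothetical spec for the eBook payment step. The state names and code are invented for illustration:

```ruby
# Hypothetical state-based spec, written before the code:
#
#   approve_payment(order, gateway)
#     requires: order.state == :paid_pending
#     ensures:  order.state == :approved          (gateway reachable)
#           or  order.state == :queued_for_retry  (gateway down — fail fast)
#
# Writing the "or" line is what forces you to confront the unhappy path.
Order = Struct.new(:state)

def approve_payment(order, gateway)
  raise ArgumentError, "order must be :paid_pending" unless
    order.state == :paid_pending
  begin
    gateway.call
    order.state = :approved
  rescue StandardError
    order.state = :queued_for_retry  # fail fast; a retry queue picks it up
  end
  order.state
end
```

The code is almost an afterthought once the spec exists — the point is that the second "ensures" clause had to be thought of before any code was written.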
And because we only think about the happy path, I can assure you that when you hit a scenario in production that you didn't think about, things won't go well. This comes before you write any sort of code, because even if you do TDD, your test coverage is only as good as the scenarios you've thought of. If you don't completely understand the problem you're trying to solve, no matter what patterns or tools you use, you will never write good code that actually solves it. So, putting all these patterns together: if there is one to start with, I'd recommend you start with specifications. Some of you might think your services aren't large or complex enough to need any of these patterns, but I'd argue that even the smallest service today is a distributed system — in the simplest case you have a database running on a different server, and you're communicating over the network, which, as we all know, is not as reliable as we'd like. So please put at least some of these patterns into practice. Ask any developer who has been on call for a long time — it's not an amazing thing to be on. So yeah, that's it. Here are the references; I'll upload them with the slides, if you want to know more.