This talk is going to be a discussion around understanding failures in complex distributed systems, specifically on the cloud. This topic is very close to my heart, so I was excited when I was asked to talk about chaos engineering and failures. I have played the roles of an SRE and a platform engineer over the last several years and have spent a good deal of my time watching every part of the stack break down in front of me, often at times when I least expected it. So this is something I'm really passionate about.

We are going to be talking about failure today, and at some stage, as we discuss failures, it can get depressing. But I promise you there is light at the end of the tunnel: hang in there with me and we will talk about how to get around failures and how to build complex distributed systems that don't fall over, that are resilient and highly available.

I work at a cloud infrastructure company in San Francisco called HashiCorp, and we write software for the public and private clouds. How many people have heard of or used Vagrant or Consul? Okay, quite a few. HashiCorp builds both of those tools. I am the lead engineer on a cluster scheduler called Nomad. Before HashiCorp, I worked in the platform engineering group at Netflix, where I first worked on the monkeys, the Simian Army, and later on Apache Mesos, building a distributed cluster scheduler specifically for AWS.

First of all, I want to talk about how failures correlate with scale and with change. The nature of an organization, the type of business, the culture, et cetera, influence what kind of failures we can expect to see. Scale and rate of change are the two primary factors that dictate failures. By change, I mean changes in software, changes in configuration, new features rolling out to users, and so on. Scale basically means the number of users using a service, the size of the compute infrastructure, et cetera.

So we have enterprise IT in one corner of the graph. There, things usually move very slowly and everyone plays by the book: there are change gates, release trains, et cetera. The scale is usually small, tens of thousands of users in most cases for internal applications, and the fleet of servers serving those applications is also small. Things usually work well, except for planned downtime, like database administrators taking a whole application down on a Friday night or over a weekend for maintenance. We see outages like that, but otherwise things move along fine.

Telcos are interesting. They have a massive amount of hardware, but the rate of change is less frequent, not because of the culture but because of the nature of the business. You don't see the software on your routers or edge gateways changing every now and then; it doesn't need a lot of change. The failures in a place like a telco are usually hardware failures, a router or a switch going bad, and then they replace it.

Startups are a different beast. We work all day churning out code, hoping users are going to like it, so we move super fast. The scale depends on how much success the company is having.
But usually in the early days the scale is not that big, while change is happening constantly, because as the business figures out what it wants to do, it keeps releasing newer versions of its software.

And then comes web scale: companies like Netflix, Google, Facebook, where features are getting churned out all the time and clusters are getting created and torn down all the time. In those kinds of places we see everything break. We see software failures, we see hardware failures. But the end users still get a really high quality of service. You usually don't see Facebook going down, and when you're watching Breaking Bad on Netflix, you don't see Netflix going down. That top-right part of the graph, the web-scale scenario, is the most interesting one, so we will concentrate on it for the rest of the talk. We will see how failure happens, how we can make it part of our life, and still run services that work really well.

Something interesting has happened with hardware over the last decade. Instead of investing in reliable mainframes, we are buying compute power from IaaS providers like Amazon, Google, and Microsoft. Compute infrastructure on the cloud is made of commodity hardware and networks; you don't hear IaaS providers promising really high availability on a per-node basis. And with Moore's law slowing down, we are not seeing processors get faster the way they were getting faster ten years back, but we are running more processors: instead of every machine having one or two cores, it's common to see 20 or 32 cores on every machine. The future belongs to software that can distribute workload across multiple cores. If you look at NUMA hardware, it is more or less like running processes across a network, just with the cores connected by faster interconnects than a traditional network. So instead of running a few reliable servers, in the data center of 2016 we are running tens of thousands of unreliable servers, which is cool; we'll see how we can tame them.

Something similar happened on the application side. Instead of writing monoliths, we are moving toward service-oriented architecture and microservices, and distributed systems are pretty much the new normal. There is no getting around building distributed systems if you're doing service-oriented architecture or microservices. People who have been operating at web scale since the early 2000s have been doing SOA for a decade or more; it's just becoming more mainstream now, so we hear more people talking about microservices, but it's nothing new.

With microservices, a single end-user request typically fans out to 30 or 40 services over the network. This is how things used to fan out back in the day at Netflix, which runs 300 to 400 microservices to power the streaming control plane on AWS. Once a request gets into a data center, it fans out like this: all those dots in the chart are individual microservices, or nodes doing caching and things like that.

I have typically heard people say that monoliths are a single point of failure. That's true, but in the new world we haven't really gotten rid of our points of failure.
Instead of a single point of failure, we have amplified the number of ways things can fail. That's the reality: there is no silver bullet where microservices make you more reliable. Instead, you are now going to be dealing with a different kind of issue. At the same time, with service-oriented architecture and microservices you can get to the scale at which you can run things for thousands, tens of thousands, or millions of users. So there are no silver bullets in life, and in computing there is no silver bullet either.

But if there were a silver bullet, caching would probably be the closest. Caching is what people reach for to make things fast. It's very common: something is running slow? Okay, put a cache in front of it, put memcached or Redis or something in front, and things magically become fast. But things also fail a lot when caches go haywire. So let's see what happens.

In the steady state, most user requests are served by a backing cache server, and the cache miss rate is usually small, around 1%, or maybe up to 5% if the caching algorithm is bad. So on a normal night, when users are browsing your website, your caches work like a champ, and your origin servers, where most of the business logic resides, don't get a lot of hits. Once in a while, on a cache miss, the request goes to the origin servers, a read happens at the origin, and we write the data back to the cache so that from then on the data is available there.

But what happens when things go a little bad? On Amazon, at least, the network is commodity grade; there is a limit to how many packets per second you can push. So if you're running a cache server like memcached and heavily relying on it, the bottleneck is often not the amount of RAM the machine has but the PPS, the packets per second, that machine can do, and different instance types have different packet-per-second limits. Say something has gone wrong and one of your caches is not doing well: the normal 1% miss rate becomes 10% or 30%. If the origin servers have a database behind them, which is very common since services need a persistent store, and the amount of reads and writes that Cassandra or MySQL or Postgres can absorb is not that high, the origin servers become slow. Because the origin servers are slow, the edge servers get slow. If you're running a threaded server at the edge, like Tomcat or Jetty, the number of busy threads keeps increasing, the load average on those machines climbs, and, catastrophically, you get a page.

That's how failures cascade. Even a failure that starts in isolation in a modern distributed system will cascade. It's like a snowball: it keeps getting bigger as it rolls down the hill. Applications that use distributed architectures have dozens of dependencies, as I was showing you in that chart where requests fan out, and each of those dependencies will inevitably fail at some point.
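To pin down the cache read path described above, here is a minimal sketch in Go, not any actual Netflix or memcached client code; the Cache type, ReadThrough, and the 200 ms budget are all made up for illustration. The part that matters is bounding the miss path, because unbounded waits on a slow origin are exactly how the cascade starts.

```go
package cache

import (
	"context"
	"sync"
	"time"
)

// Cache is an in-process stand-in for memcached or Redis;
// the real thing would be a remote fleet.
type Cache struct {
	mu   sync.RWMutex
	data map[string]string
}

func New() *Cache { return &Cache{data: make(map[string]string)} }

func (c *Cache) Get(key string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.data[key]
	return v, ok
}

func (c *Cache) Set(key, val string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[key] = val
}

// ReadThrough serves a key from the cache; on a miss it falls back to
// the origin (where the business logic and database live), then writes
// the value back so the next read is a hit. The origin call is bounded
// by a timeout so a slow database cannot pile up waiting callers.
func ReadThrough(ctx context.Context, c *Cache, key string,
	origin func(context.Context, string) (string, error)) (string, error) {

	if v, ok := c.Get(key); ok {
		return v, nil // steady state: the vast majority of reads end here
	}
	ctx, cancel := context.WithTimeout(ctx, 200*time.Millisecond)
	defer cancel()
	v, err := origin(ctx, key)
	if err != nil {
		return "", err // surface the failure instead of blocking
	}
	c.Set(key, v) // write back so the data is available from now on
	return v, nil
}
```

When the miss rate jumps from 1% to 30%, that timeout on the miss path is what keeps edge threads from piling up behind a saturated origin.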
There is nothing that can prevent those failures from happening, and if the host application is not isolated from them, it risks being taken down with them. Even if all the other dependencies are doing really well, one dependency that has become slow is enough: the end-user requests are going to get slowed down. You might have 99 services that are doing really well, but one service that becomes slow can take down the whole system. What that means, again, is that if you are using a threaded server at the edge, you'll see load averages climb, and you'll quickly see something like this: when the volume of traffic is high, a single backend dependency becoming latent can saturate all the resources on all of your servers in seconds. Every point in an application that reaches out over the network, or into a client library that might result in a network request, is a source of potential failure. If you are making calls over the network and fanning out, every single call is now a point of failure.

One of my favorite books in this domain is Drift into Failure by Sidney Dekker. One of the key takeaways is that we can model and understand components in isolation, but when they are released into competitive, nominally regulated societies, their connections proliferate, their interactions and interdependencies multiply, and their complexities mushroom. Even if you are writing really nice services, and you think you have nailed it with Rails or whatever your favorite web framework is, and each of your microservices is written really well, it doesn't matter: when you run all of them together, you'll see really complex behavior between them, and the system will fail in ways you might not even have imagined. That's pretty much what that quote is about.

So we must design for failure; there is no getting around that. Resiliency is by design. I have heard people say things like, oh, we are feature complete. Feature complete really doesn't mean anything. If you're running a service, if you're not a consultant who walks away after writing the code for 40 days, feature complete just means you have finished maybe 10% or 20% of the contract, because most of the time and money invested in software is spent once the software goes live, something like 60% of the total investment.

So the other questions we should be asking as infrastructure people are: What are the implications of dependencies failing? Have we, as a team, taken care of that? Are we making sure that all the network calls we make are isolated, so that when a dependency fails the whole system doesn't go down? What happens when there is a surge in traffic? I hear from people who run on Amazon: oh, we'll scale up when we get 40% more traffic. It doesn't work like that. If you have a database beneath a stateless service, the database won't be able to scale up as quickly as your stateless web servers. So by scaling up when you have more traffic, you can actually do more damage to the system. As a team, we need to think about what we are going to do when traffic spikes, and maybe the right answer is to do a lot of throttling, as in the sketch below.
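Here is a minimal load-shedding sketch in Go, assuming a plain net/http service; the Throttle wrapper and the in-flight cap are hypothetical names and numbers, not a library API. It rejects excess requests fast instead of queueing them.

```go
package shed

import "net/http"

// Throttle wraps an http.Handler and sheds load beyond maxInFlight
// concurrent requests. Rejected callers get a fast 503 instead of
// queueing up and dragging everyone's latency down.
func Throttle(maxInFlight int, next http.Handler) http.Handler {
	sem := make(chan struct{}, maxInFlight)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- struct{}{}:
			defer func() { <-sem }()
			next.ServeHTTP(w, r)
		default:
			// Full: fail fast. Serving most users well beats
			// serving all of them badly.
			http.Error(w, "server busy, try again",
				http.StatusServiceUnavailable)
		}
	})
}
```

Usage would look like `http.ListenAndServe(":8080", Throttle(512, mux))`, with the cap chosen from load testing rather than guessed.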
It's better to have 70% of your users having a good experience than 90% of your users having a degraded experience. And how quickly can the system recover? When we are on pager, we need to be thinking about our mean time to recover, MTTR: how do we mitigate those failures? How does the system grow? Initially, when you roll out a service, it's easy, you have tens of users. But at some stage you're going to have tens of thousands of users, and if you are not thinking about that from the get-go, it's going to be a challenge later, as you have to change the architecture, how you do web operations, how you run your services. And what are the implications of data centers failing? Just yesterday, I think, we heard about Google's interconnects failing and GCP being down. If you have run any infrastructure on AWS, you probably don't trust us-east-1 a lot; you hear about us-east-1 going down every year or two. That's fine, but as web infrastructure people we need to be aware that those kinds of things can happen. Feature complete is really not the last mile of our journey.

So why do we talk so much about failures? What is the best way to prevent them? The best way to prevent failures is to fail every day: to write systems that heal on their own and need very little manual intervention, and to make failure part of the culture of the engineering org, where we constantly look at how things fail and even make systems fail on purpose, so that we know how things will behave when they finally fail and we are not watching.

This brings me to chaos engineering. It's something that has emerged in recent times; I think when we were at Netflix we started with the monkeys and such, and it has caught on. I'm going to talk about the monkeys a little later, but chaos engineering is the discipline under which people are now classifying all this failure injection: how do we make our systems fail, and make them better?

So how do we do failure injection? Firstly, I want to say that faults that happen across network boundaries are blurry; it's very hard to reason about why a failure happened. If you rely on a service B and you see that it is not responding, or its latency has increased, there are so many possible reasons: the network is having issues, some hardware has failed, the JVM is taking a nap while it's GCing, and so on. It's not really easy for us to distinguish between hardware and software failures. The easiest failures to introduce in a system are hardware failures: take down a machine, or change your iptables rules (netfilter, these days) to introduce some packet loss, and see how systems break when they rely on a network call to that service. Just by failing our hardware, we'll see maybe 60% or 70% of the behaviors we would see when software fails. Failing a switch is easy, failing a router is easy; that's much easier than, say, forcing the JVM to do a garbage collection and seeing what happens.

And there is something else that can be done: introducing failures very slowly.
One way of introducing failure is to shoot a node down and see what happens, or shoot a switch or a router and see how the overall system copes. But that's not usually how things play out in production; often our dependencies fail slowly, and that's more realistic. The way to do that is with netfilter, iptables, and things like that: you drop packets on the floor, but you don't drop 90% of the packets on the first go. Start from 5%, then go to 10%, and see how the system behaves. If you have seen the work Kyle Kingsbury is doing around Jepsen, they do similar things, manipulating iptables to introduce network partitions and latencies. We wrote something called Latency Monkey, but I'll talk about that later.

Also, failure injection without auditing and monitoring has no value. If you're using a monkey to shoot down nodes to see how your system behaves, but you have no monitoring and no auditing, you're not going to get much value out of it. Unless everything moving in the system is being monitored, it's not very useful.

To introduce failures at Netflix we wrote the Simian Army, which is basically a collection of monkeys that run randomly across our AWS clusters and introduce failures. The first and most famous of all the monkeys is Chaos Monkey, probably because it was the first one to be open sourced. It shoots down a node, which is pretty simple, and it works; I think they now have plugins for Google Compute Engine as well. Chaos Monkey will randomly, during business hours, shoot down some percentage of your nodes. It's interesting, because when you start off with Chaos Monkey you could say, oh, I'm going to kill everything, but shooting down a database cluster or a ZooKeeper is a totally different ball game. Don't do it in week one or month one; that would be my suggestion.

Then comes, okay, I didn't change the title on my slides, but next comes Chaos Gorilla, which takes down a whole data center: what happens when a data center goes down? And what happens when a whole region goes down? Chaos Kong destroys a whole region; it tests what happens when, for example, us-east-1 goes down: can we recover from that massive amount of failure and steer traffic elsewhere across the globe?

Then comes Latency Monkey, the monkey that introduces latencies. Latency Monkey is not very flexible; it doesn't use iptables and such. It binds to Netflix's client-side load balancer, Ribbon: if you're using that software load balancer, Latency Monkey works with it to make certain requests fail or slow down. But it doesn't work at a very low level, and I would recommend using iptables rather than a monkey that works higher up the stack.

And then Commander Monkey, which is actually something I wrote when I was there. Commander Monkey coordinates all the monkeys.
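Before getting to how the monkeys are coordinated, here is a rough Go sketch of the gradual packet-loss ramp described above. The netfilter "statistic" match used here is real, but the percentages, the ten-minute interval, and the program itself are made up for illustration; this is not how Latency Monkey was implemented, and it needs root on a host you are allowed to break.

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
	"time"
)

// rule builds an iptables rule spec that randomly drops a fraction p
// of inbound packets, using netfilter's "statistic" match.
func rule(p float64) []string {
	return []string{"INPUT", "-m", "statistic", "--mode", "random",
		"--probability", fmt.Sprintf("%.2f", p), "-j", "DROP"}
}

func main() {
	// Ramp loss up slowly (5%, 10%, 25%), watching dashboards at each
	// step, instead of dropping 90% of packets on the first go.
	var prev []string
	for _, p := range []float64{0.05, 0.10, 0.25} {
		r := rule(p)
		if err := exec.Command("iptables",
			append([]string{"-A"}, r...)...).Run(); err != nil {
			log.Fatalf("iptables -A: %v", err)
		}
		if prev != nil { // retire the previous, gentler rule
			exec.Command("iptables", append([]string{"-D"}, prev...)...).Run()
		}
		prev = r
		log.Printf("dropping ~%.0f%% of packets; observe before ramping", p*100)
		time.Sleep(10 * time.Minute)
	}
	// End of experiment: remove the last rule.
	exec.Command("iptables", append([]string{"-D"}, prev...)...).Run()
}
```

Pair anything like this with dashboards and auditing; as said above, failure injection without monitoring has no value.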
Coming back to Commander Monkey: if you're running Chaos Monkey against one node, you don't want to also run Latency Monkey against a service on that node, because you want to see how each of these failures plays out in isolation before coming to a conclusion about what happens when things go bad. So you need a top-level control plane to coordinate all the failure injection happening in your organization. In Netflix's case, that was Commander Monkey. We wanted to call it Banana Stand; I don't know how we ended up with Commander Monkey, but I think it's cool.

So let's revisit all the failures that can happen. Node failures are something you will see very often on a public cloud, at least on AWS. The way to handle node failures is to deploy clusters, not nodes. There is a common saying in infrastructure: cattle, not pets. It means you don't babysit a server; you don't give servers names. You think of your software as a cluster of servers, and state should always live at the service level: you shouldn't have one database server backing five nodes of your web tier. If you're using a persistent medium, that persistent service should itself be deployed as a cluster with replication. And unleash Chaos Monkey to see whether you can really withstand node failures.

Switch failures: you don't really see raw switch failures on the public cloud, and even when they happen, they manifest differently, as packets per second dropping or as network partitions. Switch failures manifest differently in different environments. If you are in your own data center, the easiest way to withstand switch failures is to spread services across racks, use smart load balancers, rely on service discovery, and unleash Latency Monkey to see what happens when there is a network partition.

Data center interconnect failures: if you're on AWS or any other public cloud, you'll see connectivity between zones go down and network partitions form between zones. Again, spread clusters across data centers, and don't rely on the assumption that the high-speed connectivity between them, mostly fiber, will always be up. When it's up, it's up, but when it's down you need to be ready. Load balancers should fall back to a healthy data center: if you have a load balancer at the edge, make sure it can fail over between data centers. The easiest way to do that is to use something like Consul; at Netflix we had Eureka, but it's pretty much the same idea. And if you're using HAProxy backed by Consul, HAProxy can route between data centers as well.

Region failures: run services active-active. If us-east-1 in Virginia goes down, it should not mean your users stop getting the service you provide. The best way to do that is to use a geo DNS service; Dyn offers one, and I think Route 53 lets you route users based on their geographic location as well.
So you basically need a control plane in place that takes people from certain geographic locations and routes them to a specific AWS region, or a Google region, or your on-prem data center. DNS should not be a black box where everything gets routed to one place; it's crucial to be able to use DNS to move traffic from one data center to another. Then unleash something like Chaos Kong and see whether you can take a region failure.

DNS failures are the worst. A year or two back, when UltraDNS had that DDoS attack, I was pretty much twiddling my thumbs; there was not much you could do. Your main defenses around DNS failure are things like lowering TTLs. Make sure you can move your DNS records from one provider to another. It's not pleasant, because most DNS providers don't play well together and their APIs are not that good, but you have to be ready for it, and you need to know whom to escalate to when it happens. I don't have great advice there, but I have been in that situation, and we had to move our records from one provider to another.

Chaos engineering should also be applied to human resources. If a colleague is not around, it should not be the case that there is a production failure and the team cannot handle it. If you're a team of five people, you know what we used to say: let's play Chaos Monkey on people. Ask someone to go on vacation and see how we do. Of course it has some cultural issues, and not every company might be cool with it, but in the end it pays off: it makes sure everyone knows how to run maintenance on nodes, how to run maintenance on a service, and so on. And prepare runbooks and dashboards; don't rely on tribal knowledge. It's very easy to say my coworker wrote this subsystem and only he knows how to handle that kind of failure. All the knowledge people carry around should live in runbooks, not in gray cells.

So now, as I said, if you agree with me that failures will happen in production systems, let's talk about how we build resilient, highly available systems. These are some of the things I have seen be very helpful.

The first one, which I like a lot, is reactive load balancers. If you have a load balancer like Nginx or HAProxy doing round robin or shuffling, that's not enough; it's not going to get you through a Christmas Eve surge in traffic. One day that routing logic is not going to be capable of seeing you through a bad night. One thing I think is very useful is to score nodes based on latencies: how good a server is should not be determined by its index in a list. We should always measure how a server is doing, from the load balancer's perspective, based on latencies. But that also does not mean that if some server is doing really well, we move all our traffic there.
Because then you are going to see a moving hotspot in your cluster: some servers do well, you move more traffic to them, then those servers stop doing well and other servers start looking better. So what you do is shuffle the servers first, and after shuffling, apply the latency-based logic: cascade multiple load-balancing algorithms and apply the combined result in the load balancer.

Then, parallel requests. There is a good talk by Jeff Dean of Google about how Google tames tail latency for end users. What it basically means is that when a client or load balancer makes a request, it doesn't ask one backend server for the response, it asks two. If one of the two servers is, for example, doing a GC, at least the other server will respond, and the 99th-percentile latency becomes really smooth. And once you have made parallel requests, you can cancel the one you no longer need.

Next comes request cancellation. If a server is taking too long, something is wrong with it. Cancel the request after you have spent, say, 200 or 300 milliseconds, whatever your threshold is. If you keep waiting, you're occupying threads on your system to talk to a server that is running slow, and you are going to get slowed down because of that slow server. So request cancellation is really important, and I don't see a lot of load balancers implementing it. We had implemented request cancellation at Netflix; I don't know whether Twitter's Finagle does it, but Stubby from Google, which is not open source, has request cancellation as well.

I talked quite a bit about threads getting busy and bad things happening on a server. What I have seen work really well is not using threads at all; instead, use non-blocking I/O. On Linux you can use epoll, on BSD you can use kqueue. Servers like Nginx are already non-blocking. And don't use a naive request-response mechanism for backend calls: when you're making a lot of requests at scale, the kernel creates and tears down a lot of TCP connections, and you see a lot of churn in the Linux networking stack. What works really well is to create a connection to your backend servers and never close it, and use a streaming protocol on top. If you're using Java, you could use RxJava, for example, and if you're using Scala, you can use Akka, which also does streaming really well. With any of those platforms you can leverage streaming protocols and spare the kernel a lot of churn. I've seen that work really well.
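Here is a minimal Go sketch of the parallel-requests-plus-cancellation idea: fire the same read at two replicas, take the first good answer, and let context cancellation tear down the straggler. The 300 ms budget and the two-replica fan-out are illustrative assumptions, not Ribbon's or Stubby's actual behavior.

```go
package hedge

import (
	"context"
	"errors"
	"io"
	"net/http"
	"time"
)

// Get sends the same GET to two replicas and returns the first good
// answer; once a winner lands, the shared context is cancelled, which
// tears down the slower request. If one replica is stuck in a GC pause,
// the other still answers, which smooths the 99th-percentile latency.
func Get(ctx context.Context, client *http.Client, urls [2]string) ([]byte, error) {
	ctx, cancel := context.WithTimeout(ctx, 300*time.Millisecond)
	defer cancel() // aggressive timeout doubles as request cancellation

	type result struct {
		body []byte
		err  error
	}
	ch := make(chan result, len(urls))
	for _, u := range urls {
		go func(u string) {
			req, err := http.NewRequestWithContext(ctx, http.MethodGet, u, nil)
			if err != nil {
				ch <- result{nil, err}
				return
			}
			resp, err := client.Do(req)
			if err != nil {
				ch <- result{nil, err}
				return
			}
			defer resp.Body.Close()
			b, err := io.ReadAll(resp.Body) // read fully before cancelling
			ch <- result{b, err}
		}(u)
	}

	lastErr := errors.New("no replicas answered")
	for range urls {
		if r := <-ch; r.err == nil {
			return r.body, nil // winner; deferred cancel stops the loser
		} else {
			lastErr = r.err
		}
	}
	return nil, lastErr
}
```

The subtle detail is reading the winner's body before the shared context is cancelled; cancelling first would kill the winning response along with the loser.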
Circuit breakers. As I was saying before, every point in an application that reaches out over the network, or into a client library that might go out of the bounds of the process, is a source of potential failure. Worse than outright failures, these calls can also result in increased latencies between services. And these issues are exacerbated when network access is performed through a third-party client. If you have a Cassandra client, the DataStax Java driver, or a MySQL client, you don't know how that library talks over the network. What if it doesn't implement timeouts well? What if it doesn't do request cancellation? What if it doesn't do streaming?

What I've seen work really well is to put a circuit breaker on top of the client library. Network connections will fail and degrade, services and servers will become slow, and new library changes will always happen: even if a client works well today, you don't know what the developers of that third-party client will do in the next version. So always protect your calls over the network with a circuit breaker. If you're using Java, there is Hystrix; if you're using some other language, building something like Hystrix is not too hard. Basically: use fallbacks, do bulkheading, and use aggressive timeouts. If something is taking too long, instead of waiting for the response, serve a fallback. Bulkheading means limiting concurrency: if you're on Java using a thread per request, the number of threads grows with the number of requests and the server gets busy; with bulkheading, you have a fixed pool that makes the network calls through the client libraries, so a threaded server is protected from those issues. And timeouts I've already covered: if something is too slow, there is no point in trying anymore, you just give up.

Dynamic cluster schedulers. I work on one, so this has to be in the slides. Use a dynamic scheduler like Nomad, Kubernetes, or Mesos to restart services when they fail. Use a cluster scheduler for faster deployment, for utilizing all of your resources, and for better QoS. Recently we did a test on Nomad, the dynamic scheduler I work on, and we ran a million containers in under five minutes on Google Compute Engine. That's the kind of speed we are talking about. Compared to a traditional configuration-management-based system, cluster schedulers drastically improve the speed at which you can deploy or roll back.

Service discovery systems like Consul are the fabric of your data center. They maintain the registry of all the services, and they should be dynamic. There is no point in static service discovery, which is basically IPs in an HAProxy configuration. Instead of IPs in a configuration like that, use a modern service discovery tool that can add and remove services from the load balancers really fast. And the service discovery system itself can fail: we have seen a service discovery system like Eureka go down and take a lot of our traffic down with it.
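Going back to the fallback-bulkhead-timeout combination for a second, here is a rough Go sketch in the spirit of Hystrix; it is not Hystrix's actual API, and the Guard type, ErrRejected, and the limits are all invented for illustration.

```go
package guard

import (
	"context"
	"errors"
	"time"
)

var ErrRejected = errors.New("bulkhead full: call rejected")

// Guard wraps calls to one dependency (say, a database client) with a
// bulkhead and an aggressive timeout. The bulkhead caps how many calls
// may be in flight at once, so one slow dependency cannot eat every
// thread or goroutine in the process.
type Guard struct {
	sem     chan struct{}
	timeout time.Duration
}

func New(maxConcurrent int, timeout time.Duration) *Guard {
	return &Guard{sem: make(chan struct{}, maxConcurrent), timeout: timeout}
}

// Do runs call; on rejection, timeout, or error it serves the fallback
// (stale data, a default) instead of letting the failure cascade.
func (g *Guard) Do(ctx context.Context,
	call func(context.Context) (string, error),
	fallback func(error) string) string {

	select {
	case g.sem <- struct{}{}:
		defer func() { <-g.sem }()
	default:
		return fallback(ErrRejected) // shed load instead of queueing
	}

	ctx, cancel := context.WithTimeout(ctx, g.timeout)
	defer cancel()

	type result struct {
		v   string
		err error
	}
	ch := make(chan result, 1)
	go func() {
		v, err := call(ctx)
		ch <- result{v, err}
	}()
	select {
	case r := <-ch:
		if r.err != nil {
			return fallback(r.err)
		}
		return r.v
	case <-ctx.Done():
		return fallback(ctx.Err()) // too slow: give up, don't keep waiting
	}
}
```

A production version would also track failure rates and open the circuit entirely for a while, and would release the bulkhead slot only after the call goroutine actually finishes; this sketch keeps just the isolation idea.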
Coming back to service discovery: these systems have to be used in a way that lets you withstand their failures. If you're using Consul, which is a CP system, use stale queries. A stale query means that, since a consistent system cannot make progress under certain partitions, when there is a partition you ask Consul: I don't need the current state of the world, tell me what you knew five minutes ago. Out of the five servers it might tell me about, I may be able to use some of them even without knowing the whole state of the cluster.

Reactive scaling: scale out in reaction to latencies and QPS. In traditional data centers people over-provision all the time; on the public cloud, where things are expensive in general, you need to be scaling out and scaling in based on traffic. Reactive scaling means scaling out when QPS goes high, and it also means throttling when there is a thundering herd. As I said before, 70% of users using the service is better than no one using the service.

Lastly, immutable infrastructure. As I said, don't give servers names; servers should be treated as clusters, not as something you have an emotional attachment to. Infrastructure should be considered disposable: anything can fail, and we should be ready for everything in our fleet to fail, and to recover from there.

And the last thing, which I tell myself: hope is not a strategy. Nothing is going to magically work. Things are going to fail, and when they fail we need to know how to get back on our feet, keep the servers running, and keep the services going. So thanks.

These are some of the readings I recommend. "Notes on Distributed Systems for Young Bloods": if you are learning about complex distributed systems and their failures, nothing beats reading up on the principles of distributed systems. I didn't get into CP systems versus AP systems, but those are things I think we should know about as infrastructure people. I also like Drift into Failure by Sidney Dekker a lot; it talks about complexity, about how things fail when they are used together. And there is the paper by Richard Cook, "How Complex Systems Fail", which I quite like too. So thanks again, and let's open up for Q&A now.

Q: Hi, thanks for a great talk. The question I wanted to ask is: can you share any of the horror stories? When you were trying to do these experiments, how badly did they go?

A: So here's the thing. The most common failure was when we were trying to run a Chaos Kong; it used to happen to us over and over again, and eventually we developed a tool for it, but this is what used to happen. We would try to evacuate, say, us-east-1, to see how things work when all our traffic is in us-west-2. But then there would be one team that didn't know we were running that experiment, and they would start a deployment.
And that deployment would mean something had changed in the configuration; it would break our experiment and we would have an outage.

The other thing is that even on Amazon, servers don't fall from the sky; it's someone's data center you are running on. So when you drain traffic out of one region and scale another region up by 2x or 3x or 10x, sometimes your cloud provider might not even have that many machines. What happens then is that you have moved all your traffic, and there are no machines.

And automation failures, for example in your persistent storage: you shoot down a node, you think it's all cool, and then it's not cool, because your automation system couldn't handle it. Then you see services going down, cache miss rates going up, and for 15 or 20 minutes, if you're a good team, you have a degraded experience. I can go on and on about outages; all of this has, of course, caused major outages. But the good thing is that since we had the culture of assuming systems are going to fail, we built automation around moving traffic. So when something failed, it was easy for us to say: hey, let's not have anyone from New York or New Jersey pointing at this data center; let's move everyone from the East Coast to the West Coast, because the East Coast is where most people live.

Q: How do you also deal with confirmation bias? You're testing for the failures that you know about. Isn't it a common thing that you miss something non-obvious, which might cause a bigger issue?

A: Yeah, totally. We would see, for example, one of the services that we thought was really stable, with no clue that it was going to cause an outage, and we would run these experiments and watch that exact thing take us down. Also, if you're in a culture where you're shooting down machines without talking to anyone, you'll find that some developer who joined your organization recently doesn't know that nodes might go down; he might come from a traditional data center world where people babysit servers, and he might write code that doesn't handle node failure. So we have seen all sorts of issues there, but it actually helps in the long run.

Q: My question is about the Chaos Kong you mentioned. Can you tell me how it's implemented? Do you actually shut down all the nodes, even, say, the DB instances in us-east-1?

A: Do we allow Chaos Kong to shoot all our nodes? No, there are blacklisted clusters. There are things we know are not going to survive; what is the point of shooting them down?

Q: But the case for Chaos Kong is that the region goes down, right? So in that case even the DB instances won't survive.

A: So here is the thing. First of all, when I write systems, I don't write systems that can live in only one region. If I'm on Amazon, I would use DynamoDB.
If I want to own my own availability, I would use Cassandra, for example. I would not use anything that doesn't replicate from one region to another. But if I were an SRE handed a service that cannot withstand the move, then we say to ourselves: we know this is not something that can survive a whole region failing, and we blacklist it, and hopefully next quarter or so we can invest in making it able to move regions. It's very common; you have to live with it.

Q: If you're starting fresh, say you're a small startup setting up on AWS with an RDS, the basic thing you can do is set up multi-AZ, right? So if one AZ goes down, you are still okay.

A: Yeah, exactly. And if you're using RDS, for example, you also need to be pushing your backups somewhere like S3, or even somewhere outside AWS, so that if everything goes wrong you can restore your databases in another region.

Q: Thank you. Great talk, Diptanu. You gave the example of 300 microservices and requests fanning out. How do you gain confidence in your release process?

A: There is just no release process. You release all the time, you keep monitoring all the time, and you need a solid rollback strategy. There is nothing like integration testing and waiting for something to be proven to work really well. If you have written decent software, just release it, and release it to 1% of the users: do a canary, don't release to the whole world, and see what happens. 1% of 60 million people is fine; the sky is not going to fall.

Q: So essentially you're recommending testing in production?

A: Well, by testing, I mean monitoring and rollback.

Q: In terms of the Simian Army and so on, how do you test how those things function? What's the testing for your failure-generation systems?

A: Good question. There is no very solid testing story there. What we basically do is blacklist everything, then whitelist only certain things and see how that works. We also build things like dry runs; some of the monkeys have a dry-run mode, for example. And you don't run it in production on day zero: you run it in your test or staging environment, gain confidence, and then move to production.

Q: I had a question; great talk. A lot of these principles work really well for systems that scale a lot and have to be up all the time, but it all costs money. When do you make the decision to do this all the time? Because just getting this set up and running is itself money.

A: Right, 100%, and that comes up very often: when do we become like Google or Netflix or Facebook? That's pretty much the question. I think you do it right from the get-go.
If you have that infrastructure in place, if you have the skeleton in place, it's easy to make incremental improvements: don't use a load balancer without decent load-balancing strategies; use a software load balancer; use service discovery. Keep the framework in place so that improving your infrastructure incrementally stays easy. That's the most pragmatic way to go. And again, I work at a startup too; there is always a race between shipping features and strengthening infrastructure. You will come to that place if the business is doing well: you will get to concentrate on keeping the service up. A lot of these discussions about how much money we are going to spend go away when the business does well.

Q: Since you mentioned service discovery, can you give some key points we should keep in mind while designing for it?

A: Service discovery systems are very complex; you need to understand what your service discovery system is doing. For example, Twitter uses ZooKeeper, with Twitter Commons and ServerSets to write data into ZooKeeper and then figure out where things are, and you need to know that ZooKeeper is a CP system. Consul is also CP; it's a similar thing going on. Consul uses Raft, and Serf, which is a variation of SWIM, to disseminate information in the cluster. You need to know what is going on under the hood of the service discovery system to know how that system is going to fail, and then you query it accordingly. For example, if I'm running ZooKeeper, I make sure I have at least ZooKeeper 3.5 in production, so that I have observer nodes and such, and even if there is quorum loss, ZooKeeper can still serve me read requests. Same with Consul: if I'm doing queries against Consul, I would use stale queries, keep the server cluster small, maybe five or seven nodes, and connect clusters over the WAN. If you're using Eureka, which is what we used at Netflix back in the day, it is an AP system; it doesn't have the same failure modes as ZooKeeper and Consul, but there are limits on how many servers can connect to it, so you size your clusters accordingly, something like 50 to 60 thousand servers; beyond that, you need to deploy multiple Eureka clusters. Does that answer your question? Cool.

Q: How do you think Terraform will help in this monkey testing?

A: So, as I was saying, one of the slides on building resilient systems said to use immutable infrastructure. When you use immutable infrastructure, you need something to build those machines and bring up your cluster.
So, for example, when something goes down, you can use Terraform to make sure you come back to the same number of servers. You can create your clusters using Terraform, and those Terraform configs work across clouds. So Terraform and Packer mostly help around immutability; Packer is what you use for building AMIs.

Host: Okay, I know a lot of you still have a lot of queries; please take them offline with Diptanu. A big thank you to Diptanu for flying down amid his busy schedule to talk here. Please fill out the feedback forms.