reliability of distributed systems. So, what I want to cover in this talk today — first, how many of you have this question: what is DevOps versus SRE? Anybody else? Okay, everybody's pretty clear about the difference. So I'm going to talk mostly about distributed systems and the failure points you see every time you use something. Whenever you make a product, every product is going to hit scale issues — either it didn't live long enough to hit them, or it just dies off. Primarily, I'm going to take a sample application and walk through what people usually see, and what I see, when I look at distributed systems. I'm going to take very simple pieces and evaluate how they are going to fail, what exactly is failing, and how to go about fixing it. So, a few things I want to focus on. First, customer empathy. This is not a product talk, but at the end of the day, everything we do has to do with our customers; there has to be empathy towards the customer. The customer in this case is a product which has been deployed, and we are building reliability into it. The second one is a Hindi word, for those who know it. What I typically mean by it is: a lot of people come and tell me, "Hey, give me an answer on what works." And my usual answer is, I can't tell you. I can give you the options to pick from, but I can never tell you what's going to work, because there are going to be trade-offs. There's a cost to everything you do. You may pick one solution over another; it may work for you in one case and not in the second. Lastly, architectures always adapt to the dollar priorities of your business. Whatever we do as engineering always has to align with business, and a dollar value has to come out of it — at the risk of sounding like a capitalist here. So, this is the sample product I'm going to talk about. Let's say we're building a simple application where the user sends an SMS saying, "remind me about XYZ at Y PM — just remind me, call me back." The sample flow in the product: the user sends an SMS; there's a service which receives the SMS; there's a cron job running which figures out, hey, it's time to call someone, and we make the call. Fairly straightforward, as simple as we can make it. I'm going to split it into two parts: first, the part that receives the SMS; second, the part that sends out the call. The first part looks something like this. I'm not endorsing Twilio here, but let's just say you decide to use Twilio for receiving the SMS. The application flow: the user sends an SMS to Twilio, Twilio makes an HTTP callback to your service, and your response sends out a 200 saying we acknowledged the message, we received it. Could this get any simpler? Does anybody want to suggest that this could become simpler? (If you want to see it in code, a minimal sketch of that receiver follows.)
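To make this concrete, here is a minimal sketch of such a webhook receiver. The form field names follow the general shape of a Twilio-style form-encoded callback, but treat the fields, path, and port as illustrative assumptions, not a documented contract:

```go
package main

import (
	"log"
	"net/http"
)

// handleSMS acknowledges an inbound SMS callback with a 200.
// "From" and "Body" are assumed form fields for illustration.
func handleSMS(w http.ResponseWriter, r *http.Request) {
	if err := r.ParseForm(); err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	from := r.FormValue("From")
	body := r.FormValue("Body")
	log.Printf("received SMS from %s: %q", from, body)
	// Acknowledge receipt; the actual reminder scheduling
	// happens elsewhere (the cron job in the diagram).
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/sms", handleSMS)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```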
Now, when you look at this, I'm going to take a quote from Leslie Lamport — the classic definition of a distributed system: "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." This was written a long time back, and I don't know how he came up with so much clarity, but it almost sums up what a distributed system is. There are usually four flavors of failure. CPU is going to max out — that's not a failure, right? Like, last time a CPU hit 100%, did any of your applications fail? Probably not. CPU doesn't fail, unless the processor itself goes bad. Memory fills up: you either swap or you go out of memory. Disks, apparently, simply fail — why, there's often no good answer. I'm going to talk precisely and only about the network. So what you can probably tell is that I'm narrowing the scope down to almost one single point of failure in the entire application architecture, and I'm going to talk for 45 minutes about it. A lot can fail in just the network. There are the well-known fallacies of the network that we talk about. "The network is reliable" — it's actually not; how many of you experienced the Wi-Fi not working yesterday? "Latency is zero" — what that means is that we happily, as engineers, spin up VMs and instances on the cloud and assume that since they're in the same VPC, there's absolutely going to be zero latency between them. "The network is homogeneous" — whatever amount of testing you do on your local network, and I can bet my house (which I don't own) on this, the next network you deploy on is going to fail in manners you have not seen. You can put in all the fancy words here — chaos engineering, incident drills, et cetera — but it is going to behave in a manner you've not seen. "Network cost is zero" — when you sign up for AWS and get a bill, you're usually interested in how much the CPUs cost, the cores, the RAM, the disks, et cetera. Nobody talks about the network, because we only see cost as a bill we pay, not as an investment we make, not as failure downtime we incur. There is a cost to the network as well. Now, when I look back at this picture — and I've given you more failure points to munch on — what do you see failing in this diagram? Simple flow: a request comes in, a response goes out. What goes on behind making something as simple as this work? If this were a college project of mine, I probably wouldn't care. But if I'm going to run my business on this, just these two lines can bring a massive amount of doubt and shame with them, or probably loss of revenue. Let's look at it again. When we say something has to be reliable, what exactly do we mean by reliability? Reliability is a very broad term; we've got to break it down into multiple factors. First, do we mean that at least one server is going to be online at all points of time? Are all servers going to stay below 100% consumption? Are all servers responding within x milliseconds of latency — there's a degradation factor to it as well, there's performance? Or do I mean all of the above? The fact that I'm able to split this into four aspects — spare a thought for this: effort is going to have to be put towards each one of these points individually. Let's take the first failure case: the server is unavailable. What does that look like? Like this — there's nothing. And nothing is really not a good state to be in. So how do we make this reliable? A request came in — I'm talking about just that first arrow. You had one instance; now you attach another instance. How many of you think the answer should have been load balancing here, and that this is more complex than that?
Who would have just used a load balancer straight away, assuming it's available in the cloud? All right, good. Before we get to the concept of load balancing, I want to go down to the basics: how is a load balancer made? Before that, there are going to be two nodes: a master and a slave. Why? Because I'm trying to solve the problem of unavailability, where one server goes down and my business goes down with it. To solve it, I need another server, but there's going to be a change in the IP address. How do we solve that? This introduces a very small concept called a VIP, a virtual IP. There's a project called Corosync, which was very popular, and the other one — Heartbeat — and with these you can attach a virtual IP. Who understands VIPs? Want me to explain further? Anybody, raise your hand and I'll explain it further. So: you attach a virtual IP address to either one of the two nodes. We have now solved the problem of "at least one server should be on" — one goes down, the other is available. Let's look at the next problem. A simple flow looks like this again: a request comes in, hits the VIP, which is attached to one server of the two; the request goes there, a response comes back. Happy. Now, let's say the load increases. This setup is not the answer to the load problem. The load problem requires load balancing, not availability. You can have a spare wheel, a stepney; it does not balance the load for you. Load balancing is a different piece altogether. When we say load balancing, we're saying there's a balancer in there — I'll go down to the constructs of it — which is going to divide the load among two servers. What's interesting is those two words: load, and balance. To measure load, there has to be a concept of measurability. How do you define load? There has to be a component in your system constantly measuring the stress on a given server. Then there's the word "balance" next to it. Ideally, we'd expect everything to be half and half, 50-50, because we all believe in fairness; but we all know the world is not fair and just, so we probably cannot guarantee that either. So the balancer suddenly looks as complicated as this: a request comes in from Twilio, and you have a balancer — but a balancer can go down as well now. Think about it. Now you need two balancers. This is cascading: you realize that two simple arrows are becoming complex, but this is the need, because I'm making the service more and more reliable, I have to address everything, and this goes on and on. Now there are two balancers — how do you balance between them? One way is, again, we introduce a VIP: there's balancer A, there's balancer B, you attach a VIP between them and it balances. Another way would be to have two balancers with two different IP addresses and do the resolution via DNS. That could be an answer as well. So the next time you look at a load balancer, I want you to ask this question: what is the architecture underneath? That will give you the answers. For example, if you go to GCP and create a load balancer, it gives you one single IP address. If I go back — which architecture would that be? The one at the top, right? When I look at AWS, it gives me multiple IP addresses. Which architecture would that be? The one at the bottom, right? (You can see the DNS flavor for yourself with a quick lookup, as in the sketch below.)
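A quick way to observe the DNS-based flavor in practice — a minimal sketch; the hostname is just an example, substitute any DNS-load-balanced service you use:

```go
package main

import (
	"fmt"
	"log"
	"net"
)

func main() {
	// An AWS-style load balancer typically resolves to several
	// A records; the client picks one (usually the first).
	addrs, err := net.LookupHost("example.com")
	if err != nil {
		log.Fatal(err)
	}
	for _, a := range addrs {
		fmt.Println(a)
	}
}
```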
You go to a data center on-prem, it's your choice — and, I mean, if anybody cares, we can discuss that later. So what a single IP address presents you with is the standard dilemma; there's an element of trade-off in everything. The slide says: durable, available, economical. In Hindi, as we say it: sasta, sundar, tikau — teen mein se do hi milenge. Cheap, good, durable: just pick any two of the three. What we're saying is, look at the cost of this. Because I made my service available and load-balanced, I needed more infrastructure, which means it's definitely not going to be economical; I need to add more and more. Most of this is abstracted underneath the cloud constructs you create. When you look at AWS and say, okay, I'll use an NLB, an ALB — an application load balancer — or a classic load balancer, we almost never stop to think about what must be going on underneath. It's important to understand what we're using in order to understand the failures behind it. So, coming back to it: at all points of time, you can probably only choose two. Now, what does an architecture look like that is both available and load-balanced? Here's an interesting bit — and this might look over-complicated, but follow me for a minute and you'll get it. Say you have two application servers, using both load balancing and availability, and each of them is getting 60% traffic. If one of them goes down, its traffic is automatically diverted to the other node: 60% plus 60% makes 120%, which won't fit on a single node. What that means is that at any point of time you require three nodes, not two, to be truly available and load-balanced; and at all points of time you have to ensure that the steady-state load does not go beyond 50% on any node, so that after a failure it still fits. Only then are you truly both available and replicated. Everybody gets that, right? So a typical available-and-load-balanced setup looks like a combination of three servers — which is why we usually say: have three. Where does that magic number three come from? Most of the time we say, oh, it's a cluster, we've got to have three, but we don't stop to think why three. There's a reason it's three. You can have five as well, or seven, et cetera, but that's a different topic for later. Now, going back to load balancing — this in itself is a pretty interesting concept. How do we balance the load? What is the definition of balancing? Ideally we'd say that with three instances, requests go one-in-three to each; we want it to be fair, 33% each. But how do we mathematically achieve that? How many of you know the Monty Hall problem? It was a game show: there are three doors, and behind one of them is a car. You have to guess which door has the car; behind the other two doors there's a goat (it looks like a donkey on the slide — it's a goat). After your first guess, the host, who knows which door has what, opens one of the other doors to throw you off and asks: do you want to switch? And there was this columnist — not a mathematician, actually — Marilyn vos Savant, who argued that if you flip your choice when the host asks you to guess again, your chances of winning are higher. And for years, almost nobody believed her. (If you don't believe it either, the little simulation below settles it.)
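The claim is easy to check empirically. A minimal simulation sketch — switching wins about two-thirds of the time, staying about one-third:

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const trials = 100000
	stayWins, switchWins := 0, 0
	for i := 0; i < trials; i++ {
		car := rand.Intn(3)  // door hiding the car
		pick := rand.Intn(3) // contestant's first pick
		// Host opens a door that is neither the pick nor the car.
		host := 0
		for host == pick || host == car {
			host++
		}
		// Switching means taking the one remaining door
		// (door indices 0+1+2 sum to 3).
		switched := 3 - pick - host
		if pick == car {
			stayWins++
		}
		if switched == car {
			switchWins++
		}
	}
	fmt.Printf("stay:   %.1f%%\n", 100*float64(stayWins)/trials)
	fmt.Printf("switch: %.1f%%\n", 100*float64(switchWins)/trials)
}
```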
Much of what I'm going to say in this talk similarly goes against common sense, and many of you are not going to believe it at first; the Monty Hall problem is a classic example of that. Now, coming back to load balancing: what types of load balancing should we care to think about? This is where I start introducing tools, because the next time you see a tool, I want you to understand what goes on behind it and what problem it is solving. How many of you are aware of a tool called Fabio? Fabio is just an example — I picked a hipster tool; you could use others as well — but it's the kind where you look at it and go: what even is this? So look at Fabio: what it does is server-side load balancing. When a request comes in, it hits a server which is our balancer — which, inside, has the two constructs we talked about — and is there to distribute the load among the application servers. That's server-side load balancing. The second kind is lookaside load balancing, which is typically what DNS is. You must have used Route 53, where there's a pretty dropdown that says weighted, latency-based, geographically distributed, et cetera — many options. What you're doing there is consulting another agent, which is responsible for holding the definition of the load. The DNS resolver returns multiple addresses, and your application, most of the time, picks the first one. Consul is another example of this: you register health checks, it tracks the load, you consult it — "tell me the IP address" — and it gives you a best-fit node (Consul does round-robin — random, actually) and you pick one of those. Over here, in the same architecture, what happens is: Twilio wants to call your service, so Twilio hits a DNS name behind which your servers sit; the lookaside agent picks one of the IP addresses, because it knows what the load is, and hands it over. Twilio doesn't have to do anything; this all happens under the hood. Now there's a third concept: client-side load balancing. Netflix was one of the pioneers of this approach. What we're saying here is: client, take all of the server's IP addresses, and decide for yourself, at any point of time, which algorithm you want to use. This is getting fairly popular these days, especially for internal applications — service mesh, et cetera; we'll get to the words "service mesh" later — because you don't want to add the burden of load balancing on the server side. You tell the client: whenever you've decided whom you want to hit, make the choice and hit that server. You're truly trying to randomize the selection, because a typical random selection would probably be the fairest thing, and it doesn't add extra load on the server.
Because, if you go back, there are health checks with lookaside load balancing as well, which are going to artificially amplify the load on your application. Dropbox used to do this a long time back: in any stable state of the system, they would artificially bombard it with requests so that it was almost touching 80% load, and when real load actually crossed 90%, they would start killing the artificial load. What they had effectively done is give themselves a buffer to react. Health checks will do pretty much the same thing to you, and the definition of a health check is also pretty important here — I haven't covered it in the slides, but we should talk about it. There are different kinds of health checks. One only checks at the surface: is the IP reachable? But that doesn't tell you whether the service is up. You need a deeper health check — a request which actually goes and makes, say, a dummy transaction in your database and comes back and tells you that everything is working as it should — because it's possible that the load balancer is working but the server behind it is not. So you need a very deep, inclusive health check as well; and the more such health checks you run, the more load on your server. Now, how does client-side load balancing look in practice? The moment we say, "oh, client-side load balancing looks pretty cool, let's use it," there are two things to consider. One: in the Twilio example, we probably could not use it, because I can't tell Twilio what code changes to make on their client side. Each of these approaches has an implementation challenge. While you could have implemented server-side or lookaside load balancing — because those sit under the hood of the DNS — client-side load balancing would probably not be an option available to you there. But if you were the one calling someone else, now you know there's a third option. There's a catch, though. If the set of IPs returned for a service stayed static forever, how would you refresh it? It's possible a server died and a new IP address is available; how do you pick that up? So here's the typical flow — once you have access to my slides, you can click through to these projects on GitHub. Ribbon and Curator are two libraries created by Netflix which do exactly this. Let's look at the timing diagram. The service starts up, it boots, and it registers itself into a discovery service — you would have heard of "service discovery"; this is where it's born. You register with the discovery service, and the discovery service is then responsible for pinging your service every now and then to check whether or not you're alive. And remember the definition of this ping: I don't mean ICMP here; I mean "ping" as a verb, checking whether the thing is reachable — something very similar. The health check could be anything — a deep health check, or just a TCP connect check, depending on how much comfort and luxury you have for making such calls. It returns an OK; once it returns an OK, you commission that IP address as valid. And if it fails, you pop that IP out. (A minimal sketch of the client side of this follows.)
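To make the client side concrete, here's a minimal sketch — not Ribbon itself. It assumes a hypothetical discovery endpoint that returns a JSON list of addresses, refreshes the list periodically, and picks a random member per request:

```go
package main

import (
	"encoding/json"
	"fmt"
	"math/rand"
	"net/http"
	"sync"
	"time"
)

// discoveryURL is a made-up endpoint returning ["10.0.0.1:8080", ...].
const discoveryURL = "http://discovery.internal/v1/service/reminder"

type balancer struct {
	mu    sync.RWMutex
	addrs []string
}

// refresh re-reads the live address list from the discovery service,
// so IPs that were popped out (failed health checks) disappear.
func (b *balancer) refresh() error {
	resp, err := http.Get(discoveryURL)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	var addrs []string
	if err := json.NewDecoder(resp.Body).Decode(&addrs); err != nil {
		return err
	}
	b.mu.Lock()
	b.addrs = addrs
	b.mu.Unlock()
	return nil
}

// pick chooses a random address — the simple "fairest" policy
// that keeps no extra state on the server side.
func (b *balancer) pick() (string, bool) {
	b.mu.RLock()
	defer b.mu.RUnlock()
	if len(b.addrs) == 0 {
		return "", false
	}
	return b.addrs[rand.Intn(len(b.addrs))], true
}

func main() {
	b := &balancer{}
	go func() { // keep the view of the world fresh
		for range time.Tick(10 * time.Second) {
			_ = b.refresh()
		}
	}()
	_ = b.refresh()
	if addr, ok := b.pick(); ok {
		fmt.Println("calling", addr)
	}
}
```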
So the next time a client asks, "where is this service?", the discovery agent in the middle returns a list of IP addresses. This leads to a typical confusion: how is this different from lookaside load balancing? The difference is that in lookaside load balancing you get one IP address — the lookaside agent, whatever it is, is responsible for choosing the best server for you. In this case, you make that call: you get the IP addresses, you decide which one to hit, and you make the call. This leads to the question of what algorithm you should use. Round robin? A random selector? Least frequently used, most frequently used? This is a can of worms again. Each of these algorithms is going to be challenging to implement, because if you want anything beyond random, you need state — you must have a notion of what was called earlier, so there has to be a global registry. Globals are bad; we all know it; we'll discuss that as well. But this starts introducing those challenges. So what we say is: we'll pick a random selector. Many times we've said, okay, pick round robin or random. Now, what's the problem with that? How many of you follow cricket here? When you toss a coin, it's never going to be homogeneous, right? It's never the case that heads and tails come up an even number of times. This is a distribution of coin tosses — not all coin tosses would manifest into this exact graph, but they may. The fact is, it's all probabilistic. If you look here, an exactly equal split of heads and tails happens only a few times. Over the entirety, yes, the probability of tails is 50%, same as heads; but in a given window of time it is never going to be 50%. It's entirely possible that you get a series of heads and a series of tails. So any algorithm you choose as a random one is actually not truly uniform over short windows, and what it may result in is something like this, where you start sending too many requests to one side. Imagine this case: heads means server zero, tails means server one, and I dispatch my requests accordingly. It's possible that most of the requests land on one server and not the other. This introduces another concept in load balancers: load shedding. When you see service degradation happening on one of the servers, it sends back a back-off signal, a feedback, saying no more. Filebeat does this, by the way, if anybody's interested in reading that protocol — Filebeat and Logstash do this quite well. It sends a back-off; no more requests are dispatched to it, so they get routed to another server. What you're doing is refusing to dispatch more requests to a server that is under stress. So, going back to that same diagram: Twilio comes in, it goes to a load balancer which hits your service. This is what's going on under the hood, and this is all the stuff that can fail, which we take for granted at times. There's a very good talk by the CTO of Fastly who says that load balancing is almost impossible — don't do it. And that would be my suggestion as well: take what is there off the shelf, invest in good monitoring, and let it be. Don't touch it. Now, if this keeps failing, are we saying we can never build a reliable system? Well, it leads to an alternative form of reliability: it doesn't really have to be synchronous. We could go asynchronous.
(This slide is buggy — it takes two clicks.) So, asynchronous architectures: you can never tell when the work is going to finish; it may finish in the background. If I remodel the current architecture around this asynchronous concept, this is where all the queues, buffers, everything you've read on the internet is going to apply. You may have a balancer which sends to a queue, which returns an acknowledgement: fine, I've registered your request, I will remind you, and I'll do it at my own pace. The interesting bit, by the way, is that the real world actually is asynchronous. While I'm talking to you, you're paying attention to your phone, and you're sleeping, and you're doing x number of other things. The real world is not request-response. So now let me look at the outbound part of my application. The first was the inbound part — those were the first two arrows. I'm now going to focus on the other set of two arrows: your service calls Twilio saying, please call this person, it's time. Now that you've learned how many ways the inbound flow could fail, we'll spend time on how many ways this outbound flow can fail. These are the decisions where we start thinking of product. Does anybody want to guess how many ways this can fail? (There's a T-shirt in it later.) Okay, everybody is very close. Scope of failure: let's say there's an outbound call to be made right now, and you're on a private IP address somewhere; there's a NAT through which that request flows out — I'm assuming you're not using a public box for all of this — and the NAT is not able to resolve the DNS, and you're not able to call Twilio. It says: address not reachable. The question is, what do we do next? These are the decisions that define the usability of a product. Should I not call Twilio now? Should I call Twilio again? How should I do this? All of us have played games as kids, right? We've all seen this thing of failure and retry. This is where we introduce retries into our architecture. A standard retry handles a transient failure: every time you see that on an application, you press the retry button, because there was probably a small blip in the network at that point of time; it may not be a long-lived failure. It happens twice, you press retry twice; the third time it works, and you say, okay, fine, I don't know why it worked, but it worked. What we're saying here is that there is an immediate attempt to retry — retry once, retry twice, retry thrice — which may give you a guarantee across something that was failing intermittently. But there are failures which are not going to resolve themselves within a few seconds. There may be a possibility that at the back end, the service you're calling is in the middle of having an alternate server brought up, right? If you retry five or six times within a span of a few seconds, you may not get a response; it's not guaranteed. This is where we introduce the concept of exponential backoff. What we say is: try once after one second; the next time, wait two seconds; the third time, four seconds; then eight, then sixteen. You may pick any curve you want; you're just trying to beat the odds. There is no mathematical formula for the best retry schedule. It depends on trial and error, and it depends on your product: how long can you afford to wait?
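A minimal sketch of that idea, assuming a hypothetical callTwilio function; the base delay, cap, and jitter are knobs you'd tune per product:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// callTwilio is a stand-in for the real outbound API call.
func callTwilio() error {
	return errors.New("address not reachable") // simulate failure
}

// retryWithBackoff retries fn with delays 1s, 2s, 4s, 8s, ...
// plus random jitter so many clients don't retry in lockstep.
func retryWithBackoff(fn func() error, maxAttempts int) error {
	delay := time.Second
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		if attempt == maxAttempts {
			break
		}
		jitter := time.Duration(rand.Int63n(int64(delay) / 2))
		fmt.Printf("attempt %d failed (%v); sleeping %v\n",
			attempt, err, delay+jitter)
		time.Sleep(delay + jitter)
		delay *= 2 // exponential growth: 1s, 2s, 4s, 8s, 16s...
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	if err := retryWithBackoff(callTwilio, 5); err != nil {
		fmt.Println(err)
	}
}
```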
Now, third: long-lived failures. What we're saying here is that there may be failures which are not going to resolve after x retries either. For example, if I'm building an application and one of my payment gateways has gone down because US East is not reachable, it doesn't matter how many times I retry; it's not going to solve anything for me. The best way to handle this is to introduce the concept of circuit breaking. And all of this, you don't have to implement yourself — I'm not asking you to go and code this, because all of it is available as constructs out there in open source; I'm just here to help you understand which one to pick when. Hystrix is one such library by Netflix, and there are Go variations, Python variations, Java variations available. You can use those in your code and not worry about it. What circuit breaking does as a concept is: when I know something is failing, and I know that all subsequent calls are also going to fail, there is no point making those subsequent calls that are going to come in for payment. It's best to just not make them. For example — let's say I'm Ola (I hope there's nobody from Ola here) — if I'm making that application and I know the payment gateway is failing, just throw up a static banner: payment is failing; move to an alternate flow, move to a postpaid flow. Amazon does this, right? If a payment fails, they switch you to: all right, fine, we'll accept the order and take the payment later. Circuit breaking allows you to take those decisions into your product.
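Here is a minimal sketch of the circuit-breaker pattern itself — not Hystrix's actual API, just the shape of the idea: after N consecutive failures the circuit opens and calls fail fast; after a cool-off, one probe attempt is let through again.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrOpen is returned while the breaker is refusing calls.
var ErrOpen = errors.New("circuit open: failing fast")

type Breaker struct {
	mu        sync.Mutex
	failures  int
	threshold int           // consecutive failures before opening
	cooloff   time.Duration // how long to stay open
	openedAt  time.Time
	open      bool
}

func NewBreaker(threshold int, cooloff time.Duration) *Breaker {
	return &Breaker{threshold: threshold, cooloff: cooloff}
}

// Call runs fn through the breaker.
func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	if b.open {
		if time.Since(b.openedAt) < b.cooloff {
			b.mu.Unlock()
			return ErrOpen // don't even try; the backend is down
		}
		b.open = false // half-open: allow one probe through
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.threshold {
			b.open = true
			b.openedAt = time.Now()
		}
		return err
	}
	b.failures = 0 // success closes the circuit fully
	return nil
}

func main() {
	b := NewBreaker(3, 30*time.Second)
	for i := 0; i < 5; i++ {
		err := b.Call(func() error { return errors.New("gateway down") })
		fmt.Println(i, err)
	}
}
```

With something like this in front of the payment gateway, the "static banner / postpaid flow" branch is simply the code path you take when Call returns ErrOpen.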
Now, simple retries revisited. You have a service, and you call Twilio; it doesn't work once — this isn't a programming language, it's just pictorial — that's all right, you call it again; it doesn't work; you call it again. What is the downside of doing this? The downside is: what if I end up calling the customer multiple times? That's a constraint to honor as well. Imagine I signed up for this service and said, hey, wake me up in another ten minutes — and ten minutes later I got a flurry of calls, ten calls instead of one, because the service was busy retrying and did not know whether Twilio was truly able to call or not, since it had made multiple requests to it. This introduces a dilemma: what delivery semantics should we choose in a distributed system? Should we pick at-most-once delivery, exactly-once delivery, or at-least-once delivery? How many of you use Kafka here? A lot. A year, year and a half back, they came out with this claim: we have solved exactly-once delivery. Are you aware of that? You use it happily. How many of you actually use the exactly-once constructs? Raise your hands, please. Okay, nobody's lying in this crowd; I get that. So: what is at-most-once delivery? At-most-once is fine — I made a request, maybe nothing happened. You don't have to do any engineering for this. It fails? It fails. You don't have to write code to achieve at-most-once delivery; everything by default is at-most-once. Then there's at-least-once: the same retry loop we tried last time around — I'm going to try once more, one more time, one more time, until I'm satisfied or I have to give up. That is at-least-once. And what is exactly-once delivery? Mathias Verraes had a very good line here: "There are only two hard problems in distributed systems: 2. exactly-once delivery, 1. guaranteed order of messages, 2. exactly-once delivery." That should give you a hint. How do you ensure something is done once and only once? (Should I move away from the photo?) Exactly-once delivery is almost impossible. But there are ways around it. When people like Kafka and Confluent come to you and say they have solved this — is there a Confluent person here? — they're not lying. How they did it is: at-least-once delivery plus exactly-once processing. I'm going to deliver a message a hundred times; it's up to you to process it only once. I'm going to keep calling you — do this, do this, do this. This is what managers do, right? Hey, is this done? Is this done? Is this done? But you only do it once. So what are the keys to exactly-once processing? The first key is idempotency. Does everybody understand what idempotency is? Anybody, raise your hand if you don't, and I'll explain it again; it's all right, it's good to learn it here. Idempotency: doing it once or doing it n times should not matter at all. Second, atomicity — I don't mean a nuclear bomb here. All I'm saying is that the operation should be quantifiable as one single unit. Can you associate atomicity with a familiar concept? Database transactions, right? An atomic operation: I may do a hundred things, but they all happen as a single unit. Either it happens or it doesn't; there is no partial state. This is very important for exactly-once, because if I were to call Twilio, and Twilio were to guarantee, "look, you can send me a hundred messages but I'm only going to act on it once," they must have a way to guarantee that whatever they have — a distributed system, a monolith, whatever imperative code with its hundred instructions — executes successfully exactly once or not at all. They need the ability to say: all of this happens together. Lastly, the window. The window is very important here. What it means is: for something to count as "done once," in what time frame? Example: "remind me to buy milk." If I retry a message to you saying, okay, I'm going to call you to remind you to buy milk — and then I call you again tomorrow — is that a repeat request or a genuine new request? There's a window to it, so all these operations have to be capped inside a window. And another challenge is out-of-order delivery, because messages may arrive out of order. Now, looking back at the same architecture, here's the concept of a request ID — a very simple construct that we all use. On the service side, we send a small request ID along: okay, here's request ID so-and-so, process it. And on Twilio's side, I assume they must have a database of their own that atomically handles "have I processed this request ID earlier or not" — all the constructs of atomicity underneath, an SQL transaction or whatever they use. Now, the key catch: Twilio is an external service. You can't guarantee whether they use a simple construct like this or not. So what would you do? You would build a request-ID check on your side as well: if you ever got an acknowledgement for an ID, don't process it again. (A minimal sketch of that dedup idea follows.)
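A minimal sketch of request-ID deduplication with a window — in-memory for illustration; a real system would put this in a database with an atomic check-and-set and a TTL:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type Deduper struct {
	mu     sync.Mutex
	seen   map[string]time.Time
	window time.Duration // how long an ID counts as "already done"
}

func NewDeduper(window time.Duration) *Deduper {
	return &Deduper{seen: make(map[string]time.Time), window: window}
}

// ShouldProcess reports whether this request ID is new within the
// window, and records it so duplicates are rejected atomically.
func (d *Deduper) ShouldProcess(requestID string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if t, ok := d.seen[requestID]; ok && time.Since(t) < d.window {
		return false // duplicate inside the window: skip it
	}
	d.seen[requestID] = time.Now() // outside the window: treat as new
	return true
}

func main() {
	d := NewDeduper(10 * time.Minute)
	for _, id := range []string{"req-42", "req-42", "req-43"} {
		fmt.Println(id, "process?", d.ShouldProcess(id))
	}
	// req-42 true, req-42 false, req-43 true:
	// deliver at-least-once, process exactly once.
}
```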
Now, why is the window important? Because if Twilio had to maintain every request ID in an infinite set forever, while getting a million requests a second, that database would blow up, and every subsequent lookup would get degradingly slow. There is no database that gives you a consistent lookup time irrespective of cardinality; there are going to be degradations. So the simple concept of making an outbound call to Twilio ends up looking as complicated as this. Many people would stand up and say, hey, why do you need to make it so complicated? Because I'm trying to make it reliable. This is the cost of reliability. Now: we have handled failures on the inbound side, and we have also handled failures in calling out to the user. Next, I'm going to talk about failure in just that one small box inside — no network costs, just that one box. How many ways can that fail? Well, you have a service, and you have a cron job. So: what's the problem of state maintenance? Say I have a distributed system with a database, and I write y = 1 as a sample value there, and there are two processes which can both read from it; one of them does y = y + 1, and the other does y = y * 2. What is the output state? You can't tell. There are four possible states. One ordering: y + 1 happens first, giving 2, and then y * 2, which gives 4. The alternate ordering: y * 2 happens first, giving 2, and then + 1, giving 3. So 3 and 4 are possible. Where do 1 and 2 come from? From the fact that maybe only one of these operations happened and the other never did — in a distributed system you can't tell whether an operation is guaranteed to happen — which gives 2; and what if both of them failed and nothing changed, leaving 1? You cannot predict the outcome in a distributed system. There's actually a paper behind this, which says that as the number of steps in a distributed system grows, the number of possible output states grows exponentially with it. So: there's process one, there's process two, there's a small datum in between. What all can go wrong with just that simple arrow when both try to access it? We need locking and serialization. There could be multiple requests coming in on the same entity, and you've got to deal with all of them. What that means is: process one or process two gets the lock; the other is stopped and waits; once the holder releases the lock, the other one gets it. But it's not as simple as that. How long do you acquire the lock for — ten seconds, ten hours, ten days? How do you define that? And how do you tell the other process, "I expired your lock; don't come back and do this again"? Operations have to move on somehow, and the next time the other process wakes up and decides to redo the work, you're back to the idempotency constructs we just discussed. (The little sketch below shows both the race and the lock.)
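A minimal sketch of that y = y + 1 versus y = y * 2 race: the mutex serializes the two writers, but notice what it does and does not fix.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var mu sync.Mutex
	y := 1

	var wg sync.WaitGroup
	wg.Add(2)

	// Writer 1: y = y + 1
	go func() {
		defer wg.Done()
		mu.Lock() // serialize access to the shared datum
		y = y + 1
		mu.Unlock()
	}()

	// Writer 2: y = y * 2
	go func() {
		defer wg.Done()
		mu.Lock()
		y = y * 2
		mu.Unlock()
	}()

	wg.Wait()
	// Even with the lock, the *order* is still nondeterministic:
	// (1+1)*2 = 4 or (1*2)+1 = 3. The lock only rules out torn,
	// interleaved read-modify-writes, not the ordering ambiguity.
	fmt.Println("final y =", y)
}
```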
This leads to another construct. A lot of people ask me this question. We were having a discussion a while back and I said, I'm going to set up a disaster-recovery site for you, because you're living in only one availability zone — fair enough, because one can go down; that was easy to understand. The follow-up questions actually baffled me. I was not empathetic back then, but I realized the questions were valid. The next question was: do you want it synchronously backed up or asynchronously backed up? This is the standard challenge with every database. When we set up two databases, two processes, where we say one is a backup of the other, what algorithm are we using underneath — are we synchronously backing it up, or asynchronously? The answer I got was: "I just want my data to be copied without blocking the user transaction, but when the master fails, the slave should have it." This is impossible. It sounds very reasonable, but it's actually impossible. Why? Because there's the possibility that while I'm trying to sync without blocking the user flow, the disaster-recovery site is not reachable at all. So when this side fails and the other one comes up, that last transaction is not there. One may say: that's okay, I can live with that. But what if it was a million-dollar payment a user made to you, which did not reflect on the other side? When the slave comes up, you don't have that data. You have everything else; you just don't have those million dollars, or a ledger entry for them. You may be okay with this design decision, but it's yours to make — it's a product decision; we can't make it in isolation. So, how do we solve this? How many of you have heard of Paxos and Raft? Don't raise your hands; you obviously work with them every day. This is where clustering algorithms come into the picture. This is where we start talking about quorum-based writes: at least two of the three replicas would have gotten the write before I get an acknowledgement. What that means is that even if one went down, another still has that state, and if a third process comes back as a replacement, it auto-syncs. And notice what we keep coming back to: both availability and reliability are going to require three nodes — the same principle again, right? So this is what a clustering algorithm would look like, and all of this would have to be done for your state maintenance. (Below is a small sketch of what a quorum-based write looks like.)
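A minimal sketch of a quorum-based write. Real systems (Raft, Paxos) do far more — leader election, log ordering, catch-up — but the acknowledgement rule looks roughly like this: send to all three replicas, return success once two have confirmed.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// replicaWrite simulates writing to one replica; it may fail or lag.
func replicaWrite(id int, value string) error {
	time.Sleep(time.Duration(rand.Intn(50)) * time.Millisecond)
	if rand.Intn(10) == 0 {
		return fmt.Errorf("replica %d unreachable", id)
	}
	return nil
}

// quorumWrite acks the client once a majority (2 of 3) confirms.
func quorumWrite(value string) error {
	const replicas, quorum = 3, 2
	results := make(chan error, replicas)
	for i := 0; i < replicas; i++ {
		go func(id int) { results <- replicaWrite(id, value) }(i)
	}
	acks, fails := 0, 0
	for i := 0; i < replicas; i++ {
		if err := <-results; err == nil {
			acks++
			if acks >= quorum {
				return nil // majority has it; safe to ack
			}
		} else {
			fails++
			if fails > replicas-quorum {
				return errors.New("quorum not reached")
			}
		}
	}
	return errors.New("quorum not reached")
}

func main() {
	fmt.Println("write:", quorumWrite("y=2"))
}
```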
So, coming back to scalability: it comprises logical and data decentralization — because if one node holds everything, it can fail in multiple ways — plus data replication, plus reduced communication. How does reduced communication happen? It happens when the client can absorb work itself. Example: if you have a form to validate, don't send it to the server side every time; just validate it on the client side. Reduce communication: the less you communicate, the lower the chance of failure, because the network is unreliable. Now, this leads to the CAP theorem. How many of you are aware of the CAP theorem? How many of you are not? Okay, cool, there are a few people here. It talks about consistency, availability, and partition tolerance. Consistency means that if data is available in one place, all its other copies are consistent with it; there's never a situation where one copy is behind another — the situation I mentioned a while back, where your disaster-recovery site hasn't received a few transactions, would not happen. Availability means that no matter what the case, when I hit this service, I always and always get a response. And then there's partition tolerance. The last one is actually a misnomer; it shouldn't really even be a choice in this theorem, because a system cannot be not partition-tolerant. Take the same example — go back to the diagram where two servers speak to the same database. In case of a partition, one server cannot reach the database, and that results in unavailability by itself. So let's not talk about sacrificing partition tolerance at all; it is a must. A system must always be tolerant to network partitions. What that means is that, between the other two, I can pick only one: consistent or available. Either every copy has all the transactions that have been fed to it, per whatever definition of consistency I hold — it could be quorum consistency, meaning two out of three, a majority of nodes, have the data — or I pick availability. An extension of this thought is the PACELC theorem, and this one's more understandable: it adds latency and consistency. What it says is: if you want consistency, it will take a while for data to circulate around; and if I want a very low-latency system — I write, and I want the acknowledgement right back — then I can't expect consistency at the same time. It's possible that some of my data may be delayed in arriving; the system can be eventually consistent and sync up later. Now, there is no universal answer as to which one you should choose; it depends on your application. If you're making a banking transaction, you want it consistent. If you're making a Facebook comment, nobody cares if nobody else sees it for another five minutes, right? Maybe that's a product decision you've made; these decisions cannot be made on engineering grounds alone. The whole flow, everything we've spoken about, looks like this now: there's Twilio calling a balancer — multiple copies of it; there's a service, there's a queue, there's a data store which itself uses locking and consensus algorithms, and there's idempotency built in. A very simple application has become a beast, just because we talked about reliability. Now, a lot of people come and tell me: what about Spanner? It's available and consistent, both. Well, technically they say that, and they're not lying about it. But how can a system be both consistent and available if you can only pick one? The answer is on Spanner's page itself — go to the FAQ. Google Spanner is a database which is globally available and at the same time ensures that the data is consistent everywhere as well. Actually, the answer is: it is only a consistent database; it just gets to call itself available, for two reasons. First, it's a proprietary network: Google has its own network behind it, so they can guarantee you the network — they are not running on any public network, the links are owned by them, they have projects running deep-sea submarine cables. Second, they guarantee you an availability of five nines; for your system to even catch that downtime, it would have to be more available than that, and you don't have that, because their stack is proprietary end to end. So the answer is: it is a consistent system, just a very highly available one — not 100%. You can read more about this on their page.
They have explained this beautifully there, and they also use GPS- and atomic-clock-based time synchronization for this beautiful work. So: a reliable system has to be transparent, scalable, and correct. Transparency means access transparency, location transparency, concurrency transparency, and failure transparency. Access transparency means that for a system which is distributed underneath — say I call google.com — it doesn't matter where it gets me the data from, because I don't know how or where it is accessing it. Location transparency means the URL always remains the same. Concurrency means that if four people access it at the same time, it should not bother any of the users that they were accessing it concurrently. Failure transparency means that if it fails underneath, let it fail; it should not bother the user. That's what the definition of a reliable system would be. It should also be scalable: size scalability and geographical scalability. The summary being: the next time you look at any of these tools you have out there, just think of these four quadrants — consistent, economical, available, reliable. You can only have an intersection between two of them; there is no way you can have three. Make your trade-offs, make your choices, based on what your product and your business allow. All put together, look at this diagram again. The take-home would be: embrace your bugs. The key is not to never fail; the key is to not fail twice for the same reason. There is no silver bullet. I cannot give you an answer as to whether this is better or that is better, whether Paxos is better or Raft is better, whether AWS is better or GCP is better. It depends on what trade-offs you're going to make. There's a cost to everything. And lastly: think product first and business first before you make an engineering investment. Thank you. I'm Piyush Verma; I head Site Reliability Engineering at Trusting Social. — We have five seconds for questions. — "Great talk. You said you'd talk a little bit about service meshes, but I don't recall you actually mentioning them. Can you briefly go over that?" — I think it's there — or I must have missed it in the narration; I'll cover it with you outside. Okay, any other questions we may have, other than "where is the lunch?" Where is the lunch? Thanks. — This was an excellent talk.