So, first question first: how is reliability different from DevOps? One is an ability of a system, the other is an action. My ability to eat doesn't mean I am going to eat; these are two different attributes. How a system becomes reliable is what I'm going to talk about. A few questions first: which of these four is more reliable — Fabio, Traefik, Kong? Come on, I'll give you a T-shirt, raise your hands. None of them? That's correct. Okay, which of these four databases is more reliable — I mean, there are four things, right, surely one has to be better than the other — ZooKeeper, etcd, Consul, anyone? No? Last one, then. Anyway — how do you actually define reliability? Does it mean that at any point at least one server is online? Does it mean all servers are below 100% utilization? Does it mean all servers respond within x milliseconds? Or none of the above? These are important questions to ask when we say something is reliable or not. A lot of people get confused between DevOps and SRE — what's the difference? An SRE's job is to make sure the system is reliable, but that doesn't mean the system *is* reliable: operations is something you do to it, while the ability of the software itself to be reliable is the real question. An ideal software is one which does not require any human operations — it keeps running by itself. When we say something "just works" — imagine saying your laptop is reliable — does that mean it keeps working, or that you keep doing things to it? That's what operations is. So if I look at a curve, the cost of the operations you do on a software has to eventually drop down; it starts working by itself, and that's when you'd call it a reliable system: I don't have to spend time on this, it keeps working. So to get there, let's just 
take a very simple application. A lot of people might have seen a talk of mine before, but I've made some changes, so if you've seen this on YouTube already, this is slightly different. The application is "remember the milk": you send an SMS to a service, the service registers it — this is the time at which I should call back — and it reminds you: hey, you asked me to remind you Tuesday 9 p.m., buy the milk. So buy the milk. Fair enough: I send an SMS with some text — "on x date, remind me to do this" — and when the date and time are right, the system calls me back on my phone. Very simple thing. Now, there are two aspects to this: one is the part where the user sends an SMS and the service receives it, the other is where we call the user back. Let's make it even simpler and say we use Twilio for this. With Twilio, you receive an incoming webhook; the webhook handler records when it has to call the user back, and it sends back a 200 response. A very simple arrow that goes in, an arrow that goes out. We're going to focus on the many ways these two arrows can fail. These are two distinct systems: Twilio is one service, my service is another. A very quick quote from Leslie Lamport: "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." Where is that system in this diagram? Let's try to find it, because this guy can't be wrong — he's written so many research papers. There is a system in the middle of these two flows which we now have to find. Now, what are failures? There's a CPU failure, a memory failure, a disk failure, a network failure. A CPU failure — well, it can't lie within these two systems, 
right; a memory failure is not within these two systems; a disk failure is not within these two systems. The only thing that can fail between these two systems is the network. Fair enough. Now look back at the picture: there is some computer in between which is going to fail and bring this entire system down, just to prove Leslie Lamport right. Now, who's more reliable: AWS, GCP or Azure? Pick one. You can't decide, right? I'm asking you these questions for a reason, because a lot of people ask me, "tell me, is this better or that better" — like, is Honda's car better or Maruti's? You can't tell without knowing the dimensions of the problem. The goal is that by the end of this talk you understand the dimensions of a distributed system, so next time you hear a fancy tool — Rockstar DB, Superstar DB, Popstar DB, Cockroach DB, Insect DB, Villain DB, some DB — you ask those questions: these are the dimensions on which I'm going to evaluate it. And if somebody tells you all of these are better, he's lying, trust me. There have to be trade-offs; computer science is all about trade-offs, and we're just going to learn them. First failure point: the server is unavailable. What does the architecture diagram look like? Just one box — there's nothing else. Now, for failover — I'm going to introduce one concept at a time — there's a thing called a VIP, a virtual IP. A virtual IP is where two systems compete for one IP address. This is about networks, right: when you make a call, a socket talks to an IP address, and if one server fails, that IP address has to switch over to the other server. So these two systems compete with each other and try to acquire an IP 
address, but only one gets it at a time — and that is the concept of a VIP. An age-old concept, 1980s, 1970s, from around when the first Terminator movie was made. You can use keepalived, heartbeat, anything, but it's important that we remember this, because a lot of people will tell me: if you have to build a failover system, why don't you just use a load balancer? I'll come to load balancers. So: there's a server A and a server B; one goes down, the VIP switches over to the other one, which takes over the requests. We've got a system that does failover. But failover is not load balancing — it's just a spare stepney: when a wheel goes down, you pull out another one. So what does a load balancer system look like? There's a server A, there's a server B, and requests are divided between them. How? Equally? Fairly? Unfairly? We don't know. This is where we have to think about different schemes. Two words are very important here: load and balance. How do you define load? Is it the number of requests? CPU usage? Latency? There has to be a definition. What's the best metric of load — number of connections, CPU, memory, latency? We can't tell, and that's the beauty of it: it depends on you. If somebody tells you "this is the best load metric" — no, that's a lie. You have to choose for yourself, because each one has a different trade-off. I may go by number of connections, but that doesn't mean the connections are being served at the same rate — Byzantine failure is the word we use: one server is running slower than the other because of a degraded disk. So can number of connections be the answer? Is it CPU load? 
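To make "number of connections" concrete as a load metric, here is a minimal sketch of a least-connections picker; the server names and counts are invented purely for illustration:

```python
# Least-connections balancing: route the next request to the backend
# with the fewest open connections. Names and counts are illustrative.
connections = {"server-a": 12, "server-b": 7, "server-c": 9}

def pick_backend(conns):
    # min() over the backends, ordered by current connection count
    return min(conns, key=conns.get)

backend = pick_backend(connections)
connections[backend] += 1  # the request we just routed is now load too
```

Note that, as the talk points out, this metric is blind to a slow backend: a degraded disk can make 7 connections on one server far heavier than 12 on another.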
Well, I just said the disk is failing, so CPU load cannot be the answer either. This is something you figure out yourself, by observing your systems: this one behaves this way, that one behaves that way. Now the question of balance: you have to choose to send a request to one system or the other. How do you decide? Should I send two here, one there — and what's the definition of "two": two still in flight, or two already served? It's a pretty complicated problem. Before I get to that, think about the architecture of the load balancer itself: next time you use one, remember that the balancer itself can go down. When they give you one IP address, you wonder: why does my server go down and never the load balancer? It does — it's just that the complexity is now encapsulated inside that balancer. The balancer could be working in either of two modes: it could itself be distributed, with two of them, or it could work on the VIP principle. So next time you look at a load balancer and see that GCP gives you one single IP address, do you want to guess which one it's using? The first one — because they use that topology. But look at AWS: it gives you multiple IP addresses — the lower one in the diagram — and the balancing across those IPs is left to you. The standard way to do that is a DNS query: you attach a DNS name with multiple IPs behind it, and the balancing part is offloaded to the user — to you. You decide: am I going to use a Route 53 weighted routing policy, number of connections, whatever — all those things we talked about. Now who uses what? 
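A sketch of what "offloading the balancing to the client" looks like — the hostname and addresses are made up, and the resolver is a stub standing in for a real DNS query that returns several A records:

```python
import random

def resolve(hostname):
    # Stub for a DNS lookup returning multiple A records; real code
    # would use the system resolver (e.g. socket.getaddrinfo).
    return ["203.0.113.10", "203.0.113.11", "203.0.113.12"]

def pick_ip(hostname):
    # The client, not the provider, chooses: random here, but it could
    # be weighted, least-connections, latency-based, and so on.
    return random.choice(resolve(hostname))

ip = pick_ip("api.example.com")
```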
GCP — fair to guess it's the first one, because they give you one single IP address, so there must be a VIP somewhere. AWS gives you multiple IP addresses — the one down below. And when they give you multiple IP addresses, each IP address is internally a VIP itself, because that one can go down as well. It's layers of complexity: the same constructs used over and over again to make a cathedral or a bathroom. Now, failover plus load balancing: what's the ideal load on a distributed system? What percentage should I keep free — any guess, 50%? Well, this one is actual mathematics, so we can take a stab at it. What's the problem with the 50% option down there? Say there are three servers, each at 50% consumption. What goes wrong? Exactly: one zone dies, all its traffic moves onto the others and pushes them towards saturation — and the moment a server touches 100%, game over, because you don't know how it's going to behave. Does it OOM? Does memory start swapping? Do you get degradation, does the CPU pin at 100%? You don't know; every one of those failures is going to behave in a different way. 
So in a typical system the threshold you'd accept is somewhere around 40%, because if one node goes down, its share lands on the other two and they stay within limits — you can compute these numbers mathematically. With a four-node system, if one goes down and its traffic is distributed among the remaining nodes, the number works out somewhere around 47%. Now you'd ask: why only one going down? What if two go down? Very fair question — two do go down; they're sleeping in the night, two zones go down. Rarely, but it happens, and it depends on the uptime you're committing to. If you want a system with really high uptime, well, either you pay people or you pay for servers: you'd make it four zones. That works too, but you have to shell out more money towards it — it's up to you. If I ask you which is more reliable — a three-rack system, a four-rack system, a five-rack system — five is the most reliable one, but it's not reliable on your pocket. So you've got to make a judgment: how much failure can you adapt to? Next time you hear "US East has five zones but Mumbai only has three, so US East must be more stable" — I'd say it could be the other way around: why did they need five? Maybe they go down more often. The answer is up to you: how much failover do you require? Most of these companies are based out of US East, and there's a stat here, funny fact: AppDynamics came up with a report saying that every time a zone goes down for five minutes, a US East company loses $480,000. So it's up to you how much failure your business can sustain. Which brings us to a dilemma: should I make my system more durable, more available, or more economical? Can I say that I want the best 
resilient system, but I don't want to pay money? It doesn't work that way: you've got to pick two of the three. That's the standard thing. Now, coming back to the same concept — load balancing, whose challenges we just discussed. If I had to pick the most reliable algorithm here, which one is it: is random better, is round robin better, is least frequently used better, or least connections? "It depends"? That's the funny part: "it depends" works for every other problem statement, but this one doesn't depend. Take a sample: flip a coin, heads and tails. Our goal is to be fair. Over a large number of flips, heads and tails come up almost the same number of times. But take a small window, and map those flips to the operations hitting your system: within the window, heads and tails are going to be uneven. In one small window heads and tails happen to be even, and for that period you'd say a random algorithm works. Now look at another window: what if you get a run of heads, and those are your requests? Suddenly four requests go to one server and one request goes to the other. So the actual answer is: if you pick random, your best probability is that over a large set of operations your requests are spread evenly; if you're trying to optimize for a small window of operations, you just can't do that. 
Now, there's actually a very good talk — I should have brought it up earlier — by Tyler McMullen, Fastly's CTO, which says that load balancing is almost impossible: no matter what you do, whatever algorithm you come up with, you just cannot be perfectly fair, because you'll run into challenges. You decided to send three requests to one place and one to another, but those three were the really bad ones — video-encoding requests. Or you say, fine, video-encoding requests are all equal — well, one video is 100 MB and the other is 10 MB. There are so many attributes you could decide on that your best bet is just: make it random. Coming back — now that we know how a load balancer works, which of these techniques is actually the best: should I use client-side load balancing, server-side load balancing, or look-aside load balancing? Before I get there, let's understand what these techniques are. Server-side load balancing is dead simple: everybody has used Nginx, Apache or HAProxy at some point, or the fancy ones — Kong, Traefik, and so on. A request comes in, somebody decides for you: hey, this is the best server for you, please go there, job done. Look-aside load balancing is what DNS does. When you go to google.com, what IP do you get? A random one, right? Some DNS server — not google.com's servers — decided for you: this is the best IP for you to hit. That's look-aside load balancing: you make a query, and out of the 50 IPs it knows for Google, it gives you one and says go hit this one. And then there's a newer one, client-side load balancing: instead of somebody else 
making a decision for you, it gives you a list of IP addresses: go make your choice, whatever algorithm you want — least frequently used, most frequently used, up to you. Now, how does client-side load balancing work? This is where service discovery comes into the picture. Most of the time when we say "oh, we should use Consul, best service-mesh principle, let's use this" — let's take a step back: how does it actually work, and why is it that way, all these patterns, sidecar pattern and so on? There's a client which wants to call a service, and a discovery layer in the middle. When a service instance comes up, it registers itself with the discovery machine and also exposes a small health endpoint: call me on this and check whether I'm alive. So it adds itself, and the discovery machine keeps pinging the service to see whether it's alive or not. Next time a client wants to call that service, the first thing it does is query the discovery service: give me all the available IP addresses. It will return only the IP addresses that are healthy, because it's constantly doing those checks. This might sound very similar to look-aside load balancing, but the difference is that discovery only does health checks for you — or may not even do that; you may decide you don't need health checks because it's a perennial service that never goes down, if you have that confidence. It gives you the list of IP addresses, and then it's up to you — and by "you" I mean the client code — to choose which to use; each client may decide differently. Why is this important? Let's say you're building a mobile application, and this mobile application has to 
call a payment gateway. If the payment gateway gives you three IP addresses and your code makes the choice, then when you distribute that code — one of you sitting in Bangalore, one in Delhi, one in Pune, all three trying to make a payment — you're not all forced onto the same server by a static algorithm or by someone else's decision. What if the edge network goes down, what if your ISP goes down, what if one IP is reachable by you but not by somebody else? Because the decision is made on the client side, each client can adapt and hit the right server. Pretty advanced. You'd ask: why do I need to add this complexity? Well, it depends on the uptime of your application; if it's a very sensitive application, you'd want to do this. Now this introduces another concept: load shedding. One of the things we don't realize load balancers are doing for us is that if one of the systems is under too much stress, they cut off the load to that server and send the traffic to another one while waiting for the stressed server's requests to complete. Exactly what used to happen in the 90s: every time you were watching a cricket match or a movie, there'd be load shedding — electricity gone. Same concept, exactly the same one. And while it's doing all of this for you, you come back to the same conclusion: load balancing is almost impossible, and because it's so impossible you can't say one load balancer is better than another; it's only a situation that makes one look better, or your familiarity with it. Now, the first arrow has failed enough — let's look at the second part: I am going to call the user. How many ways can this 
fail? Let's look at the outbound call: earlier Twilio was calling me, now I am going to call Twilio. Simple, just two arrows. Somewhere, my system goes out via a NAT gateway — does everybody understand NAT? Does anybody want me to explain NAT? Please raise your hand… it's alright, everybody gets it. NAT is the outbound route taken from your instance: because it's not a public instance, you still need something else to translate addresses and send the traffic out. Now suppose the NAT gateway cannot do a DNS lookup of Twilio. What should my code do — just give up? Or, what if you fail your exam: do you give up, or do you retry? Retry — but how many times? All of us played Mario in our childhood; we don't get there in one go. (Has anybody ever actually completed Mario? Oh, cool.) So: retry. When a user sees a payment button failing, or the Play Store app can't download, they hit retry — once, or maybe multiple times. But my question is: how do you know that something that didn't work once, one second ago, will magically work in the next five seconds? Will you keep retrying forever? So what people do is exponential backoff: first retry after one second, the next after two seconds, then four, then eight, and so on — maybe one day it will work. But can you keep going forever? What if it was a request someone is actually waiting on — it's not an asynchronous architecture? So here's a very important term: circuit breaking. Has anybody heard of Hystrix — H-Y-S-T-R-I-X? Okay, cool, good. Who hasn't? Okay — I think there was a guy who didn't raise a hand either time, so the totals don't match, but anyway, maybe they know Hystrix. So here's what happens when you're about to make a connection. Say you're a service — I shouldn't use Hotstar because there are people from Hotstar here, let me use Netflix; nobody from Netflix? Yeah. So you're watching Netflix, it's trying to load content, and for some reason Akamai is down. One request fails, and you know Akamai is not working; another failure happens, three failures, four failures — it is not going to magically start working. Earlier, when I was building these web services, I would see that if a service returned an error, the user got a 500, and it would turn into a flurry of 500s, and my usual answer was: stop retrying already — it's not going to work automatically, I'll have to fix this. It's not going to fix itself, and Akamai might as well stay down for a while; for the next user who comes in to watch a video on Netflix, this gives you a chance to say "our services are facing degradation" rather than letting the user go through that painful cycle of buffering, loading, and then something else failing, where everybody is going to see it. Don't do that — break the circuit: Akamai is down, I know this, let's try again after five minutes. This is a pattern called circuit breaking — very important, because it defines the usability of your application. It lives on the fact that if it didn't work once, and didn't work after a certain number of backoffs, give it some time — it's not going to happen by itself — and start using a different control mechanism in the meantime: an error page, a good-looking error page, or "alternatively, watch this video stream instead". So, who is more reliable: retry once, keep retrying, or circuit breaking? You can't say which is more reliable, because, well, keep 
retrying — what if you deduct a balance multiple times? How would you know? Take our user example: we're calling Twilio to call the user — "buy milk" — but Twilio is failing to return the acknowledgement to me, and if I keep retrying, the user gets five calls: buy the milk, buy the milk, buy the milk. Fine — delete the app. You get my point: it defines the user experience, and you can't say which is more important. You'd say: if only there were a magical way to try it exactly once and have it work. This is where we fall into a trap. So again I'll ask: who is more reliable — I'll make it slightly mathematical here — at-most-once delivery, exactly-once delivery, or at-least-once delivery? At least once? Cool, let's look at it. At most once: tried, didn't work, failed, you didn't get a call — so it can't be that. Because, see, if you only try something once — that board exam, right, you were in tenth: if I had failed it and never retried, only tried once, I would never have got my degree. It doesn't work that way. So you say at-least-once delivery — but does that mean you keep trying forever and ever, because you don't know what the outcome would be? Does it result in 100 calls being made to the user? Take a simple bug in the application: Twilio is failing to acknowledge to me that it has called the user. What should I do? Pretty tough — there's no simple answer here. So what do we say? We go to our dream properties: guaranteed order of messages, and exactly-once delivery. Cool, let's solve that one. The best would be exactly-once delivery: if there were a magical way to do the thing only once — deduct the balance once, not keep retrying — that would be fine, because all systems want to work that way. 
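The retry-with-backoff and circuit-breaking ideas above can be sketched like this; the thresholds (three failures, five-minute cooldown) are illustrative, and time is passed in explicitly so the logic is easy to follow:

```python
class CircuitBreaker:
    # After `max_failures` consecutive failures the circuit "opens" and
    # calls are refused until `cooldown` seconds pass, instead of
    # hammering a dependency that is known to be down.
    def __init__(self, max_failures=3, cooldown=300.0):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def allow(self, now):
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.cooldown:
            # Half-open: let one probe request through.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_failure(self, now):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = now

    def record_success(self):
        self.failures = 0
        self.opened_at = None

def backoff_delays(base=1.0, retries=5):
    # Exponential backoff: 1s, 2s, 4s, 8s, 16s, ...
    return [base * (2 ** i) for i in range(retries)]
```

Between retries you would sleep for the next backoff delay; once the breaker opens, you stop calling entirely and show the user a degradation message instead.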
Now, it can only be done as at-least-once delivery plus exactly-once processing. What we mean by that is: we will try again and again, but the successful operation should take effect only once — and it matters how we define success. I may try a hundred times to explain this to you, but it should be successfully explained only once. The key is that success is defined by idempotency: even if I run the operation multiple times over, it should not matter — the output should remain exactly the same. And atomicity — everybody understands atomicity? All of it either happens or it does not happen, like a database transaction. An age-old concept again, just applied to distributed systems: I either do it completely or roll it back completely. How we implement this is up to us — and there's a window; window meaning: for how long? Let's take an example — instead of Twilio, let's take a payment gateway. Here's the data: "make a payment", a request payload which has a request ID in it. This is one of the ways of achieving idempotency: you send a request ID, the other service receives it and sees — oh, I have already processed this payment once, I'm not going to proceed further. Now this is okay, this is good, but I said I'd keep those IDs somewhere — and in a payment system where billions of payments are happening, this hash keeps growing. If I have a million payment requests an hour, I have a million request IDs, multiplied by 24 for a day, multiplied by 30 for a month, multiplied by 12 for a year — we're looking at an enormous store. What that means is every subsequent idempotency check gets more and more expensive, so the system naturally drifts into degradation. It can be that bad a design. Hence the window: maintain idempotency only over a day, or an hour — up to you. If I 
see this request ID within the next five minutes, I'm not going to process it again; but if you send it after that, I will process it again. It depends on your window. So these are the three key attributes of exactly-once delivery, because you have to design these constructs yourself — and we usually don't care to pay attention to these things when we design systems. Now, a question: if I revisit the picture, it looks like this — I have a request ID, I send it to Twilio, it processes it once and then calls the user — hoping that Twilio has some notion of a request ID in it; if it doesn't, well, good luck, keep trying. Which one's more reliable now: Kafka, Celery, RabbitMQ, Sidekiq — with all these questions in mind? Does anybody here believe that if they just install one of these, their problem will be solved? It's okay, you can believe that… no? Okay, good — because these are just the tools to get you there; the problem behind them remains exactly the same. Kafka is not going to automatically make a forever-resilient system for you. It has certain constructs, and you should know how to use them — the window still has to be configured by you; it's a property of your system. So next time you run into a queueing system that retries, ask these questions: does it have a window operation where I can set a retry window? Does it have this notion of atomicity? Does it have idempotency — and again, that part lives in your consumers, not in the queue processor. Example: if you're taking messages off Kafka and doing a bank transaction for each, but you don't maintain idempotency on your end, and somebody goes and rewinds that topic — ooh, game over: every balance gets deducted a hundred times over. So these are just ways to get there; you still have to understand their impact. Most of these things we realize only after we've failed; it would be good to 
understand these dimensions up front. That takes me to another part: guaranteed delivery in a multi-party system is almost impossible, because there are so many parties involved who all have to care about idempotency. Now, I'm going to focus only on the middle part — storage, where data is stored and has to be acted on again. There are two processes, and var y = 1: one process does y = y + 1, the other does y = y * 2. What are the possible outcomes — 1, 2, 3 or 4? Please raise your hands, we don't have too much time… every time, the audience doesn't answer; you've got to stop that timer. The answer could be any of them: y = y + 1 happens, the other fails — you get 2. y = y * 2 happens, the first fails — the answer is also 2. Both fail — y remains 1. Both happen, plus-one first then times-two — the answer is 4. Or times-two happens first, 1 × 2, and you add one later — 3. All of these are possible states in a distributed system, and there's nothing wrong with that: they're all valid states. The question is how you serialize them. If two parallel operations compete over the same state, they'll race by themselves — how do you control that? You say: I need a lock. With a lock, one thing happens after the other; it's no longer parallel. But how does a lock itself work — one guy takes the lock, read locks, write locks, semaphores, mutexes — we studied all of this in college, those of us who went, so I'm going to skip this part. Now, master-master replication. The point I'm making here is: you try to save one thing into the database while another operation is trying to do something to the same data — how do you control these two operations, that's the challenge. Let's take a concrete example: Cron comes in asking how many of 
Now let's take a sample example: a cron comes in and asks how many of these jobs are there which need to be called now. Let's assume the best case: we have become a very big service, so many people depend on us, and I can't call those users one by one, so obviously I will have to try to do it in parallel. When I do it in parallel, there will be multiple workers trying to feed off the same queue. How do I tell that queue, give this worker one job, and tell the others, don't take this one, because it's already being processed by someone? So the same state is being referred to by multiple people here, multiple processes; I try to give a human touch to processes, it gives them a live nature, yeah. Now one would say, let's use a database which has master-master replication, or one says, let's take master-slave replication. Everybody understands these, or should I explain them, how master-master and master-slave are different? All good. Now one would say, well, these have limitations, let's use some fancy word out there, Paxos and Raft, let's use that. Now I want to come back to the same question: who's more reliable, master-master, master-slave, clustering, or eventually consistent? We run into these terminologies every single day, and I have met so many people who say, hey, why don't you use Cassandra for this, it will solve your problem. How does Cassandra know what I want? I would love it if it would actually solve that problem for me. The challenges here are: master-master means if one operation fails on one master node, it will not go to the other one as well, so the entire operation is going to fail. Master-slave means you are writing to one master; if a slave goes down it won't get the record, and if the master goes down before the slave wakes up, those transactions are gone. You have to design an application where you are aware which one is the better choice for you.
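One common way to make workers "tell the queue, don't take this one" is an atomic conditional update. Here is a minimal sketch with SQLite standing in for the shared store; the table and status names are invented for illustration:

```python
import sqlite3

# Shared job store: three pending reminders waiting for a worker.
db = sqlite3.connect(":memory:", isolation_level=None)
db.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")
db.executemany("INSERT INTO jobs (status) VALUES (?)", [("pending",)] * 3)

def claim_one(worker):
    # Pick a candidate, then claim it only if it is STILL pending.
    # The WHERE clause makes the claim atomic: if another worker got
    # there first, rowcount is 0 and we simply look for the next job.
    while True:
        row = db.execute(
            "SELECT id FROM jobs WHERE status = 'pending' LIMIT 1"
        ).fetchone()
        if row is None:
            return None                      # queue is drained
        claimed = db.execute(
            "UPDATE jobs SET status = ? WHERE id = ? AND status = 'pending'",
            (f"taken-by-{worker}", row[0]),
        ).rowcount == 1
        if claimed:
            return row[0]

print(claim_one("A"), claim_one("B"))  # two different jobs claimed
```

The same pattern shows up in real queues as `SELECT ... FOR UPDATE SKIP LOCKED`, visibility timeouts, or compare-and-set; the essential part is that the claim itself is a single atomic operation on the shared state.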
Eventually consistent means something is reading from a slave, a master is eventually going to write there, and it's possible that reader doesn't get it yet; that's okay, right? So again, these are challenges that you have to figure out in your system. Clustering, well, I just threw that word in there because people bring it to me a lot. Now that brings me to the CAP theorem. The question here is, we have to keep an eye on this one, right? We are going to throw three words here: consistency, availability and partition tolerance. Partition tolerance means: if there is a network partition, where two nodes of your master-master cluster can't speak with each other, how does the system behave? It should keep running, right, is that what our assumption is, or should it go down? If on the server side two racks can't speak with each other, should that bring the entire service down, or what should happen? We expect that it should continue to work, right? So let's keep this aside: dropping partition tolerance can never be an option, the system always has to be partition resilient. Now the only two parts left are consistency and availability. Does everybody understand the CAP theorem? Good, I'll skip through. Does everybody understand this one? Okay, cool. Yeah, this is just an extension of the CAP theorem, PACELC: in case of a partition, we are going to choose between availability and consistency, but when there is no partition, it says, else choose between latency and consistency. What that means is this: is data going to replicate right away? No, it doesn't work that way, there's a network involved, we just discussed that. So data is eventually going to get to the other place, or it won't get to the other place, right? So pick between the two: either say that I want my data guaranteed to be replicated on both sides, and if you do that you can't have a fast operation, because it's taking time and the network may be slow.
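That "pick between the two" can be sketched as two write paths; everything here, the class, names and the delay, is invented for illustration. The synchronous path pays the network wait for consistency, the asynchronous path acknowledges fast and leaves the replica temporarily stale:

```python
import time

REPLICA_HOP = 0.05          # pretend network delay to the replica

class TwoNodeStore:
    def __init__(self):
        self.primary = {}
        self.replica = {}

    def write_consistent(self, key, value):
        """Slow: don't acknowledge until the replica has the data."""
        self.primary[key] = value
        time.sleep(REPLICA_HOP)          # wait out the network
        self.replica[key] = value

    def write_fast(self, key, value):
        """Fast: acknowledge after the primary alone; the replica
        catches up later, so a reader there may see stale data."""
        self.primary[key] = value

store = TwoNodeStore()
store.write_fast("milk", "tue-9pm")
print("milk" in store.replica)   # False: the ack raced the replication
store.write_consistent("eggs", "wed-8am")
print("eggs" in store.replica)   # True: we paid the latency for it
```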
Or you say, I am okay with data getting acknowledged as written, but it may not have reached the other place, because the other operation that is going to read from the other replica may not get it in time, right? So we are going to pick between these two. Now if I look at a revised flow of our application, this is what it's going to look like, because we are trying to make it more reliable: a user sends an SMS, I've got multiple copies of load balancers, the load balancers have multiple service instances behind them, and the service does some magic to figure out all those things we discussed. Which obviously, now we realize, the service cannot do alone: there has to be a queue as well, or some sort of data storage, because a cron is going to read from it. Now when the cron reads from it, there are going to be some locks somewhere, there are retries as well, there is some notion of idempotency; all of this is how you design your application eventually. Now you may omit this part and say, oh, this is too complicated, but at the same time you have to understand that you are compromising that aspect of the reliability of your application. Now I will just take a minute more. A lot of people say, you know, you're lying, I have seen Google Cloud Spanner, Cloud Spanner does all of this for me. I have seen people say that too, you know, like Cloud Spanner is both available and consistent. No, it is just more available than your application, and that is true, right? When you say that my application had 99% uptime, no, no, it had 99% uptime at the moments your monitoring node was trying to reach it; it could have been sleeping in between, right? When I was in my classroom, I was the best-behaved student, but only whenever my teacher looked around, right? This is important to understand, because the underlying hardware that Google uses is controlled by them, and your application runs on top of it; their hardware is controlled by them, so it is just more available than your application.
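The idempotency notion mentioned above, sketched in a few lines with invented names: each reminder carries a key, so a worker that retries after a failure never places the same call twice.

```python
calls_placed = []        # the side effect we must not duplicate
seen_keys = set()        # completed work; a durable store in real life

def place_call(idempotency_key, phone, message):
    # A retry of an already-completed job is silently skipped.
    if idempotency_key in seen_keys:
        return "skipped"
    calls_placed.append((phone, message))   # do the actual work
    seen_keys.add(idempotency_key)          # record success AFTER the work
    return "called"

print(place_call("rem-42", "+15550100", "buy the milk"))   # called
print(place_call("rem-42", "+15550100", "buy the milk"))   # skipped
print(len(calls_placed))                                   # 1
```

Recording success after the work means a crash between the two steps causes a duplicate call rather than a lost one; which side you err on is exactly the kind of design choice being discussed.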
Do not try to read this slide; it is there on the blog actually, just search for CAP and Spanner and you will get it. And the same goes for that other one as well, Calvin; it is the exact same construct there, just more available than your application. What they realized is that your application fails far more often than their network fails, so they just call themselves more available, but eventually it is actually consistent. What they try to do is replicate, and they also have the magic of the global timestamp, which we do not, because one of the other nodes is always going to fall behind on the timestamp. So what Google does is actually sync with geo-satellites, so their timestamps are correct to the second on every single node, because they are connected to satellites. So the answer being: absolute availability is almost impossible. You know, when someone says I am 100% available, they are lying; they are just more available than their monitoring application. Now, a reliable system, how do you build that? The key aspects are: it has to be transparent. Transparency means that no matter what happens, the user is always oblivious to it: a DNS goes down, your request gets routed to some other server, google.com still works; all of these are transparency properties. Scalable, horizontally or vertically; it depends on your workload and how you are going about it. It's possible that you need more data storage now, so you go vertical; or you have got to expand horizontally because now you are going into multiple geographies, it's up to you. And correctness. The summary being: if I was to actually ask you to pick a reliable system, you would only be able to pick two of those four attributes; it's almost impossible to pick three, let alone four. Now you would say, my entire system is this one thing; no, that's incorrect, multiple parts of the system may use multiple such constructs.
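The earlier point, that uptime is only measured at the moments the monitor looks, can be shown with a toy simulation; all the numbers here are invented. A service that is down 20 seconds out of every minute still scores 100% if the monitor only probes at the top of each minute:

```python
PROBE_INTERVAL = 60                  # the monitor checks once a minute

def is_up(t):
    # The service "sleeps" 20 seconds of every minute, between probes.
    return not (10 <= t % 60 < 30)

# One hour, as the monitor sees it vs. as a user would experience it.
probes  = [is_up(t) for t in range(0, 3600, PROBE_INTERVAL)]
reality = [is_up(t) for t in range(3600)]

print(sum(probes) / len(probes))      # 1.0   -> monitor reports 100% up
print(sum(reality) / len(reality))    # ~0.67 -> the real availability
```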
While your database may only pick consistency and low latency, another aspect of the system may decide to choose availability and economy; you know, it depends on you. Yeah, that's it, thank you.