Thank you, Todd. There are a lot of buzzwords in the title, and I will dissect them one by one as we go through the talk. About me: a lot of my time is spent in performance analysis, trying to bring down Amazon servers (not in production, obviously, but our test servers), understanding where they can fail, and designing mechanisms to prevent that failure.

To start with, what is a circuit breaker? It comes from electrical engineering: it is a device which stops electric current from flowing into your equipment when there is a power surge or your equipment is drawing too much current. The aim is to protect your equipment from failing. Here we are going to use the same mechanism to protect our web servers.

Next, what is a microservice? I will not explain it in depth; I will take an example. Assume you have a bookstore e-commerce website, which is incidentally what Amazon started with. I took four components that could define your website: users, orders, books, and payments. How would you develop it in the first go? You would have everything running on one server, and all of these apps would communicate with each other; because they are on the same server, they do not need any specialized form of communication. This is called the monolithic model. What are its disadvantages? If you have a team of a hundred people working on the same codebase, things are going to get messy: you will not know who is doing what at what time. Second, any small change in users will need to be consumed accordingly in all of your other apps. Deploying is another issue: it is a lot of code sitting in one piece, so every deployment ships the whole system to your server and takes a long time. And if you have a bug in, say, one of your apps, your whole server is going to stop responding.

Then we move to the microservices model. Here we have the same thing, a main web application which serves your users, but it talks to all the services over HTTP, and they do not reside on the same server. The advantages are the mirror image of the disadvantages we discussed for the monolith. All of these components are independently deployable: you can spin up a new instance just for books if books is getting too much load. And if your book service is down due to some issue, all the remaining services can still work, and your website can still partially work. So this is a very simple introduction to what a microservices model looks like, and it is the model I will use in the slides: a main web application fronting your customers, talking to a few microservices.

I will start with a few principles on how you would design a cloud-based, microservices-based application. The main point to stress is to design for failures; that is the first thing that should be on your mind. For a business like e-commerce, designing for failures becomes a requirement instead of something nice to have. As an example, a 20-minute outage on just amazon.com would cost the company $3.75 million. That is a lot of money.

So, some principles on how I would approach designing for failures. The first two are fairly obvious: do not fail if your microservice goes down, and your application should not keep waiting forever for your microservice to respond. If you think about it for a little while, you will figure out that we can just use a timeout in our application to achieve this; no big deal. Next are three points which are not very obvious, but which are very critical. The first is to contain and isolate failures.
What do I mean by containing failures? If one of your microservices is down, or facing latency issues, not all of your users should be affected. This is not just about microservices; it applies to any cloud-based model: not all your users should be impacted by an outage, and that should be your best effort. Next, respect the service when it is slow. When you own hundreds of microservices and one of them is not responding, your application should be smart enough not to bombard that service with millions of requests. After all, your company owns that microservice, and you are not helping by sending it a lot of requests; you are just making things worse. Last is fail fast, recover fast, which is the principle behind the circuit breaker model. With electrical circuit breakers, how do you solve the problem? You figure out something is wrong, you call in an electrician, he does the checking, and if everything is fine he turns the current back on. In our case you cannot do that: not every time can somebody go and turn your application back on once your microservices recover. That should be an automated process, and it should be as fast as possible.

I will take the example of a simple multi-threaded web app. If you have not worked with languages like Java you may not have seen this, but it is not very complicated. The application has two microservices, users and books, and a main web application. A set of requests comes in from users to the web application. There is a dedicated HTTP thread pool, which we can ignore for this talk; it exists because somebody has to manage the lower-level connections. And there is an application thread pool, which takes in your request and computes the response to send back to the user, while at the same time asynchronously calling users and books. Of course, you do not want the call to books to be held up because users is taking too much time, so you issue both calls at the same time, asynchronously.
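The fan-out just described can be sketched with `CompletableFuture`. This is an illustrative sketch only: the service names, the pretend payloads (`"alice"`, `"3 books"`), and the pool sizes are assumptions, with in-process lambdas standing in for the real HTTP calls.

```java
import java.util.concurrent.*;

// Sketch: kick off the users and books calls at the same time so a slow
// users service does not delay the books call, then wait on each with the
// one-second SLA timeout.
public class AsyncFanOut {
    static String renderPage() throws Exception {
        ExecutorService servicePool = Executors.newFixedThreadPool(2);
        CompletableFuture<String> users = CompletableFuture.supplyAsync(
                () -> "alice", servicePool);    // pretend HTTP call to users
        CompletableFuture<String> books = CompletableFuture.supplyAsync(
                () -> "3 books", servicePool);  // pretend HTTP call to books
        try {
            // Both calls are already in flight; now wait for each result
            return users.get(1, TimeUnit.SECONDS)
                    + " / " + books.get(1, TimeUnit.SECONDS);
        } finally {
            servicePool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(renderPage());   // "alice / 3 books"
    }
}
```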
Now, some numbers. We are going to do some math, but it is very simple, eighth-grade math. For the same web application, here are numbers from a real-world case we had: the average latency of a service is, say, 100 milliseconds, and the timeout we keep for these services is one second. Where does the timeout come from? Simply from the SLA of the team that owns the service. The incoming load on your server is 100 requests per second, and we start with 100 application threads. That is a decent number; we actually ran it in production for a long time without any issues, and we could serve a lot more than 100 requests.

So what happens when the user service is unavailable? "Unavailable" might mean it is very slow, or it does not respond at all: we send it a request, and we hit the one-second timeout. Some questions now; I am going to run through some very simple math, so I need your attention. What would be the latency of an incoming request? We do not know the average, but what would be the minimum? The call to the user service is going to fail after one second, the timeout, so at least one second is needed for every request to go through; that is the minimum latency. The average latency would of course be larger than one second, because your application still needs to compute the response. We are assuming the books service responds fairly fast, so no issue from that side, but we still have to compute and create the response for the user, so overall it is probably going to take more than one second.

What about the requests received by the dependent service, the user service? We had no guardrail to check that the user service was unavailable and hold back, so we sent every request that came into our application straight on to the user service.

Now, an interesting point: the requests per second your application can serve. Remember, the application thread pool had 100 threads. Assume 100 requests come in at the same time, which can be a real-world case. All 100 threads in the application thread pool are now busy calling the user service, so your whole application thread pool is tied up for that one-second timeout. You cannot serve more than 100 requests, and you will not even manage 100, because after the timeout fires you still have work to do. So you cannot keep up with the incoming 100 requests per second, while your application thread pool was designed to serve, say, 400 requests per second in the happy case. And consider a simple case: the user just wanted to browse books, and the user service was only called to show information like the username at the top and perhaps the cart. You probably did not need it for the books page at all; you could have skipped the call, rendered the page anyway, and it would have been better.

Coming back to the points I mentioned earlier: did we design for failures? We did the first two, because we had a timeout. Were we able to contain failures? Every incoming request took more than one second, so no: every single user of the website was affected. Did we respect the service when it was slow? No: we called the user service for every single request the application received. Did we fail fast? I do not think so. And did we recover fast? This is the important one. If 100 requests come in during one second and, in the best case, you serve 90 of them within that second, then the next second you again get 100 requests.
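The back-pressure arithmetic that follows from this can be made concrete. A tiny sketch using the talk's illustrative numbers (100 incoming requests per second, roughly 90 served per second during the outage); the method name is mine:

```java
// Worked version of the queue-growth arithmetic: during an outage the app
// serves fewer requests per second than arrive, and the difference piles up
// in the queue for as long as the outage lasts.
public class Backlog {
    static int backlogAfter(int outageSeconds, int incomingRps, int servedRps) {
        return (incomingRps - servedRps) * outageSeconds;
    }

    public static void main(String[] args) {
        // 60 s outage, 100 rps in, 90 rps served -> 600 requests queued
        System.out.println(backlogAfter(60, 100, 90));
    }
}
```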
The 10 leftover requests stay queued in the HTTP thread pool, to be served later, and that queue builds up with time. What happens if your service is down for, say, 60 seconds? You will have 60 × 10 = 600 requests pending, and even once the user service comes back up, you need to process those 600 requests first. This is called back pressure on your application: you are serving requests that came in earlier instead of the ones arriving now, so even requests arriving after the recovery get affected.

Now look at the circuit breaker mechanism. The normal case is what we just saw: every request went through to the service. In the circuit breaker analogy, an open circuit means we have detected an overload on the service, so we stop sending every request: we let only a fraction of the requests through, which time out, and we reject all the rest directly. The call is short-circuited, returning a response immediately without waiting out the timeout, and as soon as the service comes back up, the circuit closes again.

As an implementation detail (I will explain how this works on the next slide), we add two thread pools, one for each service, in addition to the application thread pool. How many threads should each pool have? Very simple math, and a good starting point: we serve 100 requests per second, the latency of a service is 100 milliseconds, so we use 100 × 0.1 = 10 threads for each pool.

Again, let us analyze what happens when the user service is down. 100 requests come in. The first 10, which fill the user-service pool, will time out after one second; but because they occupy the pool, all the remaining requests are rejected directly. So 90 of your 100 requests are rejected immediately, without ever calling the user service.
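One way to sketch the per-service pool with immediate rejection is a bounded `ThreadPoolExecutor` with no queue. This is a minimal illustration under stated assumptions, not the talk's actual implementation: the pool size of 10 comes from the 100 rps × 100 ms arithmetic above, a sleeping task stands in for a call stuck on the one-second timeout, and the method name is mine.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the per-service bulkhead pool: 10 threads, a zero-length
// SynchronousQueue, and AbortPolicy, so once all 10 threads are stuck
// waiting on a dead service, further calls are rejected immediately
// instead of queuing -- the "open circuit" behaviour.
public class Bulkhead {
    static int countRejections(int offered) {
        ThreadPoolExecutor userPool = new ThreadPoolExecutor(
                10, 10, 0L, TimeUnit.SECONDS,
                new SynchronousQueue<>(),              // no queue: reject when full
                new ThreadPoolExecutor.AbortPolicy());
        AtomicInteger rejected = new AtomicInteger();
        for (int i = 0; i < offered; i++) {
            try {
                userPool.execute(() -> {
                    try { Thread.sleep(1000); }        // simulate the 1 s timeout
                    catch (InterruptedException ignored) { }
                });
            } catch (RejectedExecutionException e) {
                rejected.incrementAndGet();            // short-circuited request
            }
        }
        userPool.shutdownNow();
        return rejected.get();
    }

    public static void main(String[] args) {
        // 100 simultaneous requests against a hung service:
        // 10 occupy the pool, 90 are rejected straight away
        System.out.println(countRejections(100));
    }
}
```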
Those rejected requests never needed to call the user service at all, the application keeps working as well as it can, and only 10 of your 100 users were affected. We would also want to know: good mechanism, but how fast can we recover? We have 10 requests stuck in the pool. Assume that in the very next second you get 100 more requests and the user service is back up. As soon as those earlier 10 requests time out, the next requests entering this pool are back in the happy case: the service thread pool functions normally, none of its threads are blocked on timeouts, and we call the user service again. So how fast did we recover? We actually recovered faster than one second, but for simplicity of the math, say within one second. Did we design for failures here? I will skip the slide, but if you fit this model against all of those points, you will see that yes, we did.

This is probably the most important slide, and probably where you would start in the real world: how do you test this mechanism? Before you actually build it, you want to know what issues exist in your application. The way I started is using iptables: take one microservice at a time and see what happens when it fails. How do you simulate failures? Two simple examples. One is IP-address based: this is the IP address of the service, and we make Linux drop every single incoming packet from it. Similarly, if you know the port for the service and it is a unique port, drop every packet on that port.

Finally, monitoring. Once you have this mechanism in place, monitor how many requests were rejected from each pool; remember, we had a rejection mechanism, and the majority of requests get rejected when there is a failure in the microservice.
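The iptables failure injection described above might look like the following, run on the host where the web application lives. The address 10.0.0.5 and port 8080 are placeholders for the user service; the exact chains depend on your setup, and these rules need root.

```shell
# Simulate a dead microservice by dropping its traffic (placeholder values).

# Drop every incoming packet from the service's IP address
iptables -A INPUT -s 10.0.0.5 -j DROP

# Or, if the service uses a dedicated port, drop replies by source port
iptables -A INPUT -p tcp --sport 8080 -j DROP

# Remove the rules once the test is done
iptables -D INPUT -s 10.0.0.5 -j DROP
iptables -D INPUT -p tcp --sport 8080 -j DROP
```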
Why would you want to monitor it? In a happy case this number should always be zero; it tells you how unoptimized your mechanism is, and how many actual users got affected because of it. That is it, thank you. For acknowledgments: Alex, my mentor; he has a lot more Twitter content than I do. You can go ahead with questions.

Since the time is up: thanks, Kunal, for your presentation. Kunal will be available here, so you can ask questions offline. Thank you, and the slides are here if you want them.