Sure. Thanks, Anand. Hello everyone, glad to be here. I personally love this paper quite a bit, and I'm happy to have this session. Is it possible to do a quick poll to see how many people have read the paper? Anand, is it possible to do that? I think the best thing might be for everybody to raise their hands for now. Okay, we'll just dive in. Folks, this is a paper that came out in 2013 in the Communications of the ACM, published by Jeffrey Dean and Luiz André Barroso. I think all of us in the programming and computing community know that Jeffrey Dean is one of the godfathers and one of the leading authorities on large-scale distributed systems, so this paper is definitely a goldmine for anyone looking to understand which concepts to be aware of when scaling up systems in the long run. So, let's just dive in. I'm assuming that everybody understands what percentiles are, but just in case: a percentile is a measure that indicates the value below which a given percentage of observations in a group fall. That's the theoretical definition I picked up from Wikipedia, but in the context of our discussion we're talking about percentile latencies. A percentile latency is the latency value below which a certain percentage of requests fall. So if I say my system has a 95th percentile of 100 milliseconds, it means that 95% of the requests processed through the system in a given time window are served under 100 milliseconds. Now, one question that comes up quite frequently whenever we discuss percentile latencies is: what is the time window over which these latency measures are taken? That varies from system to system and scale to scale. Some companies measure percentile latencies at a one-minute granularity, some at an hourly granularity, and some at a day-level granularity. That's a subjective choice, but my takeaway is that as the system scales, as the volume of traffic your system handles per second, per minute, per hour grows, the granularity of the window over which you measure latency shrinks. The reason is that at higher volumes you have more and more concurrent users on the system, so even a short outage or a small blip impacts a larger number of users. Typically, the higher the scale, the smaller the window over which you measure these percentile latencies. Now that we've covered percentiles and latency, what exactly is meant by tail latency? Tail latency refers to the last 0.x percent (it can be x percent as well) of the request latency distribution of your system. Typically we talk about the slowest 1% of response times, that is, everything beyond the 99th percentile, as the tail latency of a system. However, it again varies with the scale of the system you're working on. From my personal experience: while I was at InMobi, we were processing at a rate of almost 70 to 80 thousand requests every second, about 8 billion requests a day, and we were optimizing our systems for the 99.5th percentile latency.
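To make the definition concrete, here is a minimal sketch in Python of computing percentile latencies over a measurement window; the sample values and the nearest-rank method are illustrative assumptions, not anything prescribed by the paper:

```python
import math

# Minimal sketch: nearest-rank percentile over one measurement window.
# The sample values below are made up for illustration; a real window
# at scale would hold millions of samples.

def percentile(samples, p):
    """Smallest value such that at least p% of samples are <= it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100.0 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

# Latencies (milliseconds) observed during one window.
window_latencies_ms = [12, 15, 11, 230, 14, 13, 18, 950, 16, 12]

print("p50 =", percentile(window_latencies_ms, 50), "ms")
print("p95 =", percentile(window_latencies_ms, 95), "ms")
```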
At Capillary, the current organization I work with, we are processing about 500 million requests a day, and we are currently optimizing at the 99th percentile. About three years back, when the system was still handling 80 to 100 million requests a day, we were optimizing for the 98th and 99th percentiles. So as the system scales up, we also tighten the definition of tail latency for the system. Now, what the paper claims is that for responsive, interactive systems, latency has a direct impact on the user experience: the lower the latency of the system, the more fluid it feels, and consequently you get higher engagement on the product as well, because there are no jerks or jitters while customers are using it. Another characteristic which is very important, and which will be relevant for the subsequent slides as well, is that in the responsive, interactive systems we are talking about, reads outnumber writes by a very high degree; we are talking about millions of reads for probably just one or two write operations. The benchmark number the paper gives is that if your systems are able to respond under 100 milliseconds, you can ensure a good experience for your end users. But since this paper came out way back in 2013, I am sure that for larger companies like Google and Facebook this 100-millisecond budget has shrunk even further. And why should we care about tail latency? Tail latency becomes important as the system scales. For a simple illustration, let's assume I have a system processing 10,000 requests a day and I end up scaling it to 100 million requests a day, maybe over a few months or years. At the earlier stage, the slowest 1% of requests amounts to 100 requests performing badly in a day; at 100 million requests a day, that number becomes 1 million requests. So the number of users who end up getting impacted also increases with the scale of your system, and that is why it is very important to optimize your tail latency as volume increases. Here is one of the quotes I picked up from the paper, and it drives the point home. In the distributed computing ecosystem we talk a lot about fault tolerance, failure, and resiliency. Just as fault-tolerant computing means building a reliable whole out of less-reliable parts, a latency-tolerant system means building a predictably responsive whole out of less-predictable parts; these kinds of systems are referred to as tail-tolerant systems. When we look at that statement, the first instinctive response can be that we might need a very large amount of infra to optimize for the last 0.x% of requests. However, one of the assertions the paper makes, and goes on to substantiate with empirical data, is that many of the tail-tolerant techniques it proposes are able to leverage the infra we already configure or provision for fault tolerance, and overall they give higher utilization and more efficiency on that infra.
So, before we move forward, the question we need to ask ourselves is: why does variability exist in the latency of any request or any system? That's a very good question to understand well, because this understanding is what informs the techniques we design to alleviate its effects. The first source of variability typically comes from shared resources. This becomes even more prominent in the cloud ecosystem: once we started moving to AWS, GCP, Azure, Alibaba Cloud, or any of the plenty of cloud providers out there, our applications could be co-hosted with plenty of other applications, and a lot of resource contention can happen. With the advent of containers you can define isolation boundaries, but what we need to be aware of is that even though you are running your containers, the virtual machine the container runs on is still co-hosted on a hypervisor, which is again hosting another neighbor's virtual machine. So you can still run into noisy-neighbor issues. There can also be contention within the app, in the sense that in the same application or web-server container I could have one request responding with data for 1,000 users and another request responding with data for only one user; obviously the first request can end up having an adverse effect on the second. So shared resources definitely add a lot of variability to the response times we see. The second source can be your daemons and background jobs. Now, this is a problem I have faced on plenty of occasions at all the companies I have been part of. We have these background batch jobs that run, and some of the time their execution, and the amount of compute or the number of batches they will process, is non-deterministic; it's very hard to predict what's going to happen when these background daemons start executing, and they might start impacting your real-time requests in a bad way. The first two points we covered so far concern things within the context, or within the control, of the application developers. Next, we come to global resources. When we talk about global resources, we are talking about the underlying infra components, like your network switches, or a shared file system. For instance, in AWS we use the Elastic Block Store; essentially that's a network-attached store, so you are connecting to it over a shared network. It happens quite frequently that another noisy neighbor attached to the same network starts consuming more network bandwidth, and the writes to your EBS volumes can slow down significantly. So: shared file systems, network switches. This problem also happens quite a bit with rack storage, where one box in the rack can end up consuming a lot of the network bandwidth through the top-of-rack switch and starve the other boxes on the same rack. Then there are maintenance activities that happen in the background, like log compaction and data reshuffling; this is very prominent in a lot of non-relational databases like Cassandra, HBase, and Riak, and we will see a reference to that in subsequent slides.
Then you have the different levels of queues in your infrastructure: they could be at the network level (TCP queues), in the OS kernel, or at intermediate hops among the racks. There are other aspects the paper talks about as well, but I will quickly skip through them, because I personally have not been able to appreciate them very much; at least in the last decade we have worked directly with virtual machines, so we haven't had the chance to work with data-center-level hardware directly, and a lot of those points relate to that, except for the garbage collection part. Anybody who has worked with garbage-collected, virtual-machine-managed languages, Java, Golang, or others, will have seen that the garbage collector can kick in as a background process and starve your application of CPU, and that can add a lot of variability to the system. And what happens as a system and its software architecture scale up is that the number of components, and the number of parallel execution paths you have in the system, also increases. You can characterize these parallel execution paths either as microservices, when we are talking about the application side of the request, or as the data partitions or data shards you end up creating in your storage layer. So essentially, the number of concurrent or parallel operations you might perform increases as the system scales, and you also end up doing a lot of fan-outs to process a single request. Let's take Google, in the context of the paper: when you type a search query, there are multiple application services running in the background; one could be spelling correction, another entity extraction, and so on. And I'm sure the Google search index is petabytes, maybe exabytes, of data, so obviously it's going to be spread across thousands of data partitions. To serve a single request, I may have to fan out my request to perhaps thousands of servers, pick up the results from there, do ranking and sorting, and respond back. So when I have such a large fan-out access pattern, the overall latency I see at the request level will definitely be at least the latency of the slowest component; that's the basic mathematical construct we have here. And even with a simple mathematical formulation we can see it: if you're touching a single server that typically responds in 10 milliseconds but has a 99th-percentile latency of one second, then one percent of requests will end up taking more than one second. In other words, there's a one percent chance your request takes more than one second if you're hitting a single server. Now multiply that by the number of servers a single request might touch, bringing in the fan-out aspect: for a single request that touches 100 servers on its path, the chance of the request taking more than one second increases from one percent to 63 percent, because the probability that all 100 servers respond fast is 0.99 to the power 100, which is roughly 0.37. That essentially means that at the top level your percentile latency numbers will definitely get skewed upward, and that's why it is very important to focus on tail latencies and optimize them.
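A quick sketch of that fan-out arithmetic in Python; the per-server 99th-percentile behavior is the paper's example, while the function and its names are just illustration:

```python
# Fan-out math: each server answers within its 99th-percentile budget
# with probability 0.99, and a request is slow if ANY touched server
# is slow, assuming server latencies are independent.

def p_slow(num_servers, p_fast_per_server=0.99):
    """Probability that a request touching num_servers servers
    exceeds the per-server 99th-percentile latency."""
    return 1.0 - p_fast_per_server ** num_servers

print(f"1 server:     {p_slow(1):.1%}")     # 1.0%
print(f"100 servers:  {p_slow(100):.1%}")   # ~63.4%
print(f"2000 servers: {p_slow(2000):.1%}")  # effectively 100%
```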
So, the paper also presents empirical data on how the percentile latency changes as the number of servers in a request path increases. You can see the numbers for 500, 1,000, 1,500, and 2,000 servers, with the 99th-percentile latency on the y-axis: as the number of servers increases, even though the percentile latency each individual service commits to stays the same, the top-level latency degrades sharply. Again, from the empirical data they give at Google scale: the 99th-percentile latency for a single random request to finish is 10 milliseconds, the 99th percentile for all requests to finish is 140 milliseconds, and the 99th percentile for 95% of the requests to finish is 70 milliseconds. What we infer from this is that waiting for the slowest 5 percent of the requests to complete accounts for almost half of the total 99th-percentile latency. So techniques that focus on optimizing the slow outliers can definitely yield a lot more efficiency and better overall service performance. I think I'll take a second pause here. Any questions on this? I don't see any questions yet. Okay, sure, let me move forward. So, now that we have understood the possible causes of variation in response latency, the first question that comes up in our minds can be: are there any ways to control the variability in the system? Let's do a deep dive into that. These are some of the practices the paper recommends, and over the years I've seen other companies and systems propose similar techniques. The first technique the paper proposes is: can we prioritize interactive, real-time requests over background requests? An interactive or real-time request means a user is actually using the system in real time and expecting the response to come back, versus a background request, which could come from a cron job or a batch system running in the background, with both of them trying to access the same service at the back. They go on to explain the example of the Google file system: rather than letting the OS kernel or the network stack decide which request should be processed next, the decision making is passed up to the file-system software, which keeps its own priority queues and decides, for instance, that a request for a single key should be prioritized over a scan request coming from a background job. And incidentally, a similar approach is described by AWS; I've already given a reference over here. They say requests should be prioritized differently, not every request prioritized the same. The example they give is a paginated scan operation on an API: they would rather prioritize a request for a later page over a request for a first page, because if you fail the request for the last page, that client will end up redoing the complete scan of all the pages once again. So under load, they start rejecting requests for the first page. A similar concept, though for a different objective, load shedding. But the idea is that we can come up with techniques where not every request is prioritized in the same way, and that knowledge can be embedded into your application servers to alleviate the latency concerns, or even used for load shedding as done by AWS.
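As a rough illustration of differentiated service classes, here is a minimal sketch of an application-level priority queue that serves interactive reads before batch scans. This is a hypothetical toy, not how GFS or any AWS service is actually implemented:

```python
import heapq

# Minimal sketch: the application keeps its own priority queue rather
# than letting the OS-level queues decide the order of work.

INTERACTIVE, BATCH = 0, 1  # lower number is served first

class RequestQueue:
    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker keeps FIFO order within a class

    def submit(self, priority, request):
        heapq.heappush(self._heap, (priority, self._seq, request))
        self._seq += 1

    def next_request(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

q = RequestQueue()
q.submit(BATCH, "scan: nightly full-table export")
q.submit(INTERACTIVE, "get: profile for user 42")
print(q.next_request())  # the interactive read jumps ahead of the scan
```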
The second technique is reducing head-of-line blocking. What this refers to is breaking your long-running requests, requests that are going to do a lot of scans or a lot of operations, into a sequence of smaller requests, and firing those requests with some interleaving so that other short-lived requests get a chance to consume the resources the application is providing. The example here can be: let's assume I fire a search query for Cristiano Ronaldo. I know that this kind of query is going to end up hitting a lot of shards, and I'm going to get maybe millions if not billions of results. Now, this query can be broken down into multiple sub-queries underneath, and these sub-queries can be fired in an interleaved way across the multiple index shards that you have, so that a narrower search query arriving at the same time, say Cristiano Ronaldo's performance in the FIFA World Cup, which will obviously yield far fewer results, can sneak in between them on the index shards (you can see a small sketch of this interleaving below). Another example maps to the pagination case I gave earlier: if you're trying to scan a lot of data, instead of having large page sizes of maybe 1,000, 2,000, 3,000 records, you break your request into multiple smaller pages and let your clients iterate over the pages. This is a technique we have applied at Capillary. We have a loyalty wallet system, which acts somewhat like a money or banking system, and there we don't allow any customer to run more than 50 scan requests a day, because essentially each scan reads through the points ledger, the financial ledger of a user, and that ledger can grow into thousands or millions of records. That allows us to ensure that all requests get a chance to consume the service we are exposing. Another technique they propose ties back to the background operations we referred to earlier, the cron jobs or batch services that can come and disrupt your real-time systems: you synchronize the background operations in some form, rather than having these background jobs run at random points in time and disrupt the services at will. We can either throttle these background jobs using some techniques, or we can synchronize them to run at the same time. The paper reports that when all the background jobs run at the same point in time across all of their systems, the overall tail latency they see is much better than when these background jobs run at random times. When the batch job runs in the background, maybe for a five-minute period, you will see all of the requests slowing down, but for the rest of the time slices you will not see them slow down at all. So rather than scattering your background jobs throughout the day, you synchronize them together. A similar technique we have applied at Capillary: in some of our systems we have security scans that run at night, and we have synchronized them to a fixed time window around 4 a.m., which is during off-business hours when request volumes are low, so the overall performance impact on the system is much lower.
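Here is the promised minimal sketch of the interleaving idea in Python; the page size, request shapes, and names are illustrative assumptions:

```python
from itertools import zip_longest

# Minimal sketch of reducing head-of-line blocking: one long scan is
# split into small pages, and the pages are interleaved with short
# interactive requests instead of monopolizing the worker.

def scan_pages(total_records, page_size=50):
    """Break one long scan into many small page requests."""
    for offset in range(0, total_records, page_size):
        yield ("scan-page", offset)

def interleave(long_pages, short_requests):
    """Alternate long-job pages with short requests so the short
    ones never wait behind the entire scan."""
    for pair in zip_longest(long_pages, short_requests):
        yield from (item for item in pair if item is not None)

short = [("get", "user-1"), ("get", "user-2")]
schedule = list(interleave(scan_pages(250), short))
print(schedule[:4])
# [('scan-page', 0), ('get', 'user-1'), ('scan-page', 50), ('get', 'user-2')]
```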
Another claim the paper makes, which makes sense if you spend a couple of minutes on it, is that caching does not impact variability much, unless the whole dataset can reside in the cache. That definitely makes sense when we are talking about data on the order of terabytes or petabytes: the cost of memory and RAM is reducing over the years, but assuming that I will have petabytes or terabytes of cache or RAM available to me seems fairly infeasible, especially from a business and cost perspective. So, since you cannot cache all of the data in memory, caching techniques alone will not control the variability. So variability is inevitable; you cannot escape it. We have to live with it and understand that variability is going to exist in any kind of software system we build, no matter whether we are building it on the cloud, using containers, a private cloud, or our own private infrastructure. So how do we handle it, and how do we find techniques to cater to this variability? Especially with the advent of the cloud, where the number of shared components increases and they are totally unpredictable, the variability is going to be much higher. So what the paper goes on to explain is two classes of techniques, which it calls tail-tolerant techniques. The first class is within-request immediate-response adaptations. This class of techniques focuses on reducing the variation within a single request path. Essentially, what we are saying is that I have a single request flowing down my service graph; when I say service graph, I mean that, as we mentioned, there can be fan-outs in the system, with multiple microservices invoked in a request path. What we try to do in this class of techniques is optimize the variability for only this one particular request, so these techniques are effective when you are looking at a time scale of just 10 to 20 milliseconds, the 95th to 99th percentile range of a single request, essentially. The second class of techniques is called cross-request long-term adaptations. What they essentially mean is that you look at measures which help you optimize the tail latency at the whole-system level. We are not talking about a single request or a chunk of requests; we are talking about the overall system, and about techniques which are going to impact each and every request that flows through your system. The time scale here basically goes from 10 seconds to maybe even higher. So let's jump into the within-request immediate-response adaptations, where we are trying to optimize a single request in particular. What these techniques aim at is coping with a slow subsystem in the context of a higher-level request.
Essentially, I have a request which gets fanned out to multiple microservices or sub-services, and we are trying to alleviate the variability, or the degradation that happens due to the bad performance of one of the downstream systems, at the level of the overall top-level request. That is what it essentially means. And the time scale is right now: the user is still waiting and expecting a response to come back, because we are in the context of interactive systems. Now I would like to bring back the point I made initially, that we are talking about interactive systems where the number of read requests outnumbers the write requests. When you have a lot of read requests in a system, one of the common approaches is to create multiple replicas to get more read throughput capacity on your system. These replicas could be of your application services, they could be of your databases or data storage layers, and they could also be of your caching layers. This definitely gives you a lot more availability in the presence of failures, and we will see how it can also help in terms of reducing latency. The approach of adding more replicas to optimize your reads is particularly effective when most requests operate on largely read-only, loosely consistent datasets. This is a very important phrase to understand here, so I would rather take a quick one-minute pause and see if anyone has any questions about why these techniques are effective for largely read-only, loosely consistent datasets.
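To make the replica idea concrete, here is a minimal sketch of routing reads across replicas; the least-outstanding-requests policy and all the names here are illustrative assumptions, not something the paper prescribes:

```python
# Minimal sketch: with several read replicas, each read can go to any
# of them, spreading load and routing around a momentarily slow node.

class ReplicaSet:
    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.in_flight = {r: 0 for r in self.replicas}

    def pick_for_read(self):
        # Least-outstanding-requests: a cheap proxy for "least busy",
        # which tends to steer reads away from a slow or loaded replica.
        return min(self.replicas, key=lambda r: self.in_flight[r])

    def start_read(self, replica):
        self.in_flight[replica] += 1

    def finish_read(self, replica):
        self.in_flight[replica] -= 1

rs = ReplicaSet(["replica-a", "replica-b", "replica-c"])
r = rs.pick_for_read()
rs.start_read(r)          # ... perform the read against r ...
rs.finish_read(r)
print("read served by", r)
```

Note that this only works cleanly because the dataset is largely read-only and loosely consistent: any replica's answer is acceptable even if it lags slightly behind the writer.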