Good evening, everyone. So this is a talk about a solution that we've developed here with my students on how to improve user experience when web servers are overloaded. This is joint work with a number of my students: a couple of BTech students, an MTech R&D student who has graduated, and my current set of MTP students, Stanley, Murli and Ramli.

So coming to this problem of web server overload, I'm sure all of you have experienced it at some point. It's a fairly standard thing. As Shivram said in the introductory email, you've booked tickets on Tatkal, and you've probably seen IRCTC crash on you a couple of times, right? So what is this problem? We have a web server that has, say, a nominal capacity of five requests per second; that is, it can process five requests per second. And we are specifically concerned with web servers that are serving dynamic content, not static web pages that can be offloaded onto CDNs and the like, but websites like a ticketing portal, IRCTC, a travel website: sites where, when a user request comes in, the web server has to expend some resources, contact the database, do some calculations and return a result, due to which it has a finite, fairly low capacity. Say, in the toy example, it has a capacity of five requests per second. So if five users per second come in, they are served. But if more users than that come in in any given second, the queue starts to build up. The server has a small queue; the queue overflows, and eventually the users coming in beyond the capacity will not get served. They will perceive what is called a website crash.

And note that this is not just a problem of web server provisioning. You could say, well, build your web servers so that they can handle whatever load you're going to get. It is not just a problem of provisioning, because transient overloads happen frequently even for the most well-provisioned servers. For example, the classic case of IRCTC: the IT infrastructure is actually pretty well designed to handle a few thousand requests per second. But at peak load, when Tatkal booking opens, they get a few lakh requests per second for a short period of time. It does not make economic sense to provision the server for this very occasional peak load. Therefore, websites have to be designed to handle these transient overloads, even if the web server is well provisioned to handle the average load in the long term. So this is a problem that will not go away with just proper provisioning. And the end result is poor user experience: people see websites crash.

So there have been some solutions. Web server design is an old problem, and people have looked at various solutions to web server overload. The classic approach is a set of techniques to improve server capacity. For example, if you have a server with capacity five requests per second, add another server replica, load balance your traffic across the two, and your capacity doubles. You can do load balancing at layer seven, layer four, layer three; there are various ways in which you can improve server capacity by load balancing across replicas, and most modern web servers actually run over several replicas in data centers. So this is one way to avert web server overload.
The other way is, of course, to put in some kind of a proxy that does admission control. Whatever excess load comes in beyond what the server can handle, you put something in front that denies admission to users; you don't let this extra traffic even reach the web server. So these are roughly the two classes of solutions that have been proposed for the problem of web server overload: you either beef up your server, or beyond whatever limit you've set for the server, you turn away the rest of the traffic.

So now, what about these users that get turned away? What happens to their user experience? Well, if you are lucky and the proxy is polite enough, it will accept your HTTP request and return a 503 Service Unavailable response to you. And sometimes, if the entity doing the admission control is too busy to even do that, your TCP connection simply times out and you see a "connection timed out" page. So either your HTTP request fails or your TCP connection fails, and you're told to try again. But at this point, there's really no methodical way to decide when the user should retry. Fine, the server is overloaded, you've hit capacity and there are all these users being turned away, but there is no guarantee of eventual service. There's no methodical way in which the user is told, please try now and you will get served. So people keep hitting refresh, refresh, refresh, and if you're lucky you get through at some point. That is how retries work today.

So our idea is to build something that tries to improve the user experience during periods of transient overload. For example, if a web server were using our WebQ system, the user interface would look like this. If you reach the server at a point where it is overloaded, you will not get a 503 Service Unavailable or a timeout. Instead, you'll be put in a virtual queue of users. You'll actually be told there are so many people ahead of you, and this is your wait time in the queue. You'll get an HTTP auto-refresh page that will take you to the website after the wait time has expired. And when you do reach the server, you will get guaranteed service.

So what we're trying to do is simple. If you've been to a bank, you've received one of those tokens, right? It's just that. There's this burst of traffic, and we are smoothing it out to match server capacity so that transient overload is averted. And we do this by assigning wait times: instead of returning a Service Unavailable page, we return a web page that auto-redirects after a certain wait time. Obviously, all the magic is in how you compute this wait time and make all of this scalable. But this is the basic user interface that the user will see. Is that clear, any questions?

Right, so this is the high level architecture; I will go into each of the pieces later. So how does all of this work? Between the user and the web server, both of which don't have to be aware of... yes, yes, I will cover that, just give me a couple of slides, I will get there. So for a WebQ-enabled website, you would ideally not have to hit refresh. You go to the website; if it is not overloaded, you go through. Otherwise you get a refresh page, and you had better wait out that refresh time. There's no incentive, in fact there is a disincentive, to hitting refresh; I will get to that. So how does the system work? It has two entities, what we call a token gen and a token check.
So initially, a web server will choose to redirect some of its queries through our WebQ proxy, the token gen proxy. How does this happen? This is pretty standard: you use DNS. Websites today regularly host a lot of content on content distribution networks, CDNs. So when you know that a certain URL will generate a resource-intensive request, say the user has filled up a page with all the details of the train ticket that he wants to book, then when he hits submit, that URL will not go directly to the web server. Instead, that URL will point to a token gen proxy; the user's request will be redirected through the WebQ proxy. This is about the only thing the website administrator has to do: he has to decide, these URLs generate a lot of load, and I want them to go through this proxy so that I don't get overloaded.

So once the requests come here, token gen will assign a wait time to each request. If the website is not overloaded, you get a zero wait time and can immediately go to the server. If the website is overloaded, you are put in a queue and get a wait time according to when you turned up. The user's browser waits for the desired duration of time and then goes to the web server, at which point another proxy that we call token check acts as a simple inline proxy: it intercepts the request, sends it to the web server, and hands the response back to the client. And these two proxies talk to each other. They convey some feedback to each other about the server performance and do some basic capacity estimation. For example, token gen needs to know at what rate it should send requests to the server, so token check will give it basic feedback, like: the server looks overloaded now, it's taking a lot of time to respond to requests, so slow down. So they do some adaptive calculations between them to shape traffic in this manner. A couple more slides will fill in some of the missing pieces here.

So at a high level, there are two proxies: the first proxy that the user goes to simply gives you a wait time, and the other proxy just lets you through to the server. Why split them into two? Because they do two different functions; maybe after the next slide it will become a bit clearer. But the primary reason is that token check is an inline proxy, and it is easy for it to get overloaded. Token gen, on the other hand, is an out-of-band proxy: as soon as a request comes in, it simply returns a wait time and does not keep any state. So the reason for the separation is scalability; each one is doing a different job and needs to scale separately. That will become a bit more clear in a little while. One more slide, please, okay? It will become clear.

So how is this wait time computed? At a high level, the idea is straightforward. A wait time is assigned in seconds, and we return an HTTP refresh page with that wait time. Token gen knows the capacity of the server; there is a capacity estimation module running that knows the capacity of the server, and how that is done we will come to a little bit later. So what token gen does is: it knows this capacity of five requests per second, it tries to schedule five requests per second to the server, and it remembers how many requests it has already scheduled to the server into the future.
So it maintains an array where the ith element says how many users have already been told to go to the server i seconds into the future: three seconds into the future, four seconds into the future, and so on. It maintains this wait time array. So when a new user comes in, it finds the first slot in the array that is available: one, two, three, four, five... six seconds from now, I've only sent two users so far, and I could potentially send up to five users, given that's the capacity. So it fills up that entry and returns a wait time of six. Each entry of this array corresponds to how many users have been scheduled to the server in that ith second. Since we're dealing with wait times at the granularity of seconds, if you deal with any other granularity, the way this array is structured will obviously change. And how the capacity is estimated, and how we do that in a distributed fashion, again, I will come to a little bit later.

So the main question here is, can users jump the queue? What if I get an HTTP refresh page that says redirect after 20 seconds, and I hit refresh again? One way is to keep per-user state: you note down the IP address of every user, and these two proxies talk to each other and somehow prevent users from jumping the queue. But that is highly resource intensive and hard to scale, so we would like to do this without maintaining any per-user state. How do we do that? These proxies share a secret key. Whenever a user shows up, token gen, along with computing the wait time, also computes a token, which is basically a hashed MAC, a keyed hash using the secret key over three things: the user's IP address, the wait time, and the current time at which the user showed up. If I showed up at, say, time 150 seconds, I'm assigned a wait time of five seconds, and something is my IP address; I compute a hashed MAC over all of these things, and that is my token. This token is returned to the user, and beyond that, that's it: token gen forgets about the user, it does not keep any per-user state. The token is embedded into the URL that is returned to the user, in the HTTP refresh-after-20-seconds page, along with the wait time.

So after you wait your time and go to token check, what does token check do? It extracts whatever wait time and timestamp the user is reporting, and it recomputes the hashed MAC: it knows the secret key, so it recomputes the hashed MAC and checks that the time the user is showing up at is consistent with the timestamp and the wait time he's reporting. So what happens if I try to manipulate the timestamp and the wait time? If I'm told to wait for 20 seconds, but I change my URL, only wait for five seconds and show up at token check, this hash will not match. And assuming you cannot generate a hashed MAC without having the secret key, and assuming you do not have the secret key, there is no way a user can show up at a slot other than his assigned slot. Does that answer it? So when you are told to go after 20 seconds, your browser automatically redirects after 20 seconds, and you had better go only after 20 seconds: if you try to go earlier, or you try to go later and reuse the token at a later point, it will not work.
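To make this concrete, here is a minimal sketch of the token gen slot assignment and the HMAC-based token, roughly as described above. The names, the one-second slot granularity, the use of Python's hmac module, and the exact message format are illustrative assumptions, not the actual WebQ implementation.

```python
import hashlib
import hmac
import time

SECRET_KEY = b"shared-secret-between-token-gen-and-token-check"  # hypothetical

def make_token(client_ip: str, arrival_ts: int, wait_s: int) -> str:
    # Keyed hash over (IP address, arrival time, assigned wait time).
    msg = f"{client_ip}|{arrival_ts}|{wait_s}".encode()
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()

class TokenGen:
    def __init__(self, capacity_per_sec: int):
        self.capacity = capacity_per_sec   # estimated server capacity C
        self.scheduled = {}                # absolute second -> requests already scheduled

    def assign(self, client_ip: str) -> tuple[int, str]:
        now = int(time.time())
        slot = now
        # Find the earliest future second that still has spare capacity.
        while self.scheduled.get(slot, 0) >= self.capacity:
            slot += 1
        self.scheduled[slot] = self.scheduled.get(slot, 0) + 1
        wait_s = slot - now
        # Return the wait time and token; both are embedded in the auto-refresh URL.
        return wait_s, make_token(client_ip, now, wait_s)

class TokenCheck:
    def verify(self, client_ip: str, arrival_ts: int, wait_s: int, token: str) -> bool:
        # Recompute the HMAC; a tampered wait time or timestamp will not match.
        if not hmac.compare_digest(make_token(client_ip, arrival_ts, wait_s), token):
            return False
        # The client must show up in (roughly) its assigned one-second slot.
        return abs(int(time.time()) - (arrival_ts + wait_s)) <= 1
```

The key point is that token check can validate a request with one HMAC computation and no per-user state; only the secret key is shared between the two proxies.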
So there is some granularity of one second, or whatever granularity you assign wait times at, and this token checking will only work at that granularity. Yes? Yes, it is pretty lightweight: HMACs, symmetric-key cryptography, are pretty cheap anyway, and from what we've measured this adds very minimal CPU overhead and latency.

Yes? Could someone just flood this? You could do that even today. The classic response to all of these questions in the security literature is that you're not opening up any more attack vectors beyond what exists today. You can flood a server even today, so this is not a solution against application-layer DDoS attacks. All this will do is, if you're getting a DDoS attack, you're getting a spike of traffic, it will only smooth it out; and sure, if you've filled up the queue into the future, you've filled up the queue. So this token helps us ensure that users follow the wait time we've prescribed, and yet without maintaining any per-user state, for scalability.

So what are the pros and cons of this design? The first thing is that this can be deployed as a third-party service on demand, much like a CDN. Why are CDNs so popular today? Because you can amortize the cost of content distribution across many websites. Similarly, something like this can be a third-party service in a cloud, and websites subscribe to it when they expect overload, or only at the time of overload, and it does not touch any aspects of the clients or the server. So it can be a third-party service sitting in the middle, completely on demand.

And the entity token gen is the only one that will face overload. Note that once you've smoothed out the burst, scaling token check is easy, because token check only needs to handle as much load as the actual server. If your actual server is doing 200 requests per second, token check only needs to handle 200 requests per second; if the wait time mechanism is working fine, it should never see more load than what the server is already seeing. So scaling token check is easy, you can scale it much like you scale the server, and in fact, if website admins so desire, they can integrate the token check functionality within the web server itself, because in the end it is just computing a hash, which is a very small computation. Token gen, on the other hand, is the one that's going to see this huge spike of load, and for this we have a distributed, scalable design so that it alone is robust to overload. This is one of the reasons for the two-proxy design: the two proxies have different jobs, and scaling token gen is a lot easier than scaling a full-fledged web server.

The other advantages are that the clients are completely unmodified, and the web server is also mostly unmodified, except that the server admin has to identify the resource-intensive URLs and set up redirections. Beyond that, both endpoints are unmodified. Yes? Yes, token gen has a distributed architecture, so it can scale to however much is needed to handle that overload. Yes, so that's my next point here. One of the cons is, of course, that all of this will fly only assuming there is user acceptance.
So a user would much rather be told that they are in a queue and should wait 20 seconds than keep hitting refresh impatiently. That is a fundamental assumption. We have no user studies to back up this assumption, but we are assuming a known wait time with feedback is better than a nondeterministic wait time, even though the latter might turn out shorter if you get lucky. So if token check sees such early requests, yes, it has to check the hash and throw away the request; those requests will not go to the server. But since the user has no incentive to do that, it is assumed that the load on token check stays at a low enough rate. If users are doing that anyway, then you have to worry about scaling token check as well.

So the first assumption here is user acceptance: that the user would rather be placed in a queue. And this is for transient overload, not persistent overload. If your server can only handle 1,000 requests per second and you're getting 10,000 requests per second for an entire day and you're handing out wait times of two years, nobody is going to use it. So this assumes you have short bursts of overload, which bring servers down and destroy user experience, and this only smooths out those transient overloads; for that, we're assuming people are willing to accept this user experience. The other assumption is that the overhead of redirection is acceptable: that users are willing to go to one proxy instead of directly to the server, and websites are willing to pay this extra cost for the benefit of a better user experience.

Yes? No, that is not an assumption, I will come to that later. We are not assuming a single request type, as I've informally done so far for ease of explanation; that is not an assumption. Yes, exactly, we will take request hardness into account, that's a good question. Yes, or whatever C users, where C is the capacity, yes. So we do have a server capacity estimation module. If we find that the server is getting faster and the capacity has increased, we can fill in the slots accordingly. We are not taking this value of five as an input from the server; we are estimating it ourselves based on how fast the server is clearing requests.

Yes, so when server capacity changes, that's a good point. If you've filled up the slots with five, five, five, five and suddenly you realize the server capacity is ten, then some user who comes later might fill up this extra slack in some of the earlier slots. So when server capacity changes, you are not guaranteed first-come-first-served semantics. But we have a server capacity estimation module that runs periodically, and after that we assume there is a certain period for which server capacity stays constant; during the transient period, yes, that can happen. And once I've sent somebody a redirect page for 20 seconds into the future and I then find that some vacancy in slot one has opened up, I cannot recall him; I have no control over that user. So yes, that is also a possibility: you can underutilize the server. If first-come-first-served is that important, you can always accept some underutilization. Yeah, again, that is the user acceptance assumption: we are assuming people are fine with it and they follow the protocol.
If you're told to come back after 20 minutes and you're automatically redirected to the website after 20 minutes, you had better be around to use the service, otherwise your slot goes unused. So one answer is that our server capacity estimation will pick this up: if there is a roughly constant rate of abandonment, it will figure in the capacity calculations, because the capacity is calculated by observing the server response times. So if some x% of users always abandon, that figures into our calculation, and you effectively overprovision, like how airlines overbook tickets by some percentage assuming cancellations. That happens implicitly in our calculations, but we are not specifically accounting for it.

No, why? Because you are in a first-come-first-served queue. If there are a thousand tickets and you are the 200th person to show up, it doesn't matter whether you get service after two minutes or three minutes, as long as your slot in the queue is reserved for you. Waiting for two minutes is fine assuming nobody else is rushing to the server and breaking the queue; everybody follows the queue, so whether you log in at eight o'clock or ten o'clock, you should be fine. It's only unfair when capacity changes; when capacity is fixed, the queue is fair. And we are not assuming that in those five minutes when Tatkal opens up, the server admin is somehow reconfiguring the backend and changing capacity. So assuming capacity is fixed, the queue is fair.

It's basically like a queue. It's like saying that standing in line for tokens at a bank is fine; if people rush in, are unruly and push past you, that's life. If your network link happens to be a tad slower and you end up a bit behind in this queue, well, the alternative is much worse, right? So it feels like you're sitting in a sweet spot between a situation where you don't need this at all, and a situation like Tatkal where, on average, nobody is going to get in? That's not true, you've probably had a very biased experience of Tatkal. It's really not that bad; it does work. The overload is there, but assuming token gen doesn't come down with the overload, and assuming some roughly equal distribution of network latencies, you should be in a fair enough spot in the queue.

Yes? Which is why I said not all requests need to go here. Only the requests that take up resources: the request that has to access the database, the request that has to access the payment gateway. Only those go through a queue. And in a sequence of requests, you can treat each resource-intensive step as a separate queue, or you can just put one queue at the beginning, and once you get in, you get everything. The simplest case is that each step that consumes resources is put in a queue, depending on the extent of overload you have. In a separate queue? In a separate queue, yes. Or you could choose otherwise; where you decide to put the queue is orthogonal to this. Assuming all users have similar profiles and take the same sequence of steps, you could have one queue at the beginning; or if a lot more users just want to see the schedule and very few actually book, you could have separate queues for the two steps.
But only the resource-intensive steps: we don't intend to put static pages and the forms you have to fill behind this. It's only the database accesses, those kinds of things, that have to go through this, because the other things a website can presumably handle without crashing.

So for the rest of the talk, I'll cover how we estimate capacity, the distributed token gen architecture, and a few results. So how do we estimate capacity? This will answer some of these questions about the server. Yes, why would the system go down? It depends on the reason. If token gen is overloaded, it goes down; if, in spite of a distributed architecture of ten replicas, all ten of them go down, well, then the system goes down, obviously.

So how do we estimate capacity? Here is some simple queuing theory, or even basic systems stuff. Here are two graphs: the goodput and the response time of a server. First, consider a very simple web server that has only one type of request, and each request takes a fixed amount of resources at the server. On the x-axis, I'm increasing the input load to the server: 20 requests per second, 40 requests per second, 60 requests, and so on. And the capacity of the server is somewhere around 100 requests per second. So as I increase the input load, what happens? The goodput of my server, the number of requests it is actually serving, increases as long as you are below capacity: if your capacity is 100 and you give me 20 requests, I can serve all 20; if you give me 200 requests, I can only serve 100. So goodput increases linearly below capacity, flattens out once you hit capacity, and might even drop if users are retrying a lot. The response time, on the other hand, is fairly low as long as you are below capacity because there is no queuing, but once you get near capacity, the response time increases sharply because of queuing. These are measurements from a simple web server running a simple PHP script in the background.

So this ratio of goodput to response time is called the power ratio, and it is known to peak somewhere around capacity. Why? Because once you cross capacity, your response time increases sharply, and response time is in the denominator: the ratio is goodput over response time. So the power ratio peaks somewhere around capacity. One way to find the capacity of the server is of course to ask the server admin, or to find out the service time of requests by peeking into the server. But if we do not want to peek into the server, if we assume no server support, then one way is to keep sending requests at different input loads to the server, probe this power ratio at different points, and find out which point gives you something close to the peak of the power ratio. This is what we try to do in WebQ.
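As a concrete illustration, here is a minimal sketch of this peak-finding over the power ratio, assuming token check has recorded, for each observation epoch, the offered load, goodput, and mean response time. The epoch format and function names are illustrative assumptions; in practice one would fit a curve to the samples rather than just take the raw maximum.

```python
def power_ratio(goodput: float, mean_response_time: float) -> float:
    # The "power" of an epoch: goodput divided by mean response time.
    return goodput / mean_response_time

def estimate_capacity(epochs: list[tuple[float, float, float]]) -> float:
    """epochs: (offered_load, goodput, mean_response_time) per observation
    interval, e.g. 10-second windows measured at token check."""
    best_load, best_ratio = None, float("-inf")
    for load, gput, resp in epochs:
        r = power_ratio(gput, resp)
        if r > best_ratio:
            best_load, best_ratio = load, r
    # The load level where the power ratio peaks is taken as the capacity estimate.
    return best_load

# Example with a server whose true capacity is around 100 req/s:
# estimate_capacity([(20, 20, 0.05), (80, 80, 0.07), (160, 100, 0.60)]) -> 80
```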
So if you want to probe at a certain input load, there are two ways. One is to give that input load explicitly as the capacity to token gen and make token gen schedule C requests per second for some observation interval. Token check, because it intercepts every request to the server, can monitor these requests and report the average response time and goodput of the server. It gives this feedback back to token gen, which can then say: at this load, this was the power ratio; at that load, that was the power ratio; and try to map out this curve and find the capacity. The other way is to not let token gen do any shaping at all: it's a no-op, it simply redirects whatever traffic comes in. If the incoming load naturally fluctuates, and you observe long enough to cover a good range of low and high load levels, you see what the goodputs and response times are, and based on that you can probe this power ratio at various levels just by observing what the server is doing. So this is intuitive: if you see the server is getting back to you very fast, okay, maybe you can try a higher value for capacity; if the server is responding slowly, you use a lower value. All of this is basic queuing theory.

But this breaks down if you have a real web server with multiple types of requests. These are actual power ratio samples: each point here is a 10-second interval where we've sent a certain number of requests to the server, observed the goodput and response time, and plotted the power ratio. Here the backend is actually a Moodle server serving two types of requests: requests to display the course webpage for courses with two different types of content. So the two types of requests consume different amounts of resources at the web server, which will be the case in any real web server: there will be different types of requests consuming different amounts of resources. And if you look at the power ratio samples for this, they are not as neat as the curve I showed you before. Why? Consider two points here, this blue point and that green point. They both have the same load in terms of requests per second, but the blue point presumably has more easy requests, which is why the average response time in that epoch is lower and therefore the power ratio value is higher. Another epoch, which happened to have a lot more hard requests for the same input load, has a lower power ratio. This is what will frequently happen with a real web server, and the clean idea of "draw the power ratio curve, find the peak" will not work.

So what do you do in such cases? Suppose you know a priori that of these two types of requests, one request is 0.25 units of work in some unit, CPU time or whatever the bottleneck resource is, and the other request is one unit. Suppose you somehow knew this relative hardness of the requests. Then in all my calculations of input load and goodput, I no longer think in terms of requests but in terms of these abstract units: whenever a type one request comes in, I add 0.25 to my input load, and whenever a type one request leaves the server, I add 0.25 to the goodput. So in all my calculations, I treat request type one as 0.25 of a request and request type two as one request. If I do this scaling, knowing the relative hardness of the requests, and redraw the same points from the previous graph, this is how they look.
The minute you scale requests by the correct relative hardness, the power ratio values once again align neatly on a curve, and now it is again easy to identify capacity, somewhere around the peak of the power ratio. Yes? Yes, they can share the same bottleneck resource; this is the number of units in terms of the bottleneck resource. Are you asking about the relative ordering of the requests within that epoch itself? So one request consumes five milliseconds at the database, another request consumes ten milliseconds at the database; the relative hardness is one to two, that's it. And sure, there will be different numbers of them; we take that into account in the total load. So the total capacity of my database is, say, 500 units, whatever that unit means, and one request consumes so many units and another request consumes some other number. I'm not sure why an independence assumption would be needed; in fact, very frequently they will be targeting the same resource, and in that Moodle experiment they were targeting the same CPU slash database.

So here, the capacity estimation problem boils down to estimating the relative hardness of the requests. Here there are two types of requests; there could be four. If, between all of them, you can accurately identify the relative hardness, then you can use this simple power ratio idea to get the capacity of the server. A simple algorithm to do that would be: collect some sample data with different mixes of requests, that is, different ratios of type one, type two, type three requests, and different total load levels; bombard the server and collect, for each epoch, the goodput, response time, and offered load; guess a value of relative hardness, say one to two to three, whatever; scale all your samples by that guess and see if they fall neatly on a curve; try to fit a curve and calculate the regression error. If you've guessed the wrong relative hardness, your regression error is going to be high, so go back and try again. So this is essentially searching over the space of all relative hardness vectors. Of course, you could do an exhaustive search, but that would be infeasible, so you can use a technique like simulated annealing, which is what my student has used here, to search over this space of relative hardness values. And how do you know you've hit the right value? You find that your regression error goes down. We've plotted regression error versus the relative hardness values, and when you are close to the ideal value, that is when your power ratio values intuitively make sense, because you've weighted the different requests by the right amounts. The minute you get into that region, you've found the relative hardness, and once you've found the relative hardness, it's easy to identify the capacity of the server in terms of these subunits of work.
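A rough sketch of this relative-hardness search is below: scale each epoch's request counts by a guessed hardness vector, check how well the resulting (scaled load, power ratio) points fall on a single curve, and keep the guess with the lowest regression error. The talk mentions simulated annealing; this sketch uses a plain random search instead, and the cubic polynomial fit and all names are illustrative assumptions.

```python
import random
import numpy as np

def regression_error(epochs, hardness):
    """epochs: list of (offered_counts, served_counts, mean_resp_time),
    where the counts are per request type for one observation interval."""
    loads, ratios = [], []
    for offered, served, resp in epochs:
        load = sum(c * h for c, h in zip(offered, hardness))  # offered load in work units
        gput = sum(c * h for c, h in zip(served, hardness))   # goodput in work units
        loads.append(load)
        ratios.append(gput / resp)                            # power ratio of this epoch
    coeffs = np.polyfit(loads, ratios, deg=3)                 # one curve through all points
    fitted = np.polyval(coeffs, loads)
    return float(np.mean((np.asarray(ratios) - fitted) ** 2))

def search_hardness(epochs, num_types, iters=2000, seed=0):
    """Random search over hardness vectors; the first type is pinned to 1.0,
    so only relative hardness matters."""
    rng = random.Random(seed)
    best, best_err = None, float("inf")
    for _ in range(iters):
        guess = [1.0] + [rng.uniform(0.1, 10.0) for _ in range(num_types - 1)]
        err = regression_error(epochs, guess)
        if err < best_err:
            best, best_err = guess, err
    return best
```

Once a hardness vector with low regression error is found, all loads and goodputs are expressed in these work units, and the same power-ratio peak gives the server capacity in units rather than requests.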
So does this actually work? We've tested it with a Moodle backend with three or four types of requests, and we found that the estimated capacity is within 6% of the actual ground-truth value, with a training period of about a few minutes. We found that an epoch size of 10 seconds is needed to get enough data to accurately calculate these average response times, and we needed a few tens of points to fit the curve reliably; so we required a training period of a few minutes, and we could get pretty close to the actual capacity of the server. And note that all of this was done without using any server metrics. A lot of the previous literature on web server capacity estimation assumes you know the service time, or can measure the service time at the server, or can instrument server code to calculate queue lengths, and so on. This assumes zero support from the server: it is a black-box technique to estimate the capacity of the server, and we still get reasonably close to the actual capacity, which is good enough for our purpose of shaping traffic. We can also be a bit conservative, take a lower bound on whatever we get and leave a little slack, and this works reasonably well in the context of WebQ.

So there is a small observation period, and after this the estimated capacity is given to token gen, so that it can use it to assign wait times and shape traffic. This training period is still a little longer than we would like, so as part of future work we plan to remove some of the assumptions around it. We also assume that the server capacity does not change much during the training period itself, because that would completely mess things up. Once the training period is done and you've identified the capacity, we can also detect changes in capacity with the same idea: keep monitoring your response times and goodputs and keep fitting points onto this curve; the minute you find that the points are no longer lying on the curve you fitted, you know your capacity has changed. If your server somehow got faster, say the backend added another replica, or if a replica failed and for the same input load your response times are getting longer, this technique can detect that and recalibrate itself. The recalibration takes a few minutes, but it can detect a change in capacity as well.

Yes? Yes, after every 10 seconds you know these are the types of requests and these are the counts of each type. You don't know the relative hardness, but we are assuming you can look at a request and identify, oh, this is asking for the schedule versus this is making the booking; you can identify the different URLs from the requests, but you don't know how many resources they consume at the backend. You only know the counts, the frequencies of the various request types in each epoch. Yes, and the way we've tested it is we've generated random load to the server: random input load, random mix, just a random number of the various types of requests bombarding the server, and this is what we got. And the two proxies are separate: requests don't get routed through token gen after the wait time is assigned. The entity assigning the wait times just assigns the wait time and signs off; it is not involved in the request anymore, for concerns of scalability.
Token check, on the other hand, is an inline proxy; it sits in front of the server, and ideally can be part of the server, and you need it there to collect all these metrics.

So how do you do distributed capacity sharing? This is the benchmarking for a single token gen, which we've built on a single machine. Obviously, on the same hardware, a token gen can handle a lot more requests than the application server, because it's doing much less work: the application server is presumably accessing a database and doing some computations, whereas this thing is just looking up an array and returning a wait time, so it scales much better. But still, at some point it is going to hit a limit. On the x-axis is the number of requests per second we are sending to token gen, and on the y-axis is its goodput. As you can see, after about 8,000 requests per second, this machine, a four-core machine running a variant of the Apache server that implements token gen, cannot keep up with the load; its goodput starts to drop, it starts to drop requests. So a single token gen will not work: at some point it becomes the point of failure, and even though you're trying to protect the website, your solution itself gets overloaded. So we do need a distributed design for token gen.

So you should have multiple replicas, and user requests can be split among them, for example via DNS: you could return multiple IP addresses, or return different token gen URLs to different users, whatever; you decide how to split traffic amongst multiple token gen replicas, and these replicas together should shape traffic to the capacity of the server. If your total server capacity is, say, 15 requests per second, each of them should take some fraction of that pie and shape traffic to that number, so that each one stays below its share, but together they are able to handle a huge spike in load. And this can be dynamic as well: based on the load coming in, you could dynamically scale the number of replicas up and down. We want our design to be as easily horizontally scalable as possible.

So the key question is how you split capacity between the different token gens. One way, of course, is an equal split of capacity: total capacity C, N token gens, give each one C/N. But clearly that is not desirable, because nothing guarantees an equal split of traffic across the token gens. What if the token gen at the top gets a very small fraction of the traffic and the one at the bottom gets a larger fraction? The one at the bottom will end up assigning longer and longer wait times, which is in some sense not fair. So we would like to avoid that. Yes, the replicas are geographically distributed in different areas; think of it like something deployed in the cloud, like a CDN, so it's hard to put a load balancer in front of it. That is a possibility, but then you have a load balancer and you worry about scaling that load balancer, which is a different ball game altogether. Here we are assuming there is nothing in front, users are assigned one of the N token gen replicas randomly or by geography, whatever it is, and we are not assuming an equal split of load.
If we assumed an equal split of load behind a load balancer, the next few slides would become kind of pointless, but to make the problem more interesting, we have not assumed an equal split of load. So the goal is equalized wait times: even though the system is distributed, no matter which replica you go to, you should get roughly the same wait time you would have gotten from a single centralized system. Why this goal? Of course, we could have picked several other goals, but this seemed like a reasonable initial one.

So how do we go about doing that? All the token gens each have their own wait time array; each one knows how many requests it has scheduled into the future. It needn't be five each; each one has a certain share of capacity and its own array of wait times. All the token gens talk to each other and exchange this information, so each token gen can calculate the average wait time it is assigning and the sum of the average wait times its peers are assigning, and it takes its share of capacity roughly in proportion to its own average wait time relative to that sum. So if I am assigning much longer wait times on average compared to my peers, then I am entitled to a larger share of the capacity pie. This is the simple algorithm we've implemented to calculate the share of capacity. We still have a bit more work to do to understand whether this is the best possible way to split it, and we want to explore the design space: is there a better way to calculate the share of capacity, something that is somehow provably optimal? All of that is work yet to be done, along with the scalability of the mechanism itself. Here we need every pair of token gens to communicate with each other to exchange this wait time information. In the literature on distributed rate limiting, there is the idea of gossip-based communication: X tells Y, Y tells Z, instead of everybody talking to everybody. So as part of future work, we plan to explore a more gossip-based mechanism to reduce the communication overhead, as well as perform a scalability analysis of this mechanism. But for the present, we have a simple distributed design that empirically works, equalizes wait times, and helps token gen scale to a much larger load than a single token gen could handle.

So what I told you about is the share of capacity: what each token gen is entitled to. There is another concept of usable capacity. Even if you're entitled to a certain share of capacity, sometimes it may not be possible for a token gen to use all of that share, because it has to respect prior allocations: the other token gens have already made some allocations into the future. So if a new replica comes up, it may not be able to claim its entire share immediately. For example, assume that at the start, a lot of traffic is coming to only one of the replicas. That replica sees that nobody else is getting any traffic and says, great, let me take the entire capacity, and it schedules at that capacity for a long time into the future, using up a lot of the future slots. Now, in between, a new token gen comes up, starts seeing traffic, and its share of capacity suddenly increases.
But for a transient period of time, it will not be able to use its entire share, because it still has to respect the allocations that the other token gens have made into the future. So we have this notion of usable capacity, which is calculated as your share minus whatever commitments the other replicas have already made into your share. Eventually the first token gen will realize that there is some other replica up and taking traffic, and it will cut down its own share. But for a transient period, you will not be able to use your entire share of capacity.

I'll illustrate this with a simple experiment that we did. The green line here is the server capacity, and the blue line is the load being thrown at the token gens; we had two token gens in this case, and the red line is what the server actually saw. So the two token gens together smoothed the load to match the capacity of the server; this is from an actual implementation. Now let's look at the individual token gens: that is the first one and this is the second. The first one started getting load from time t equal to zero. Initially it saw that the other token gen wasn't getting any load, so it assumed it had the full server capacity to itself and started scheduling requests at that rate. Once the second token gen started getting traffic, the first one realized it needed to reduce its share of capacity, and you can see an adjustment period. You can also see that the second one learns that its share of capacity is a little higher, but it takes a while before it can actually start sending requests at that share, because it has to respect the earlier allocations that were made. Eventually, in the long term, things stabilize: both token gens are seeing roughly similar load, so it settles to roughly equal shares of capacity, and together they shape traffic to the server's capacity. And if the two token gens get different amounts of load, say a total of 8,000 requests per second arriving in different proportions, note that the y-axes in the two graphs are different, one is around 6,000 and the other around 2,000, you can see that the shares of capacity they learn are also different; they learn different amounts and adjust dynamically, so that in the end the wait times are roughly equalized.
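Here is a minimal sketch of how a token gen replica might compute its share of capacity from the exchanged average wait times, and its usable capacity in a future slot after accounting for what peers have already committed there. The exact formulas and names are my own illustrative reading of the description above, not the actual WebQ code.

```python
def share_of_capacity(total_capacity: float, my_avg_wait: float,
                      peer_avg_waits: list[float]) -> float:
    """Replicas assigning longer average wait times get a larger share."""
    total_wait = my_avg_wait + sum(peer_avg_waits)
    if total_wait == 0:
        # No backlog anywhere: fall back to an equal split.
        return total_capacity / (1 + len(peer_avg_waits))
    return total_capacity * my_avg_wait / total_wait

def usable_capacity(total_capacity: float, my_share: float, slot: int,
                    peer_commitments: list[dict[int, float]]) -> float:
    """Work this replica can still schedule in a future slot: its share,
    capped by whatever room is left after peers' existing allocations."""
    committed_by_peers = sum(c.get(slot, 0.0) for c in peer_commitments)
    spare = max(0.0, total_capacity - committed_by_peers)
    return min(my_share, spare)
```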
So I've told you about the two different parts, the capacity estimation and the distributed design; now a little bit on the implementation. We've implemented token gen as a FastCGI extension, basically based on the Apache web server. An earlier implementation could handle only a single type of request and was centralized; now we have a distributed implementation that takes different types of requests into account. Once the capacity has been estimated for the different request types, token gen takes the relative hardness into account when assigning wait times, so the two pieces have been integrated: when it marks something in the wait time array, it actually marks the number of units of work it has scheduled in a certain slot, not just that it has scheduled one request. It marks the hardness of the request implicitly, so that even if the mix of requests is changing, a few harder requests here, a few easier requests there, that is taken into account while scheduling load to the web server. Token check was implemented as a lightweight inline proxy, and the web server was a Moodle backend.

And it turned out that we had to build our own custom load generator; why, I'll come to in a bit. Our load generator actually required about as much work as the rest of the system, given what it had to do: it has to actually overload a server. So we built an open-loop load generator: given a certain number of requests per second, it generates load at that rate; every request is sent to token gen, the thread responsible for that request waits out the assigned wait time, then goes to the web server and gets the response back, and the generator is instrumented to report the goodput and response times, so that we could monitor how well the entire system was doing. Each request is wrapped in a lightweight Java user-level thread called a fiber, and these fibers are multiplexed on a fixed number of kernel threads, so the total number of kernel threads we use is fairly small. Those are some of the implementation details.

So why did we have to build our own custom load generator? Initially we started out with existing load generators like Apache JMeter. What do existing load generators do? They are usually built to bombard a server with a large number of requests, but they are not really built to handle these long wait times. When we test our system, we take a server with a small capacity, and our metric of how well we are doing is how large the overload is compared to the capacity of the server. The minute you have a huge peak compared to the base capacity of the server, and each request uses up an operating system thread, what happened with the existing load generators is that they very soon hit a limit in terms of memory or the number of live threads, because each thread making a request got this huge wait time and consumed resources on the machine for that entire wait. This graph was generated with Apache JMeter: we told it to send around 700 requests per second to the server, and as the red line shows, it could not sustain a burst of 700 requests per second for very long, because it soon hit the limit on the number of threads, and this was on the most powerful machine we could find in our lab, a 24-core server. Yes, the load generation itself is open loop: when I say 100 requests per second, I am generating 100 new requests per second; sure, each request then has a wait time and sends the next part, but I am not waiting for one request to finish before generating the next. So this is how one of the best load generators did in our case; it really couldn't sustain the load, and we had a lot of issues testing our system at high load simply because our load generator gave up. And that is not a criticism of existing load generators, because this is not the use case they are built to test: typical server response times are small enough that it is only in our use case of long wait times that the thread-per-request approach fails.
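For illustration only: the actual load generator was written in Java using fibers, but a rough Python asyncio analogue of the same open-loop idea might look like the sketch below, where lightweight tasks rather than OS threads sit out the assigned wait time. The URLs, the response format, and parse_wait_time() are hypothetical.

```python
import asyncio
import time
import aiohttp

def parse_wait_time(body: str) -> tuple[int, str]:
    # Hypothetical response format "wait=<seconds>;url=<redirect target>".
    # The real token gen returns an HTTP auto-refresh page instead.
    wait_part, url_part = body.split(";", 1)
    return int(wait_part.split("=", 1)[1]), url_part.split("=", 1)[1]

async def one_request(session: aiohttp.ClientSession, results: list) -> None:
    t0 = time.monotonic()
    async with session.get("http://tokengen.example/enqueue") as r:  # hypothetical URL
        wait_s, redirect_url = parse_wait_time(await r.text())
    await asyncio.sleep(wait_s)                  # the task, not an OS thread, waits
    async with session.get(redirect_url) as r:   # goes via token check to the server
        await r.read()
    results.append(time.monotonic() - t0)        # end-to-end time including the wait

async def open_loop(rate_per_sec: int, duration_sec: int) -> list:
    results, tasks = [], []
    async with aiohttp.ClientSession() as session:
        for _ in range(duration_sec):
            for _ in range(rate_per_sec):        # fire-and-forget: open loop
                tasks.append(asyncio.create_task(one_request(session, results)))
            await asyncio.sleep(1)
        await asyncio.gather(*tasks)             # let in-flight requests drain
    return results

# e.g. asyncio.run(open_loop(rate_per_sec=700, duration_sec=60))
```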
On the other hand, this graph shows the same workload given to our own load generator: 700 requests per second against a lower server capacity, and it could sustain 700 requests per second fairly easily. I'm not showing the other results, but the memory consumption was much, much lower and so on, because in the end it is a fixed number of kernel threads, which leads to much lower memory consumption and easier load generation.

So now, putting all the pieces together, just a couple more slides. This is the final graph, where we've integrated everything: the capacity estimation, the distributed token gen, a couple of token gens, load going to both, variable load, a variable mix of requests. The blue line here is the actual load going to the token gens in terms of subunits of work, this is the server capacity in subunits, and that is the final shaping we could achieve: we ensure that the load reaching the server stays below its capacity. And finally, what does all of this give you? Better response times. The red line here is the response time of the server without WebQ, and the green line is the response time with WebQ. Obviously, once your server is receiving load only at its capacity, the response times are lower and much more deterministic, which leads to an overall better user experience even after you get to the server. This follows fairly directly from the fact that you always stay below server capacity.

So, in summary: it's a simple mechanism, it does not require modifications to clients or servers, it's a bunch of lightweight proxies, and the design itself is scalable and robust to overload. Assuming users are fine with this feedback mechanism of being asked to wait instead of hitting random retries, this solution makes fairly good sense. As for general feedback, this is still work in progress. An initial version, which did not have a distributed architecture and did not handle different types of requests, was built by an earlier batch of students; my current MTech students are still working on it as part of their MTP. So any feedback on the design itself, and also general comments on the usefulness of the service and who the potential users could be, IRCTC, or Flipkart when they have their mega, giga sales, whatever, would be highly appreciated. Yes? A queue could build up, yes, and that could bring the server down. Yes: if our capacity estimate errs on the high side rather than underestimating, then yes, you're again getting into the problem of server overload. Yeah, so that is something that we've considered. But the point is that these proxies would then need to keep a hook onto the user request, or the user needs to come and check in, or they need to maintain some communication with the user to tell him the queue is moving faster or slower. So it's always a tradeoff between how much state you are willing to keep, which further increases the load on the system.
So it is a conceivable design that you assign shorter-than-needed wait times: instead of the total wait time W, you assign, say, W by two and ask the user to come back a little later; by the time he has waited that long, maybe you have a better estimate. Yes, that is a good possibility: instead of giving the total wait time at once, you ask the user to check back periodically, and you will have a better idea of how the queue is moving. Yes, that's a good idea: not keep any state, but ask the users to periodically check back so that the estimate improves.

So it depends on the use case, on what peak load you want to handle. At some point, even with 100 replicas, you will hit a physical limit. The way you provision token gen, the number of replicas, is based on the peak load you want to handle: say, if your server capacity is 100 requests per second, I want to go up to a million requests per second without compromising wait times or user experience; beyond that, fine, no guarantees. So it depends on how large that peak is relative to your server capacity. It would not be unavailable, but it would be filled up with requests of this particular type for a very long time, yes. That is possible, but in the earlier case you would bring down the server; here at least the server is up and running, and hopefully a few other people will get through. You could integrate an intrusion detection system into token gen that weeds out certain requests, but yes, that point is well taken and we have no solution for that. It does spread out the load, so it spreads out an attack, by definition.

So here the request was to view a course webpage, or any action that you would want to do on Moodle. This involves a database access: you go to the backend database, fetch all the course content, and return the HTML response. Yes, they're all real requests on a Moodle server. Last question, probably, and then we can wrap up and take it offline. And if the user is not waiting, you can detect that. But this requires you to maintain an HTTP endpoint, which is what we are trying to avoid here; our explicit goal was to not maintain an HTTP endpoint, for the purpose of scalability. If that option is there, yes, that is a good way to check whether the user is still waiting or not. So that's what I described at the beginning: if you hit refresh before your wait time, we have some small piece of cryptography, a hashed MAC, that won't check out, so there is no incentive for users to jump the queue. As for waiting a little longer than the first request: no, we are not assuming one request, and I don't yet fully get the question; we can probably discuss it offline.

So one last question, which is not a technical question, unlike the ones we've seen. This is an example of academic research working on a real problem: do you want to build this out as a product, and how do you want to publish this? Yeah, that is something we've been thinking about as well, and I would solicit feedback on what to do. But yes, it is possible: you put it out as open source code, you start something up based on it, and if there's enough interest, get people to use it.
You could monetize it by serving ads while people are waiting, and give it free of cost to websites, generating revenue from the ads. So there are several different directions; to be honest, we haven't worked all of it out, but we hope to in the future. Yes, sure. Yes. All right, on that note, let's thank the speaker. Thank you all.