Let us call Sandeep up for the next topic, SharQ. Sandeep is the primary author of SharQ and a product engineer at Plivo. Please welcome Sandeep to the stage.

I am Sandeep, and I work as a product engineer at Plivo. The product engineering team at Plivo is concerned with building services that talk to the core telephony components — that is what we do. I am also the primary author of SharQ, and that is what we will be talking about today: the rate-limiting queuing system we built and which we are open sourcing today. So, moving on — the problems. Any web service deployed on the cloud has these two main problems. The first is unpredictable traffic spikes on your cloud infrastructure. Generally we plan for capacity: when you run cloud infrastructure, you provision for maybe 1.5x or 2x of your peak traffic, say "okay, this is what my infrastructure is going to be," and work with that. But consider a scenario where you suddenly get a very high spike — say your average is 10 requests per second, and within the next five seconds it jumps to 1,000 per second. How do you handle that? There are a couple of quick ways people think of. The first is to over-provision: keep a lot of servers in production already, so your infrastructure can already handle a lot of traffic and you need not worry. But that makes no business sense — why would you run a very large cluster just for a spike that lasts a minute or two? It is overkill: you would be spending a lot of money on infrastructure, and effectively losing it. The second, if you are on an infrastructure provider like AWS, is auto scaling: when a machine goes down another comes up, or when the load increases new machines come up. But there is a problem there too — by the time the new machine comes up, the spike might already have ruined your system, or at least had a very bad impact on it. So those are the two immediate options, and they do not really solve the problem. To give you an example of where this might happen, consider a data analytics company. It might have a set of clients sending it data — something like Google Analytics, where client sites send page hits to your servers. Now suppose one of those clients lands on the front page of Hacker News, or gets featured on TechCrunch: there is a sudden spike, a lot of traffic coming at your infrastructure, and you just have to handle it. That is one use case. The second problem is gracefully handling third-party services. In most cloud services we use some third party — for example Mailgun or SendGrid to send out emails — and those third-party services put a cap on your usage.
What they say is: you cannot send more than, say, 300 emails per minute, or per hour, or whatever — they have those caps. So that is another problem we face generally. Looking at the first problem, a simple architecture representing it would have an HTTP server, and whenever a request comes in, the HTTP server forwards it directly to its respective cluster — say there is a service cluster one, a service cluster two, a service cluster three. What happens when a spike occurs is shown in the graphs. In the graph on the left there is a sudden, huge spike on the HTTP server, and because of that spike you can see a specific service cluster being hit very hard, which is the graph on the right. There is a clear correlation: as soon as the server fronting your infrastructure gets a high load, it forwards all those requests to a particular cluster, and that cluster is suddenly under heavy load. Because of that spike, all the users of that particular cluster are affected. So this is not a very efficient architecture — that is problem number one.

To illustrate problem number two: there is a third-party service you are making requests to, and — again, the same problem — they have caps, and you may not be able to adhere to their rate limits. The simplest solution you would think of is something like this: introduce a queue. That solves the basic problem of all requests being sent straight to the next level of clusters. The queue buffers all the requests, and then service cluster two, say, can pull jobs out depending on its capacity: if it is ready to accept jobs it keeps pulling them, otherwise the jobs stay in the queue, and it can consume at whatever rate it wants. This is a very simple system, and it can be implemented very easily with most popular queues — Celery, for example: you define a task, say you need this worker running in this cluster and that worker running in a different cluster, and they just pull the jobs. But there is a primary problem with this: there is a single queue. Suppose a spike is related to a single cluster — say service cluster one. That fills up the queue and slows down requests from people who are trying to reach cluster number two. Everything is in one queue, so a spike for one cluster also hurts the service performance of the other cluster. The easiest fix for that would be to put in two queues — a service-one queue and a service-two queue — and that would solve this problem.
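To make that concrete, here is roughly how such a two-queue setup might be wired up with Celery — a minimal sketch with made-up task and queue names, not anything shown in the talk:

```python
# tasks.py -- one Celery task per downstream service, each routed to its own queue
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

# Route each task to a dedicated queue so a backlog for one service
# does not sit in front of jobs meant for the other service.
app.conf.task_routes = {
    "tasks.call_service_one": {"queue": "service_one"},
    "tasks.call_service_two": {"queue": "service_two"},
}

@app.task
def call_service_one(payload):
    ...  # work done by workers running inside service cluster one

@app.task
def call_service_two(payload):
    ...  # work done by workers running inside service cluster two

# Cluster one runs:  celery -A tasks worker -Q service_one
# Cluster two runs:  celery -A tasks worker -Q service_two
```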
What this gives you is two queues: when the HTTP server gets the traffic, it knows it has to route a request meant for service cluster one into the service-one queue, and a request meant for cluster two into the service-two queue. So this again seems to solve the problem, and it looks like a good solution, but it is not complete. The problem now is that even though we separated the dependency between the two services, there is contention between two users of the same service. Suppose one user has made 10,000 requests to service cluster one, so there are 10,000 jobs in the queue. What if another user makes a single request to the same service? Because there is a single queue, he goes to the back of it, which means he has to wait until all of those jobs are completed. That is the basic problem with this. Even this kind of architecture can be implemented very easily with something like Celery, as in the sketch above: you define two queues and two tasks, run a worker in service cluster one and another in service cluster two, each connecting to its queue and pulling out jobs. But if you think about it, this still has a problem — it is not an ideal solution; it does not solve the problem completely.

What you would ideally do is an architecture like this, and this is where SharQ, the queue we built, comes into the picture. It gives you the ability to create queues dynamically. You can have, say, a per-user queue and set a rate limit on that queue, which lets you rate limit at a much more granular level: a user who is generating a lot of traffic — generating a spike — is put into his own queue with its own specific rate limit, and that does not affect all the other customers or users of the same service, because each of them has their own queue. That is exactly what SharQ does: it creates queues dynamically to ensure fair queuing, so that a spike from one user does not affect another user; it gives you the ability to change rate limits on the fly; and it gives you a constant flow — it does not hand out jobs in bursts. To give you a glimpse of what SharQ can do, let's dive into a quick hands-on demo. Let me run through what I will be doing in the demo. I will show you the code I will be running. We will have a worker listening on a queue of a particular service type. Then I will emulate a situation where one customer generates a spike, and show you how the queue rate limits and hands out jobs at the specified rate. Then, even though there are already a lot of jobs in the queues for that service type, I will make a request as another user and show that this user's request is not affected by the jobs already queued. And at the end, I will show you how easy it is to change the rate limit of a queue on the fly. So let's dive into the demo — I just need something to hold the mic.
I'll be showing a code demo, so I just need someone to hold the mic. Thanks. Let me just position this so it is visible. So, the SharQ server has a command that takes a configuration file; once you run it, the server starts. And this is a simple worker script. It runs in a while loop so that it is dequeuing jobs from the queue continuously, and it uses the requests library to make HTTP requests — SharQ exposes HTTP APIs both for putting a job into the queue and for pulling one out. What I am doing here is a dequeue operation, which is /dequeue/sms. Here sms is the queue type — the service type. Remember there were two services, service cluster one and service cluster two, each with a lot of queues; let's assume SMS is one of those services. If the dequeue succeeds it returns a 200, and if the dequeue fails — nothing is ready — it returns a 404, so I handle both cases. Since it is non-blocking, you just need to keep polling the server for jobs. Once I get the job, I process it — right now I am just printing it — and then I make a finish request. What the finish request means is this: in the SharQ workflow, whenever you enqueue a job it goes into the queue; when you do a dequeue, it gives you the job and then waits for a particular interval for a confirmation from the worker. Once the job is dequeued, the worker needs to confirm that the job it just pulled was processed successfully — that is the job of the finish request. The finish request looks similar: it is /finish/, then the type of the service, which is sms, then the queue ID and the job ID. The queue ID is the unique ID identifying each customer — or, in this case, each user — and the job ID uniquely identifies each job you are trying to process. So let me run this worker; the SharQ server is already running, so the worker will just keep polling it and waiting for jobs. Let's do this, and then we will come back to the other script, which does the enqueue. You can see it polling and waiting for jobs, again and again.
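For reference, the worker on screen boils down to roughly the following. The host, port, HTTP method, and response field names here are my assumptions based on what is visible in the demo, not a definitive description of the SharQ API:

```python
import time
import requests

SHARQ = "http://localhost:8080"  # assumed address of the running SharQ server

while True:
    # Non-blocking dequeue for the "sms" queue type: 200 means a job came back,
    # 404 means nothing is ready yet, so we simply poll again.
    resp = requests.post(f"{SHARQ}/dequeue/sms/")
    if resp.status_code != 200:
        time.sleep(0.2)
        continue

    job = resp.json()
    print("processing job:", job.get("payload"))  # "process" the job

    # Confirm completion; without this, SharQ will eventually re-queue the job.
    requests.post(f"{SHARQ}/finish/sms/{job['queue_id']}/{job['job_id']}/")
```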
Now let's go to the other side and look at the other script, which does the enqueue. What I am doing here is emulating a scenario where a single user makes a lot of requests: this is a particular user with ID one, making requests to the service type sms, and I am inserting 10,000 jobs into the queue in a loop. On line 12 of the script you can see a parameter called interval. The interval specifies the rate limit of the queue: it tells the SharQ server how long to wait between handing out two successive jobs, so it is basically the inverse of the rate limit. To give you an example, if you want to rate limit this queue so that SharQ gives out only one job of this type per second, you would set the interval to 1000, meaning 1000 milliseconds. You specify that, you specify the payload — the actual message that has to go into the queue — and then you just make the HTTP request. So let us actually make the requests and see how it is being rate limited. On the right you can see a lot of jobs being enqueued, but they are being dequeued at a controlled rate — one job per second. The SharQ server does not give out a job until it is ready to be dequeued according to the rate limit; once it is ready, it dequeues it, so the jobs come out at a controlled rate.

Okay, now let's emulate the situation where, with a lot of jobs already in this queue, one more user makes a request, and see how that user is not affected. This is user number two, making the same type of request for the same service type. If you run this, you can see from the message that it says "from two": even though there is a lot of traffic from user one, it is not affecting user two, because each of them has their own queue, created dynamically, with its own rate limit. That is exactly what SharQ does. To show you one more thing — SharQ's ability to change the rate limit: right now this queue is dequeuing at one job per second, so let's lower the rate and say it should dequeue only one job every five seconds. SharQ has an internal API for that, so I'll show you a quick example. What this does is, for the queue of service type sms and user one, set the interval to 5000, which means one job every five seconds. As soon as I do this you can see it start slowing down: just now it was dequeuing at one per second, now it is one per five seconds. All of this can be done without any configuration changes — you can do it on the fly, dynamically, in real time. That is where SharQ comes into the picture, and that is how it differentiates itself from the other existing queues.
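Put together, the enqueue side of the demo and the rate-limit change amount to calls roughly like these. The enqueue and interval endpoint paths and the payload fields are my reading of what was on screen — treat them as assumptions rather than the exact API:

```python
import requests

SHARQ = "http://localhost:8080"  # assumed address of the running SharQ server

# Emulate user one flooding the "sms" service: 10,000 jobs into that user's queue,
# with interval=1000 ms, i.e. SharQ should release at most one job per second.
for i in range(10000):
    requests.post(f"{SHARQ}/enqueue/sms/user_1/", json={
        "job_id": f"job-{i}",                                  # unique per job
        "payload": {"to": "+15550100", "text": f"hello {i}"},  # the actual message
        "interval": 1000,                                      # ms between dequeues (inverse of the rate)
    })

# Later, slow this one queue down to one job every five seconds -- on the fly,
# with no configuration change or restart.
requests.post(f"{SHARQ}/interval/sms/user_1/", json={"interval": 5000})
```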
So let's get back to the slides now. You might be guessing already that this is based on a particular algorithm. For those of you who don't know it, there is an algorithm called the leaky bucket algorithm. Think of a bucket that is leaking water: regardless of how sporadic or erratic the input into your queues is, the output should come out at a specified average, rate-limited flow. You can see a lot of drops going into the bucket and a very consistent flow coming out of it — SharQ is based on this concept. To dig more into the internals: SharQ has two components, the SharQ server and the SharQ core. SharQ core implements the basic rate-limiting logic, and SharQ server implements an HTTP API on top of the core. The SharQ core is the layer of the queue that talks to Redis — we use Redis to store all the jobs — and the SharQ server uses Python, Flask, and gevent: Flask exposes the HTTP API and gevent makes it async. MessagePack is the serialization format we use to make the payload more compact when we store it.

Now, once we say Redis, you might be wondering how we achieve high availability — it is in memory and, by itself, not persistent in this setup. This is what we do in production: we have a SharQ master and a SharQ slave, which is basically a Redis master and a Redis slave. At any point in time the slave has a streaming replica of the master, and a component called Redis Sentinel takes care of automatic failover: as soon as the master goes down, it automatically promotes the slave to be the new master. That is how we achieve high availability. When we were building SharQ there were two aims, high availability and scalability — not in the sense that SharQ supports them out of the box, but that it is architected so it can be deployed to get both. For scalability, we call a single master-slave pair a shard, and we run multiple such shards behind a load balancer. The producer enqueues a job to the load balancer directly, without going to the individual machines, and a consumer, or worker, dequeues jobs the same way. There are two caveats when you deploy SharQ in this architecture. The first is that because the producer does not know how many shards there are, and each shard is independent, an enqueued job can land on any machine. If you send ten jobs, five can end up on one machine and five on another, which means that when they are dequeued the effective rate multiplies: if you ask for 20 per second and you use two shards, you effectively get 40 per second. So at any point in time the producer should know how many shards are behind the load balancer. That is the first caveat.
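One way for the producer to account for that — just a sketch of the arithmetic, not a SharQ feature:

```python
# With N independent shards behind the load balancer, one logical queue's jobs get
# spread across all of them, and each shard releases jobs on its own clock.
# To keep the overall rate at R jobs/second, each shard's interval must be N times larger.
desired_rate = 20            # jobs per second across the whole deployment
num_shards = 2               # shards behind the load balancer

naive_interval_ms = 1000 / desired_rate                  # 50 ms if there were one shard
per_shard_interval_ms = naive_interval_ms * num_shards   # 100 ms, so 2 x 10/s = 20/s overall
```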
The second caveat is on the consumer side: the consumer needs to know where to send the finish request, the confirmation. The dequeue goes through the load balancer, so the consumer does not know which machine the job originally came from — it could be machine one or machine two — and the finish request has to go to that same machine; if it goes to a different one, the job will not be there. To handle that, the dequeue returns, along with the job, the IP address of the machine it was dequeued from, so that the worker can make the finish request to that machine directly. That is the green line in the diagram: instead of going through the load balancer, the finish request goes straight to the master of that particular shard. So that is how we achieve high availability and scalability.

SharQ is open source and available, so here is the roadmap — what is available today and what is not. It is a new system, still in its initial stages, but we have been using it in production for a while now. We have the enqueue API to enqueue a job, the dequeue API to get a job out of SharQ, and the finish API to mark a job as successfully processed. The interval API is used to change the rate limit of any queue in real time, and the metrics API gives you some brief metrics. What we have in the pipeline is a feature to check the status of each job, a max-retries feature, and extending the metrics API to give richer analytics. One thing that is also missing right now is a SharQ client library. In any queuing system the client library means you do not have to do much — just connect and dequeue a job — but right now, as you saw, you have to write that polling loop yourself. We already use a SharQ client library in production, but it is not yet mature enough to be open sourced; it is in the pipeline and we will be open sourcing it very soon. And of course, as it is an open source project, any suggestions or improvements you want, or anything you think would be a good feature for SharQ, can go in. SharQ is available at sharq.io — you can check out the documentation, the source code, the GitHub links, and the FAQs there. That's it — we can have a brief Q&A session if any of you have questions.

[Audience question:] How is this SharQ server different from a RabbitMQ server? What are the advantages of using SharQ instead of RabbitMQ? [The answer here is mostly garbled in the recording; the recoverable gist is:] the main point of SharQ is that it can create as many queues as you need, dynamically, at the moment you need them, and each of those queues carries its own rate limit. So that is one good point. We probably have time for one more question. [Audience question, largely inaudible — roughly: you have talked about rate limiting the output; what about the input, the enqueue side?]
Okay, right — the way it is architected right now, the basic assumption is that the broker is able to take in as much as you throw at it. And in a way that is the point: you want a queue precisely to absorb that scenario. Say you are using this behind an API — the worst thing for the customer would be to get back a "you have exceeded your rate limit" error. Instead of that, you just accept the request and push it into the queue. That is the basic problem we are trying to solve. [Audience follow-up:] But if you try to control the rate while enqueuing, it means you have to throw an error back to the user. — Exactly, I get your point, but that is exactly what SharQ is trying to solve. SharQ is saying: your resource does not get overloaded; I will just accept the job, push it into the queue, and it will get dequeued at a later point in time.

[Question:] Why was the design decision made to start this project from scratch instead of just patching it onto Celery, which is more widely used? That is a pretty good question. When we evaluated Celery for this and looked at its design and the way it uses Redis data structures, rate limiting was not part of Celery's inherent design. There was a lot of work involved — a lot of changes to Celery's internal code — which was risky. You can always do that, but then you are in dangerous territory: you have to maintain it on your own, make sure it is bug free, and make sure it is not causing regressions in any other part. Beyond that, the basic underlying factor was that Celery was not architected to support rate-limited queues created dynamically. That was the reason. [Question:] Do you intend to support RabbitMQ and priorities? — To support RabbitMQ as a broker, and priorities? Yeah, that is probably a good idea, but I will tell you why it was not possible. You can use RabbitMQ with priority queues, but the number of queues you define is fixed up front. And if you are asking whether you can plug in RabbitMQ as the broker: no, not right now. SharQ is heavily dependent on Redis's underlying data structures, so there is no support for any broker other than Redis at the moment.

[Question:] You said you used Lua in your project — could you throw some light on what you used it for? That is a good question; my bad, I missed explaining why we use Lua. Redis exposes a scripting language called Lua. If you check out the source code, you will see that an enqueue operation involves a lot of separate Redis operations — I update a lot of data structures in Redis for a single enqueue — which means it has to happen as a single transaction. I cannot have separate Redis commands running one after the other, because if that process gets interleaved with another process accessing the same data structures, it would bring in an inconsistency. To make the operations atomic — for example, the enqueue operation should be atomic, so that while one enqueue is going on, no other process can be half-way through enqueuing to Redis — we use Lua.
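To illustrate the idea — this is a toy example, not SharQ's actual script — with redis-py you can register a Lua script, and every Redis command inside it executes as one atomic unit, so two enqueues can never interleave halfway through:

```python
import redis

r = redis.StrictRedis(host="localhost", port=6379)

# Toy "enqueue": both updates run atomically inside Redis because they live in one Lua script.
ENQUEUE_LUA = """
redis.call('SADD', KEYS[1], ARGV[1])    -- remember that this queue id exists for the queue type
redis.call('LPUSH', KEYS[2], ARGV[2])   -- push the job payload onto that queue's list
return 1
"""

enqueue = r.register_script(ENQUEUE_LUA)
enqueue(keys=["sms:queue_ids", "sms:user_1:jobs"],
        args=["user_1", '{"to": "+15550100", "text": "hi"}'])
```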
[Question about how SharQ keeps track of the dynamic queues and their rate limits.] Yeah — so, as you are aware, there are multiple queues and each queue has a rate limit, right? What we do — I am not sure how many of you know Redis data structures — is use a data structure called a sorted set. A sorted set takes a value and a score: the set holds values, each accompanied by a score, and the values are kept sorted by score at all times. So if you pull the first element, it is guaranteed to have the lowest score in the entire set. We use that: we put the queue ID in as the value, and the time at which that queue's next job should be dequeued as the score. On the dequeue side, it keeps checking that set against the current time — is the first job ready yet? — and as soon as it is ready, it pulls out the first entry and gives out the job.

[Question:] You said you have a master and a slave and use Sentinel to decide the election. How does the producer or consumer know which is the master and which is the slave — or does it matter where we send the request? Yeah, this is actually a good question. If we go back to this architecture — I am not sure if we have time... okay, good. Basically, what would happen if the master goes down? At that point Sentinel kicks in, figures out which slave to promote if there are multiple, and promotes it. There are a couple of things to handle at that moment: there might be jobs which have been dequeued and are still pending their finish request, and that finish request would carry the IP address of the old machine — the one that just went down. So what you can do is have Sentinel, at the end of the failover process, run a script to reconfigure things, so that the clients can be notified that this is the new master they have to send requests to. That is, I think, one good way to handle the failover. If you know how Redis Sentinel works, you run several Sentinels — the ideal way would be to run a Sentinel on each of the client machines — so as soon as it knows the master is down, it starts the failover, promotes a slave to master, and at the end runs the script that reconfigures its own client. It is on all the clients anyway, so each one reconfigures its own.

[Question:] I like the idea of creating queues dynamically. But in the burst kind of example — it is not just that there are too many requests; sometimes the burst is too many clients. If you go to the front page of Hacker News, you have everybody trying to load your homepage: thousands of clients, each trying to make one request.
So here you would end up creating thousands of queues, which doesn't really solve the problem. — [The answer that follows is partly garbled in the recording; the gist is:] you do not have to create one queue per end client — the queue dimension depends on your business logic. In that example you could create queues per website, or per data source, so the burst of hits coming from the site that is on the Hacker News front page goes into that site's own queue and does not affect processing for the other sites. [Follow-up:] Okay, so is the queue destroy process automatic — if there are no jobs left, does it just destroy the queue? — It depends on... hello, are you able to hear me? Yeah. So the queue creation process depends on your business logic. For example, if you have a service that users make API requests to, you can create a queue per user, using the user ID as the queue name, so that each user has his own queue. Did I answer the question? Yeah — so it basically comes down to the business logic: that is one example, but depending on the business logic you have, you can create any number of queues.

[Question:] In RabbitMQ we have priorities that can be set on the queues — low, medium, and high. Do we have that here, and what is the default? So there is nothing called a priority here; priority manifests itself as the rate limit. If you want a particular queue to dequeue at a higher rate, you set a higher rate limit so that it hands out jobs at shorter intervals. Between two queues with the same interval there is no priority — no factor says this queue is more important than that one. The only thing that determines which queue a job is picked from is the rate limit. So in that scenario, instead of making user-level queues, you can create queues based on priority: this queue gets 30 jobs per second, that queue gets 40, depending on the job or message type. Or, if you want both, you can chain the queues: a job is first rate limited per user when it is dequeued from one queue, and then you re-enqueue it into another queue — say one per third-party service provider — which is rate limited for that provider. You just re-enqueue it with a different queue ID. — Yeah, probably, yes. We have more questions; we can take one more. [Two-part question, not fully captured on the microphone: what happens if a finish request never arrives, and are the IP addresses returned by dequeue reachable?] Okay, I will answer the second one first — about the IP address. It depends on how you deploy it. Let's say you are deploying on AWS and everything is in the same zone or the same region: then you can get by with the private IP addresses, or the FQDNs, that AWS gives you, so that it routes over the private network itself.
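As a quick aside before the last answer: the queue-chaining idea from the priorities question would look roughly like this, with assumed endpoints and a hypothetical per-provider queue:

```python
import requests

SHARQ = "http://localhost:8080"  # assumed address; endpoint shapes as in the demo

# Stage 1: pull a job that was rate limited per user...
resp = requests.post(f"{SHARQ}/dequeue/sms/")
if resp.status_code == 200:
    job = resp.json()
    requests.post(f"{SHARQ}/finish/sms/{job['queue_id']}/{job['job_id']}/")

    # Stage 2: ...and re-enqueue it, under a different queue ID, into a queue that is
    # rate limited for the third-party provider instead of for the user.
    requests.post(f"{SHARQ}/enqueue/sms_provider/mailgun/", json={
        "job_id": job["job_id"],
        "payload": job["payload"],
        "interval": 200,   # e.g. at most 5 jobs/second toward the provider
    })
```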
And the question about what happens if there is no finish request: SharQ can be configured with a timeout which says, if no finish request comes from the worker within this time, re-queue the job. So SharQ waits for that time and checks whether the finish request has come; if it has not, it pulls the job out of its internal data structures and puts it back into the queue, so that it can be dequeued again by any other worker. Okay guys, thank you. Maybe we can catch up after lunch and discuss more. Thanks. — Thank you, Sandeep. We have lunch available.