The next speaker is going to be Vishrut, and Vishrut is going to talk about how to build a production-ready distributed task queue management system with Celery. A little bit about Vishrut: he's going to make a lot of bad jokes, apparently, but when he's not making those, he's a software engineer at Grofers, and he's working as a data science lead at a VC-funded edtech startup called Leverage EDU. He embarked on his Python journey at the start of his college, but he's always learning a lot more. He was also a METI Japan internship scholar, and he's won a lot of hackathons and stuff. He's highly enthusiastic about teaching, and we're going to see a little bit of that now. So Vishrut, I'm going to welcome you to the stage. Please take it away. So hi everyone, I'm Vishrut Kohli. First of all, I would like to thank all of you for attending this talk, and I'm really excited to be here. I started working with Python, or should I say fell in love with Python, in my early days of college, and now I am using it as my primary tech stack at work, where we use it at tremendous scale. I am working as a software engineer at Grofers. If you do not know about Grofers, it is one of the most popular grocery e-commerce companies in India. If you want to chat about anything else or about this talk, you can reach me through my LinkedIn or through my personal portfolio. So without any further ado, let's dive into the good stuff. Today we are going to see how to build a production-ready distributed task queue management system with Celery. And when I say production ready, I mean one which is highly efficient, scalable, transparent, and resilient. 
So in this talk, we are going to cover: what task queues are and why we need them; what Celery is and why Celery; building a distributed task queueing system; tuning it to get maximum performance; adding resiliency, or self-healing capabilities, to the system; what to do in times of SOS or emergencies; monitoring the system we build; and most importantly, bad jokes. I've tried to make this talk as descriptive as possible, but still there are some basic prerequisites: some basic knowledge of Python, some basic knowledge of web development, and having worked with or even heard about Celery before. And the most important one is a sense of humor and a love for GIFs, because there are a lot of GIFs in this. So let's start with task queues. Let's assume I own a mall and I want to keep track of how many people are entering my mall. So I installed a small IoT sensor at my entrance, and whenever someone enters my mall, it shoots an API request to my web server. Then the request goes to the database and increments a counter, and at the end of the day, I can just check my database and see the count. This system was working pretty well for me. Then one day I thought I'd stream a football match in my mall. So a lot of people came to the mall, and I was really excited to see the numbers in my database. But when I checked my database, the numbers I observed were relatively low, and I knew something was not right. So I investigated and figured out that when a lot of people entered my mall, an API request was raised to my web server for each person, and there were a lot of concurrent requests trying to talk to the database. Due to the atomicity and locking at my database, many requests timed out, and that's why the count in the DB was low. So there's got to be a better way. And there is: task queues come to the rescue. So let's see what task queues are. If someone had asked me this question when I was giving my university examinations, I would have answered: a task queue is a queue of tasks. And that is exactly what it is. 
I don't know why teachers don't like those answers, but yeah, it fits perfectly here. So now, in the new architecture, whenever the web server gets a request, instead of going and trying to increment the counter in the database, it puts it into the task queue and returns a 200 response. And now the database can consume the requests from the task queue at its own pace. So we moved from a more real-time approach to a more eventually consistent type of approach, and that is okay for us, because I only needed to see the count at the end of the day. Okay, so what is and why Celery? You must have heard about task queues; there are a bunch of them available, like Amazon SQS, Amazon MQ, Redis, RabbitMQ. But building a consumption and publishing mechanism for those task queues is not that straightforward. To help us with that, Celery gives us a plug-and-play task queue management framework with which we can manage our distributed task queues with ease. In this talk, we are going to use some keywords, so let's just iterate over them once. We already know what task queues are from our previous example, but we'll just say it again: a task queue is a queue of tasks. Then there is the task. A task is the basic unit of work of a task queue, and a task queue can contain a number of tasks. Then here comes the worker. A worker is the basic unit of computation, which lies outside your application and where a task is processed. Then in line there is the broker. The broker, in layman's language, helps us with picking up an offloaded task, putting it into a task queue, and delivering the task to the worker from the task queue whenever the worker wants to process it. And the last one is the result backend. It is a highly available database which is used by Celery to keep track of all the tasks and their results, along with storing all kinds of metadata for Celery. Some examples of result backends are Redis, Memcached, et cetera. 
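To make the mall-counter architecture concrete, here is a toy, stdlib-only sketch of the idea (this is not Celery itself, just the pattern): the web handler only enqueues an event and returns 200 immediately, and a single consumer drains the queue at its own pace, so concurrent requests never fight over the counter.

```python
import queue
import threading

events = queue.Queue()   # the "task queue"
visitor_count = 0        # the "database" counter

def handle_request():
    """Web server handler: offload the work and return 200 right away."""
    events.put(1)
    return 200

def consumer():
    """Drains the task queue and updates the counter serially."""
    global visitor_count
    while True:
        item = events.get()
        if item is None:        # sentinel: shut down
            break
        visitor_count += item   # no concurrent writers, no lock contention
        events.task_done()

worker = threading.Thread(target=consumer)
worker.start()
for _ in range(1000):           # a football-match crowd of requests
    handle_request()
events.put(None)
worker.join()
print(visitor_count)            # eventually consistent: all 1000 counted
```

The count is eventually consistent rather than real time, which, as in the story, is fine when you only read it at the end of the day.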
So before we start building the system, one question arises: which broker to choose? There are a bunch of brokers available, like RabbitMQ, Redis, et cetera. They are all great pieces of software, but each works best for its own specific use cases. I'll cover the most common ones, RabbitMQ and Redis. If you are looking for a highly efficient broker which supports several workers consuming from different queues, and which also offers some kind of persistence of tasks when it shuts down, then no doubt RabbitMQ is the way to go. But RabbitMQ is a little more time consuming to set up and maintain. On the other hand, if you just want to use your broker as a quick messaging system, Redis is the way to go, as it works really well for quick tasks and is very easy to set up too. Okay, so let's start building the system. Let's think of an e-commerce warehouse to build. There are going to be mainly three things which happen there: picking of the products, packing of the products, and delivery of the order. The most basic kind of architecture for my warehouse would be something like this: I have one boy who picks the products, packs the products, and delivers them. And this worked for me for some time. But now the orders are increasing and I want to scale my setup. So I employed another girl in the warehouse, and now they are both picking, packing, and delivering the products in parallel. And this is fine, as when more orders start coming in, I'll just add more people to the warehouse. But I think I can improve it a bit further, because I know that these two people are really good at picking, but they are lousy at packing, and they don't even have delivery bikes to deliver. So what if we break this work into smaller fragments and get specialized people to do what they do best? Let's see this. So now those two people are just doing the picking, because they were good at it. 
I added an experienced packer who has his own packaging station and everything, and I added people with delivery bikes to deliver more efficiently. So this way, we had one big task, we broke it into smaller tasks, and we executed them in order. In further slides, we will call this our pipeline. So why should we even use pipelines? There are a bunch of advantages we get by using pipelines, so let's go through them one by one. First, it gives us the ability to see bottlenecks and scale smaller components of the system instead of the whole system. For example, if I now see that there are a lot of orders pending to be packed, I can just add more people to the packing worker and scale the packing operations, instead of scaling the whole pipeline like we did earlier with the girl. Second, this gives us the ability to give different kinds of machines to different tasks. As per our example, we can see that the packing worker needs a packaging station, but a delivery worker needs a delivery bike. The same thing happens in our tech systems: different tasks need different kinds of infrastructure. Some might need more CPU, others might need more memory. Third, it helps us keep track of the status of tasks, and it adds some kind of resiliency to the system by enabling retries at every step. So now if a task fails, it will not be retried from the beginning, but will get retried from the last checkpoint, or the last succeeded task in the pipeline. Okay, so now let's assume we have a sale going on, we have a lot of orders pouring in, our warehouse is already full, and we can't even add more people to the warehouse. So we have two ways. The first thing we can do is buy a bigger warehouse, move all the operations from the smaller warehouse to the bigger warehouse, and add more people to it. In tech terms, we call it vertical scaling. 
On the other hand, we can purchase another makeshift warehouse of the same size, add more people there, and run these two warehouses in parallel, whilst the operations inside them are concurrent. In tech terms, we call it horizontal scaling. In my case, horizontal scaling makes more sense, as the number of orders is variable, and after the sale ends, one warehouse will be able to handle all the orders alone. Then I can just shut down my new makeshift warehouse. So the code for our application would look something like this. We have an order receiver API, which receives an order and offloads it to the picking worker, which is the entry point of our pipeline. And the code for our pipeline is something like this: it starts with the picking worker, which picks up stuff from the aisle and passes it to the packing worker. The packing worker packs the stuff and passes it to the delivery worker. And last, the delivery worker delivers the stuff in time and makes the customer happy. Okay, so now we have built our system, but we don't know how well it performs. So first things first, it is always better to benchmark before moving to any further optimization, because in my experience, I have seen that if we go by intuition, we either end up over-optimizing the system or optimizing the wrong parts of the architecture. For example, in our pipeline, when I ran a load test, I saw that the number of tasks queued at the picking worker was much higher than at any other worker, so I knew where I had to start optimizing. So let's ask ourselves this question: can we use batching? Let's look at what happens in the picking task. Here, a person is assigned an order; they go to the aisle, pick up that order, and pass it to the packing worker. Now assume you have a lot of orders coming in (it's a sale), and to cater to them, you added a lot of people to the picking worker, and everyone is trying to get something from the aisle. 
As lots of people will be crowding the aisle, there will be some kind of wait time for every picker to pick their order. The exact same thing happens in our concurrent systems: the aisle acts as our database, and the people act as our concurrent threads. To solve this problem, we can introduce batching. Instead of one person picking up one order, we can make one person pick up 10 orders. This way, we decrease our trips to the aisle, and to our database, by 10 times. But as we know, every good thing also comes with a trade-off: now your retries and failures also happen at the batch level. So if the ninth order failed for some reason in a batch of 10, the whole batch of 10 will still be retried. If you are okay with this trade-off, this can definitely decrease the load on your database and increase our performance. There is not much change in the code for our application, but instead of offloading to the picking worker like before, we now offload to the order aggregator worker. And the code is also pretty much the same; just one more task named order aggregator is added, which contains the order chunking logic, and instead of passing just one order to the picking worker, it passes a chunk of orders to the picking worker. The next optimization: always split tasks into IO-bound and CPU-bound tasks. IO-bound tasks are tasks in which the thread blocks the CPU and waits until an input or output is received. This makes the CPU unusable for the time it's just waiting. These kinds of tasks can be optimized with the help of the gevent or eventlet pool, which enables a non-blocking IO approach: the thread goes to the CPU, registers its request, and does not block the CPU; and whenever its input or output is ready, the CPU raises a callback, and the thread goes and collects it. This way, our CPU is never blocked by concurrent IO processes. 
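The order aggregator's chunking logic can be as simple as this stdlib sketch; the function name and batch size are my own illustrative choices, not code from the talk:

```python
def chunk_orders(orders, batch_size=10):
    """Group incoming orders into batches so one picker makes one trip
    to the aisle per 10 orders instead of 10 trips.
    Trade-off: a failure now retries the whole batch, not one order."""
    return [orders[i:i + batch_size] for i in range(0, len(orders), batch_size)]

batches = chunk_orders(list(range(25)), batch_size=10)
# 25 orders become three batches of 10, 10, and 5
```

Each batch would then be passed as a single task payload to the picking worker.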
On the other hand, a CPU-bound task is a task which uses the CPU for crunching numbers or doing CPU-intensive work. For these kinds of tasks, we should use the prefork pool, as it is based on Python's multiprocessing module and helps run parallel processes on multiple cores. And all this is very easy to set up too: you just need to pass the pool name and the desired concurrency in the following command, and you will spin up a new worker with the provided configuration. Okay, next: use the -Ofair optimization when possible. This is quite interesting. By default, Celery uses a round-robin approach to distribute tasks among workers. So if you have a set of tasks that take varying amounts of time to complete, either deliberately or due to unpredictable network conditions, this will cause unexpected delays in the total execution time for tasks in the queue: you might end up having tasks queued at some workers whilst other workers are idle. To solve this problem, you can use the -Ofair optimization, which distributes tasks according to the availability of the workers instead of round robin. This option comes with a coordination cost penalty, but results in much more predictable behavior if your tasks have varying execution times, as most IO-bound tasks do. Next: keep track of results only if you need them. I told you about the result backend in the beginning, which stores all the metadata, statuses, and results for Celery. If you know you're not going to use them anywhere in your application, you can disable them, which decreases the number of network calls to your highly available database, and that can give you some amount of optimization. Okay. So now we'll see how to add some kind of resiliency, or self-healing capabilities, to the system. I think we all agree with what Sentry.io's tagline says: software errors are inevitable, but chaos is not. And that is so true. 
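The worker commands being described might look like the following; `warehouse` is a placeholder module name, the queue names come from our example, and the concurrency numbers are illustrative:

```shell
# IO-bound tasks: gevent pool with high concurrency for non-blocking IO
celery -A warehouse worker -Q delivery --pool=gevent --concurrency=100

# CPU-bound tasks: prefork pool (multiprocessing), roughly one per core
celery -A warehouse worker -Q packing --pool=prefork --concurrency=8

# Fair scheduling: hand a task to a worker child only when it is free,
# instead of round-robin prefetching
celery -A warehouse worker -O fair
```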
So the most basic version of resiliency is to enable auto retries in times of failures. And you can also add a circuit-breaking element: for example, I've set five as the max number of retries, and if a task is retried five times and still fails, it will be ignored, so that we don't fall into an infinite loop. To make it more resilient, we can add exponential backoff. For example, say your task is dependent on another service, and that service is down, and let's assume the time between consecutive retries is 10 seconds. If the service is down, my first retry will happen at 10 seconds, the second one at 20 seconds, the third one at 30 seconds, the fourth one at 40 seconds, and the last one at 50 seconds. So in this case, I gave 50 seconds of breathing time to the other service to come back up so that I don't lose my task. To increase the amount of breathing room we have, we can use exponential backoff, which means the first retry will happen at 10 seconds and the second one at 20 seconds, same as before, but the third one at 40 seconds, the fourth one at 80 seconds, and the last one at 160 seconds. So now the breathing time is increased from 50 seconds to 160 seconds, and if you want more breathing room, you can just increase the exponential backoff. Next up is acks_late=True. As the name suggests, this means late acknowledgement. By default, a broker marks a task as acknowledged when it is delivered to the worker. But if a worker goes down and restarts, we lose that task. So to make your system resilient towards worker failures or infrastructure failures, we can use acks_late=True, which means that until and unless the task is processed by a worker, it will not be marked acknowledged. So even if the worker goes down, the broker delivers the same task to it again, as it was still stored in the broker and was marked unacknowledged. Okay, and the last argument is retry_jitter=True. 
This parameter is used to add some kind of randomness to the system. Let's assume we have a concurrent system, and there are chances that two tasks are trying to access the same database resource. When they execute, they will form a deadlock and fail, because they are trying to access the same database resource. We have our automatic retries enabled from before, so they'll get retried again, but at the same time as each other, so they will form a deadlock and fail again. And this will be repeated till the circuit breaks. In situations like these, we want some kind of randomness in the retries so that they do not get retried again and again at the same time, and that is why retry_jitter is helpful. And if you want to keep track of your circuit-break failures, you can use a DLQ, or dead letter queue, to store your failed tasks. So when your system is down, the first thing you should do is check your CPU and memory utilization. If your CPU utilization is high, then maybe scaling horizontally or vertically according to your infrastructure can help. But if your memory utilization is high, and you know for a fact that your code is not using that kind of memory, there are chances there is some memory leak in your code. I know what you're wondering: a memory leak in Python, that is impossible. And I am with you: if you are working with core Python, that is close to impossible. But many of the libraries you are using are built using C extensions, or even the Python interpreter you are using may have some kind of memory leak. So there are chances there will be some kind of memory leak happening under the hood, which is not in your hands. To solve that problem, Celery provides two thresholds: max-memory-per-child and max-tasks-per-child. With the help of these, you can set a threshold either on the number of tasks executed by a process, or on the amount of memory being used by a process. 
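The two thresholds translate to worker flags like these; `warehouse` is a placeholder module name, the numbers are illustrative, and note that `--max-memory-per-child` is specified in kilobytes:

```shell
# Recycle each worker process after it has executed 100 tasks
celery -A warehouse worker --max-tasks-per-child=100

# Recycle each worker process once it is using more than ~200 MB
celery -A warehouse worker --max-memory-per-child=200000
```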
And when either threshold is reached, Celery rotates the process and clears out the stagnant memory, so that we do not get OOM-killed, or out-of-memory, errors. So, when we are running something in production, we should have the capability to keep an eye on it, and Flower works really well for that. With just one command, you can set up a full-fledged monitoring tool for your Celery setup. It gives you capabilities like purging queues, viewing acknowledgement rates, viewing and modifying queue and worker instances, and viewing scheduled tasks. It has an HTTP API for almost all the data points available, so that you can integrate it into your own monitoring dashboards as well, and you can also use those endpoints to configure alerts, so that you know beforehand if your system is going down. If you're using RabbitMQ as your broker and you are more comfortable with RabbitMQ instead of Flower, you can use the RabbitMQ admin panel to monitor your system at the broker level itself. It also gives features such as purging, deleting, and monitoring queues, et cetera, just like Flower. So to conclude, in this talk we understood why pipelines are better, how to tune Celery configuration to get maximum performance, how to make your Celery setup resilient or self-healing, what to do when unknown things are hogging your memory resources, and how to keep an eye on our system. If we follow all these steps while building our system, we will face a lot fewer issues, our system will be production ready, and we will sleep soundly. So that is it from me. If you have any questions or feedback, please share; as it was my first talk, I'd be happy to work on it. Also, if you didn't like the presentation, I'm also open to taking virtual tomatoes. So yeah, that is it from me, and over to you, Avinan. Sure. First of all, let me begin by saying that you've paced this pretty perfectly in terms of time, so thank you for that. 
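For reference, the one-command Flower setup mentioned above is roughly this (assuming `pip install flower` and a placeholder module name):

```shell
# Start the monitoring dashboard, then browse to http://localhost:5555
# for queues, acknowledgement rates, workers, and the HTTP API that can
# feed your own dashboards and alerts
celery -A warehouse flower --port=5555
```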
And, you know, you're in a very safe spot with respect to tomatoes; you won't be receiving them, I suppose, because it's virtual. We have three, four more minutes for questions and answers, because the next thing is a break anyway. And there are a bunch of questions, actually; there are too many to be answered, so we'll take maybe the last few, but the rest of them we will make available to you so that you can answer them on the Zulip chat. And if you can provide your contact information, an email ID or a Twitter handle, we'll get in touch with you, similar to how we did it with the channel, right? So I think we'll take a couple of questions now. One of the questions is: how can I put a delay in a queue so that every task inside a queue will be picked up by a worker with a predefined delay? Yeah, so basically, if you are using Celery, Celery provides a feature called countdown. Countdown means that if you want to have some kind of delay before your task is executed, you can use that feature. If I put a countdown of 20 seconds, the broker will deliver the task to the worker, the worker will put it in its memory, and whenever the countdown is finished, it will get executed. It has one drawback: if you have a lot of tasks counting down, it might put some kind of memory constraints on your worker. But yeah, you can use countdown if there are not too many such tasks. Okay, I think we have time for two more questions. Next: what kind of data can we send to Celery? I think that's a generic question. Any kind of data which can be serialized in JSON or XML or even pickle, you can send it. So if your data is serializable in any of these formats, you can send it over Celery. 
So one more question was: is it possible to schedule tasks using Celery? And there were a lot of answers for that in the chat, mentioning Celery Beat and cron and Airflow. But if you have any more suggestions in your arsenal, you can mention them. Yeah, I think people have answered already in the chat. But yeah, if you want to use only Celery, there is Celery Beat, which is a very neat piece of software, and you can use it to schedule tasks. It just takes one instance, and your whole distributed task system can use it to schedule tasks. Okay. So another question says: but countdown means it'll be executed late, but not picked up late, right? Yeah, it will be executed late, but it will be delivered to the worker. If you want it to be executed after some kind of delay in terms of time, then yeah, it will be executed after that time, but it will be picked up from the broker and delivered to the worker right away, and it will stay unacknowledged until it gets processed. I hope you've got your answer. And yeah, there are a lot more comments, but can you provide some contact information? Wait, I'll just share that screen again where I had all my contact information; I should have added it at the end. But you can contact me through LinkedIn. Actually, I just watched The Social Dilemma, so I'm having my aversion towards social media, but yeah, you can contact me through LinkedIn; I'm pretty much available there. Or you can email me; I'll put my email ID in the stage chat also, but it is kohlivishrut@gmail.com. Great. I think there's a mild delay. I can take a screenshot of the contact information to be provided later. Okay, after the talk, I can just go to the stage channel and add the information there also. Great. Okay, so again, folks, thank you so much for being such a participative audience. Thank you to Vishrut for an awesome talk. It was very clear. 
And you know, participation, even if it is online, is a great indicator of how immersive the talk must have been, and I think it has been great. And I know it's a tough act to follow Wesley, but you've done an awesome job. Thank you. Thank you so much.