Hey everyone, thank you so much for coming out today and sticking around till this moment. I know a lot of technical information has been thrown at you today, and I'll probably add to that. My name is Deepak Pathania. I work as a senior software engineer at IND-VILT, a fintech startup based out of Gurgaon. Today I'll be talking about building painless scheduling systems in Node. I'm available on Twitter as Deepak Pathania, so if you'd like to see me retweet a lot of cute dog pictures and occasionally talk about tech, you can follow me there.

So let's get started. Since we're going to talk about building painless scheduling systems, let's first define what a scheduling system is in our context. It's basically a system that allows users to execute specified tasks at specified intervals. These tasks are recurring tasks that you'd like to automate; you know they're bound to happen again after a certain interval. In most cases, it's just a fancy name for a system that runs a bunch of crons with an orchestration layer on top. That, for us, is a scheduling system.

The next thing in the title is building a painless scheduling system, so let's talk about some of the pains people face trying to build a system like this. The first is retry mechanisms. When you're building a scheduling system, a lot of things will fail, so implementing a good retry mechanism that copes with system failures is very important, and it's a pain a lot of people run into. The second is failure isolation. When you're building a scheduling system that might run tens of thousands of schedules at a time, you want to isolate failures: you do not want 10 failed tasks to affect the rest of the tasks.
So you probably want a good failure isolation strategy as well. The next pain is monitoring. Monitoring crons and scheduling systems gets incrementally harder as you reach a certain scale. There are a lot of out-of-the-box solutions on the market for cron monitoring, but I haven't seen one that covers all the cases efficiently. Monitoring doesn't just need to cover performance; it also needs to cover the first two points, whether your system is implementing retries and failure isolation efficiently or not. So that's a pain as well. The last one is testing, which is probably the last thing that comes to people's minds when building a system like this. You want to be sure that when your system executes at a larger scale, and when tasks scheduled for the future actually run, they'll work well in that execution environment too.

Now that we know what a scheduling system is and the pains people face building one, let's look at the problem statement we're trying to solve. The talk is structured like a case study, where we discuss a specific problem, so at the beginning I'd like to clarify that this is not a one-size-fits-all solution; it was built keeping in mind both the liberties and the constraints we had.

So let's look at the problem statement. We want to build a system that allows users to set up schedules for different tasks. A user can come in and say, hey, I want your system to perform this task T1 for me, and I want you to perform it daily. And we'll be like, okay, at your command, sir, I'll perform task T1 daily for you. Another user can come in and say, hey, I want to perform this task T2, and I want it executed weekly.
So the system basically allows users to set up schedules like this for the tasks they provide. A little about those tasks: they are very database intensive, meaning they interact a lot with our existing database to actually perform the operations they're supposed to do.

One very important thing to note is that eventual consistency is acceptable for us. What does that mean? You might be building a scheduling system where you want something to happen with millisecond-level accuracy: hey, I want this task done at 6:01 every day, or at a specific minute past every hour. We do not have that constraint. In our case, the minimum period a user can choose for a task is daily; the options are basically daily, bi-weekly, weekly, and monthly. So we have the liberty to play around a little: if a user said, hey, I want this task daily, and it happens at 6:05 instead of 6:04 the next day, we're still okay, because for them the period was daily. That's an important liberty we had while building this system.

The third thing is that this is an experimental project. The requirements are constantly changing, and there's only one backend developer, which is me. We need to keep that constraint in mind as well.

So the first thing we did was evaluate the good old build-versus-buy problem. There are enterprise-level solutions like Google Cloud Scheduler, and we also have scheduled Lambdas nowadays. We could have used something already built and saved ourselves all the headaches we're trying to solve by building this, but we realized that the high cost in setup time for some of these systems was way too much of an opportunity cost for us.
Since the requirements themselves were constantly changing, we were not sure whether, by the time we finished setting one of these up, the requirements would remain the same or not. So we wanted to build a POC first that would let us validate whether the solution actually worked for our problem, and once we were sure about that, we could migrate to a managed solution later. We wanted to iterate quickly and not spend too much time dealing with a third-party solution.

One important thing I noticed is that there are managed solutions that let you expose certain endpoints: they'll say, okay, I'll hit your endpoint at this particular time, you send me the data for the scheduled tasks, and I'll perform the task with that data. Solutions like those were way too bandwidth intensive for us, because of the constraint we talked about: our tasks are very database intensive. So we could not go that route either.

Once we had worked through build versus buy, the next question was what to build the system in. What should the tech stack be, what should the language be? We decided to go with Node for two reasons. One, the existing product this was supposed to integrate with was built on Node, so it allowed us to work seamlessly with that. Second, I was the solo backend developer, and I know Node, so it was my word against no one else's. So we went with Node; no fancy language comparisons for you there, sorry. And as any good Node developer does when starting a new project, I did the same thing.
I went to npm and looked for a package, hoping somebody had already done half the work for me. I typed in all the good old terms: schedulers, crons, and so on. And we found cron. cron is an npm module that allows you to set up cron jobs via a code interface, rather than having crontab specifications that run directly on your server. It was a pretty good fit: the code base was well maintained and the documentation was good, so we decided to pursue it.

Before we dig deep into cron, a quick refresher on the good old star syntax. The Node cron syntax is very similar to the crontab specification, with the exception that it also takes an optional seconds field, so you can optionally schedule things down to the very second.

Let's quickly walk through some examples. In the first, there are six stars, and since the optional seconds star is specified, the task executes every second. In the second, a slash step separator is used on the seconds field, which means run at every such interval, in this case every 10 seconds. The difference between the second and third examples is that the third leaves out the optional seconds field, so the same expression now means every 10 minutes, because the first field has become minutes. And in the last example, we replace stars with zeros to pin a specific time, and with the comma value-list separator we specify that the task should execute twice a day, at 5 a.m. and 11 a.m.
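The four slide examples can be sketched as data. The exact expressions below are my reconstruction from the description above, assuming the cron package's optional leading seconds field:

```javascript
// Reconstructed cron expressions from the slide (assumed, not verbatim).
// With six fields the first field is seconds; with five it is minutes.
const examples = {
  '* * * * * *': 'six stars: run every second',
  '*/10 * * * * *': 'step (slash) separator on the seconds field: every 10 seconds',
  '*/10 * * * *': 'no seconds field, so the first field is minutes: every 10 minutes',
  '0 5,11 * * *': 'comma value-list separator: run at 5:00 and 11:00 every day',
};

for (const [expression, meaning] of Object.entries(examples)) {
  console.log(`${expression.padEnd(15)} -> ${meaning}`);
}
```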
You can read a lot more about the cron scheduler syntax online; I've mentioned a link here that should help. Now that we know what the star syntax looks like, let's look at what a basic schedule looks like with the cron module. Scheduling a cron here is as easy as instantiating a new object: we require cron as a dependency, and at the bottom we instantiate an instance with the new syntax.

Let's go through the parameters one by one. The first is cronTime, the good old star syntax we just looked at; I hope you can make out the interval this will run at. The second is an onTick function, which gets triggered at the specified interval; if the cronTime says every second, this gets triggered every second. onComplete is an optional function that runs after your onTick function has finished; if you pass null, only the onTick function executes. Then there's the start boolean, which specifies whether the cron job should start as soon as instantiation happens; if you pass false, you have to call job.start() to start it. In my case I'm passing true, so the job starts running as soon as it's instantiated.

And the very important last parameter is timezone. Timezone is a gotcha that has got a lot of people: if you do not specify your own timezone, the cron runs in a default timezone rather than yours. So specifying Asia/Kolkata here is very important if you'd like it to run in your own region.
As soon as we instantiate, we can see the output at the bottom: "done processing at" and the timestamp, which we log in the onTick function. There's a one-second difference between the three lines, which means it got triggered every second.

Okay, so now that we know what a cron looks like in the Node context, let's talk about the architecture of the flow we want to build. Do we want to instantiate a cron job for each task a user creates? Say three users come in and schedule three tasks; do we create a cron job for all three? Wouldn't that put a lot of pressure on our server as the number of users grows, because we'd have a lot of crons running? Also, people come in at slightly different times: this guy comes in at 5:45, I come in at 5:46, both of us want something daily, and we'd still be triggering two separate crons. So we decided that rather than giving each user their own cron, we'd have only one system-level cron, responsible for orchestrating all the tasks that users create.

Let's look at a simplified flow to understand what that looks like. A user comes in and creates a task with a schedule and a job: hey, I want to perform this job J1 on a daily schedule, can you do that for me? We're like, okay, we'll create a task for you. As soon as the task gets created, we also compute certain properties for it. The first is a next running time property, which specifies the next time this particular job has to run. If I come in and say daily, the next running time is tomorrow's date; if I'd said monthly, it would have been next month's date.
And then there's also a status, which is just an enum specifying whether or not the job has been processed for this cycle.

Okay, moving on. We set up a single system-level cron, which is basically a middleware for our Node app, and it triggers a process function after a set interval. So the users who came in simply created database records for their different tasks, and this cron middleware invokes a process function.

Now, what does the process function do? Great question. It does a bunch of things. First, it filters out the eligible records based on next running time and status. For example, if you said the schedule should be daily, then tomorrow it checks which entries have a next running time earlier than the current time, meaning they have to be triggered, and whose status shows they haven't been processed yet. So, just to reiterate, the first thing it does is filter out eligible records based on the two properties we computed when the task was created.

Next, it runs the associated job for each of the filtered tasks in parallel: we spin off a certain number of processes and batch the tasks across those parallel executions. Third, it updates the next running time as soon as it's done processing. For example, if I processed a daily task today, I then update its next running time to tomorrow, so that when process gets triggered again tomorrow, it still gets picked up in the filtering step. Are we clear at this point? Are we following? Can I get a yes if you're following till this point? Yeah, thank you so much. So at this point we've updated the next running time, and the next obvious thing is that we also update the status, right?
To mark it as done for the cycle. Just to reiterate the entire flow: a user came in and created a task by specifying a schedule and a job. As soon as the task was created, we computed certain properties for it, next running time and status. There's a system-level cron that gets triggered, which filters out eligible records based on those properties, runs the associated job for each filtered task, and then updates those properties so the task gets picked up again the next time the process function is triggered. Cool.

So is this it? This was not very complex, right? If we break it down, this was basic scheduling in the general context, nothing super complex. How did we manage to tackle all those pains we talked about earlier? Well, we kind of did, even though we kept it simple. Let's revisit those pains to see how this approach handles them.

The first pain was retry mechanisms and failure isolation. In this scenario, what happens if jobs fail for certain tasks? Well, you embrace failure. Notice that we update the next running time and status for tasks only once they have actually been processed successfully. If we do not update those two fields, then whenever the process function gets triggered again, it picks the failed tasks up again in the filtering step. So without doing anything else, simply by not updating these two properties, failed tasks get auto-picked in the next process cycle. And we can go a step further: increase the frequency of process, and have a bail-out mechanism for when there are no filtered candidates.
So imagine I bail out early when there are no filtered candidates and increase the frequency of process a lot: now I have an idempotent scheduling system with an implicit retry mechanism for all failed cases. Let's look at some code to visualize it better.

Say this is the system-level cron we configured: it's set to wake up every five minutes and invoke a process function. What does the process function do? First, it filters out the eligible tasks based on next running time and status, as we discussed: if the time right now is greater than the next running time of a task, and its status shows it hasn't been processed yet, it's an eligible candidate for execution. If there are no eligible tasks to process, we simply bail out early. If there are, we run a separate parallel execution for each of those tasks, and after processing them we update the next running time and status. This gives us an implicit retry mechanism, simply by not updating the filtering attributes for failed jobs. Very simple, but it works very well.

The second question is: how do you pause or resume some of these tasks? Say a user comes in and says, hey, I told you to run that task weekly, but I don't want it to run this week. But I've already batched all my tasks, I've already created parallel execution flows for them; how can I back out at the last moment? What do you do then? Well, you simply toggle the filtering attributes.
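Putting those steps together, here is a condensed, runnable sketch of the process function. The in-memory tasks array, the field names, and intervalMs are my stand-ins for the actual database code, and the function is named processTasks to avoid shadowing Node's global process object:

```javascript
// Stand-in for the database table of user tasks.
const tasks = [];

async function processTasks() {
  const now = Date.now();

  // 1. Filter eligible records: due now and not yet processed this cycle.
  const eligible = tasks.filter(
    (t) => t.nextRunningTime <= now && t.status === 'PENDING'
  );

  // 2. Bail out early if there is nothing to do.
  if (eligible.length === 0) return;

  // 3. Run the associated job for each filtered task in parallel.
  await Promise.all(
    eligible.map(async (task) => {
      try {
        await task.job();
        // 4. Update the filtering attributes only on success; a failed task
        //    keeps its old values and is implicitly retried next cycle.
        task.nextRunningTime = now + task.intervalMs;
        task.status = 'DONE';
      } catch (err) {
        // Swallow per-task errors so one failure cannot affect the rest
        // (failure isolation); real code would report this to monitoring.
      }
    })
  );
}

// The single system-level cron would invoke this every five minutes, e.g.:
// new CronJob('*/5 * * * *', processTasks, null, true, 'Asia/Kolkata');
```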
If you update the next running time of that particular task to the next cycle, say it was a weekly task and you push the next running time out to the following week, it implicitly doesn't get picked in the next run of the process function. So you have an implicit pause for the next cycle. There's a lot of "implicit" being thrown around, but that's how it works: you simply toggle the filtering attributes to get your desired result. And if a user says, not just this cycle, I don't want this task to be picked up ever again, you simply update the status attribute so it never gets selected. Status and next running time were the two filtering attributes we used, so if you toggle the status to a certain enum value, the task never gets picked up again. Again, very simple, but effective.

The next pain was: how do you monitor your crons? As we discussed, not a lot of out-of-the-box solutions offer everything, and obviously you can't build your scheduling system and then also build your own monitoring system on top with only one backend developer. So we went back to basics: what is monitoring 101? Monitoring 101 was obviously Slack webhooks. We decided that if the environment we were running in was prod, we'd make an outbound call to a service that sends messages to Slack. In the error that bubbles up from the process function, we attach additional attributes: what did the retry look like, did we do the failure isolation properly or not, what was the time taken for the entire processing, what was the batch ID, and so on.
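The pause and cancel toggles described earlier amount to two small updates on the same filtering attributes; a sketch, again with my illustrative field names:

```javascript
// Pause a task for its next cycle by pushing nextRunningTime forward one
// interval; the filter in the process function will simply skip it.
function pauseForNextCycle(task) {
  task.nextRunningTime += task.intervalMs;
}

// Cancel a task permanently by toggling the status enum to a value
// the filter never selects.
function cancelTask(task) {
  task.status = 'CANCELLED';
}
```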
Then we do some fancy Slack message formatting, for which there are a lot of tools, and simply send that to a Slack channel. This gave us a very basic monitoring tool, not even a tool really, just an outbound call, but it worked very well for us: any time something looked odd, we could go in and check our system.

The third thing is how you test your crons, and whether you test them at all. The answer to the second question is yes, you obviously test them. The answer to the first is: you test the functionality, but you trust the timing. You do not test whether this function will get triggered properly next week; you test whether the function works properly when it is invoked.

To do that, we went with a REST-first approach. The way our scheduling system is built, any function that can be picked up by a cron can also be triggered by a REST API. So when our monitoring told us, hey, for some reason the cron did not work today, I could make a REST call and manually trigger it. This also pushed us to modularize a lot of stuff: all the scheduled tasks were structured so that they could be triggered by APIs as well as picked up by the scheduling system. And a modular approach allowed us to stub a lot of things. Stubbing, in a testing context, means you provide a fake function that gets called when a specific function is reached in the flow. Say your normal flow is working and it calls a function named hello; you can provide a stub which says that whenever hello is encountered, call this fake function instead of the actual hello. So let's look at a quick code sample.
There's a lot of code here, so let's focus. This test case says that invoking the process function should update the next running time for tasks after processing, which is a valid case. In the first callback, we create a dummy task with a default next running time that makes it eligible to be picked up in the filtering, and we pass on its ID. Then we create a job stub, which says that whenever processJob gets triggered, the function that processes each filtered task, the fake function is called and returns instead. Then we actually call the process method of the cron service. This particular dummy task gets picked up and processed, but during the processing the dummy function is called instead of the real job. And like a good tester, you assert at the end that once the task has finished, the updated next running time equals the expected next running time. For example, if the schedule was weekly, you assert that after the task has been processed, the next running time has been updated to next week's date.

So this allowed us to build a system we were very confident about: no matter the execution environment it gets triggered in, since it is very well tested, we can be confident about its execution.