Hi, I am Alay Patel. And my name is Maciej Szulik. And today, I'm going to take you down a long, winding, and bumpy road to CronJobs GA. The idea of CronJobs wasn't entirely new in the distributed world. An identical mechanism was already available in Google's internal cluster management system, Borg, which divided workloads into two basic categories: long-running services and batch jobs. For the remaining part of this presentation, we'll focus on the latter.

It wasn't long after Kubernetes was officially announced by Google in mid-2014 that this functionality would be needed in this project as well. My first exposure to the topic was in the spring of 2015, when a bunch of us here at Red Hat, along with our customers and the small community that had gathered around Kubernetes before its 1.0 release later that year, started tinkering with this problem and writing down our thoughts, first in an OpenShift repository in May-June of that year. Eventually, after reaching consensus, I opened a proposal back in August 2015. Only after we moved our proposal to the Kubernetes repository did more people, including the authors of the Borg paper, start looking at it. The proposal was sliced and diced in many directions. Eventually, we agreed to split the topic in two. First, the original proposal morphed from describing distributed cron functionality into a primitive which allows running a task to completion. Today, this is simply known as the Job resource. This way, our focus shifted slightly and we dove into implementing Jobs, which were part of the 1.1 release of Kubernetes. It's important to mention that back then, Kubernetes did not have API groups like it has today, so adding a new resource was challenging, to say the least. On top of that, Jobs and CronJobs were very often trailblazers in the areas of API groups and versions.

After shipping Jobs, in January of 2016 I was finally able to focus on the initial ask. What I haven't mentioned before is that back then, CronJobs were actually called ScheduledJobs. Naming is hard, isn't it? Honestly, I can't remember the reasoning behind the name, but I'm pretty sure it is still there in those proposals and discussions. At that moment in time, we all had a pretty good understanding of the topic, as well as the primitives and mechanisms to build ScheduledJobs. So as soon as the proposal was merged, we jumped into the implementation. Even though we initially targeted the 1.3 release, we had to delay the functionality until 1.4 due to the many challenges I've mentioned before. This way, in the late summer of 2016, we had ScheduledJobs in Kubernetes.

There is one important caveat which has a crucial impact on the rest of this presentation. As you see, the ScheduledJob controller was written in 2016. Back then, writing controllers was completely different from what it is today. The controller was periodically polling all ScheduledJobs and triggering the ones that were supposed to run. This seemed the simplest and most obvious way to solve the problem at the time. After that initial sprint, both with Jobs and later with ScheduledJobs, we slowed down and focused on other features and on fixing bugs. The major development worth mentioning here is when we decided to rename ScheduledJobs to CronJobs, which eventually landed in 1.5. The biggest challenge was that for a few releases, we actually supported both the old ScheduledJob and the new CronJob name, which wasn't an easy thing to do.
Eventually, in 1.8 we decided to promote CronJobs to beta, without too many changes to the API or the controller. That was back in 2017. Until 2020 hit us, from many angles. For us, the biggest impact was the proposal to remove all the so-called perma-beta APIs from Kubernetes in the near future. The community gathered around the special interest group devoted to running applications, or SIG Apps in short, debated the problem for every impacted API, since CronJobs were not the only resource stuck in that perma-beta state. We had many lengthy discussions, which you can check in the SIG Apps meeting notes and recordings. The outcome was that the best way forward was to first introduce a completely new controller, gradually switch between the controllers over a few releases, and only then promote the CronJob API to general availability. This would ensure that existing production clusters wouldn't be immediately affected by these changes. The main reason behind all this comes from how modern controllers are written. The previous controller was periodically polling the API server. Modern controllers, on the other hand, are only notified about changes to resources. This seemingly slight change has a significant impact both on performance and on the scheduling algorithm of the controller. But you will hear more about this from Alay, who approached me last year expressing his interest in writing that controller.

As Maciej mentioned, most modern-day controllers have evolved into a notification-based system to synchronize state. Let's take a brief look at what the architecture of a stock controller in core Kubernetes looks like. This will help a lot in understanding how the new CronJob controller is implemented. As you can see in the picture, there are two key elements to a kube controller, and they are connected by the queue. On the left, we have the shared informers giving us the add, update, and delete notifications. On the right, we have the other element, the one responsible for performing the sync actions. Upon any event from the informer, the resource event handlers allow us to push keys or objects onto the queue. The handler on the right then pulls those objects from the queue and performs reconciliation. The queue here makes sure that no two handler workers are reconciling the same key at the same time. This allows us to safely scale up the number of worker handler functions. The informer also acts as a cache and allows the controller to list or get objects from it. It is quite clear that if the new CronJob controller is implemented on this architecture, we would improve performance by reducing the number of calls to the API server. We would also improve on the scalability aspects.

Now, let's quickly dive in and look at some examples of the informers and event handlers. If you look at the code snippet, this is the CronJob controller using the Job informer to register three event handlers. Typically, a controller works on a set of objects. For example, the CronJob controller cares about events from Jobs as well as CronJobs. In order to queue the CronJobs effectively, it uses both the CronJob informer and the Job informer. The code snippet shows the addJob event handler. Further down the slide, you can see that the addJob event handler takes the object coming in via the event and typecasts it into a Job object. It then uses the Job's owner reference to find the appropriate CronJob for it, and enqueues that CronJob onto the queue.
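To make this concrete, here is a minimal Go sketch of that wiring; the Controller struct and function names below are illustrative stand-ins rather than the exact code from the Kubernetes repository:

```go
package controller

import (
	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// Controller is a trimmed-down stand-in for the real CronJob controller.
type Controller struct {
	queue workqueue.RateLimitingInterface
}

// registerHandlers wires the Job informer's add/update/delete events
// into the controller's work queue.
func (c *Controller) registerHandlers(jobInformer cache.SharedIndexInformer) {
	jobInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    c.addJob,
		UpdateFunc: func(old, cur interface{}) { c.addJob(cur) },
		DeleteFunc: c.addJob,
	})
}

// addJob maps a Job event back to the CronJob that owns it and enqueues
// that CronJob's key for reconciliation.
func (c *Controller) addJob(obj interface{}) {
	job, ok := obj.(*batchv1.Job)
	if !ok {
		return // e.g. a tombstone on delete; the real controller unwraps these
	}
	// Use the Job's controller owner reference to find the owning CronJob.
	if ref := metav1.GetControllerOf(job); ref != nil && ref.Kind == "CronJob" {
		c.queue.Add(job.Namespace + "/" + ref.Name)
	}
}
```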
So, to sum this up, resource event handlers are implemented in a way that converts the events from the informers into specific keys that are inserted onto the queue. Once we have the elements on the queue, the wiring of the workers or handlers that perform the sync operations is actually quite simple. The worker just loops on a function called processNextWorkItem. The very first thing that function does is attempt to get an object from the queue. This Get call is a blocking call: it will block until there is an element in the queue. Once the worker has the right key to process from the queue, it just passes that key to the sync function. The sync function is where the business logic of the controller lives. Once the sync function is done, the call to processNextWorkItem returns and the loop blocks again on the same call, the queue's Get function.

Looking at the slides more closely, one may wonder what exactly the requeueAfter variable returned by the sync function is. This is actually something unique about the new CronJob controller, and it is a nice segue into the scheduling aspect of this new controller. Along with reacting to updates from the API server, in order to implement this new controller we also need to handle the scheduling aspect. For example, in the architecture I described a couple of slides ago, the only way a worker performing the sync operation can be invoked is if there is an element in the queue. What if there is no add, update, or delete event around the next scheduled time of a CronJob? What will push the CronJob element onto the queue around the scheduled time? There is no guarantee in this architecture, so we have to handle this case of scheduling. As you can see in the picture, this is actually handled by the workers themselves in the new controller's implementation. Anytime a worker is done processing a CronJob, it will return, as requeueAfter, the time interval after which the next schedule for this CronJob is supposed to be triggered. It uses something called the delaying interface. The queue we use for this new controller implements this delaying interface, which allows us to insert elements into the queue after a specified time interval. So the time interval returned by the syncCronJob function is used to requeue the CronJob object at the right time on the queue. This is how we handle scheduling for the CronJob.
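To illustrate, here is a minimal Go sketch of that worker loop, reusing the same illustrative Controller stand-in as before; the AddAfter call comes from client-go's workqueue package, whose rate-limiting queue embeds the delaying interface:

```go
package controller

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// Controller is the same trimmed-down stand-in as in the previous sketch.
type Controller struct {
	queue workqueue.RateLimitingInterface
}

// worker drains the queue one key at a time, forever.
func (c *Controller) worker() {
	for c.processNextWorkItem() {
	}
}

// processNextWorkItem blocks until a key is available, runs the sync
// function on it, and requeues the key for the CronJob's next schedule.
func (c *Controller) processNextWorkItem() bool {
	key, quit := c.queue.Get() // blocks until an element is in the queue
	if quit {
		return false
	}
	defer c.queue.Done(key)

	requeueAfter, err := c.sync(key.(string))
	switch {
	case err != nil:
		c.queue.AddRateLimited(key) // retry with backoff on errors
	case requeueAfter != nil:
		c.queue.Forget(key)
		// The delaying interface puts the key back on the queue once the
		// next schedule time for this CronJob comes around.
		c.queue.AddAfter(key, *requeueAfter)
	default:
		c.queue.Forget(key)
	}
	return true
}

// sync is where the business logic lives; it returns how long to wait
// before this CronJob should be reconciled again.
func (c *Controller) sync(key string) (*time.Duration, error) {
	// ...omitted; see the description of the sync function below...
	next := time.Minute // placeholder for the time until the next schedule
	return &next, nil
}
```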
Now, once the enqueuing of elements onto the queue is taken care of, the next part of this new controller is how it implements the sync function. As I said, this is where the business logic of the controller lives. The very first thing it checks is whether the CronJob is suspended. If it is indeed suspended, this is a no-op for the sync function and it simply returns. If the CronJob is not suspended, it looks at the most recent scheduled time after the last scheduled time, and checks whether the current time has already missed the deadline for that most recent scheduled time. If it has already missed the deadline, the controller cannot do anything: it simply calculates the time difference between the next scheduled time and the current time, and returns this as requeueAfter so that the CronJob can be enqueued properly at the next scheduled time. If it has not missed the deadline, it goes on to check the concurrency policy for this CronJob. The current policies are Allow, Replace, and Forbid. Depending on the policy, it will take different actions. These are simple switch-case or if-else conditions in the controller. You can go check the syncCronJob function for the exact specifics of how this is implemented.

One note is that upon suspension, the sync function does not bother to calculate the time difference and enqueue appropriately based on time. The assumption here is that when the CronJob is suspended, unsuspending it means we would have to update the CronJob, and that event will come in via the API server, through the informers and the resource event handlers. We don't need to handle that case inside the workers themselves. Also note that the sync function I just described always performs the sync for schedules that are older than the current time. It always tries to requeue for schedules that are newer than the current time. This distinction is important to understand the workflow.

Now that the basic workflow of this new controller is established, let's look at some of the corner cases that come in here. One of the challenges is how to handle updates. It's one of the harder things to work through and wrap your brain around at first, but once that is done, it is actually quite simple to understand. The interesting part of updates is: what if the schedule is updated? We already have the key for this updated CronJob in the queue, reflecting the older schedule. What happens to that key? These are the kinds of questions that pop up. So anytime there is an update or change in the schedule, we use the updateCronJob resource event handler to enqueue for the next schedule time of the newer schedule. This could mean two things: either the next schedule time of the new schedule is earlier than the schedule time already on the queue, or the next schedule time is pushed further back and is later than it. These are reflected as the two possibilities in the picture, t1 - Δt and t1 + Δt, with t1 being the schedule time from the older schedule that is already on the queue. So these two possibilities can occur, and the second possibility, where the new schedule is pushed further back in time, is actually easier to handle. The worker will still be fired at t1. It will determine that it has either missed the schedule or already has a job for the last schedule, and do a no-op. It will then calculate the Δt and push the key back onto the queue accordingly. In this way, we make sure that at t1 + Δt we have a key on the queue to reflect this update. If the next schedule is earlier than what the controller has processed, the updateCronJob resource event handler will see this change and push the key for the earlier time. This just means that the earlier-time key will be fired first, and we will have a worker fired at t1 - Δt. So these are the two ways in which we handle an update of the schedule. It was one of the interesting things to implement when reworking this architecture.
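A hedged sketch of that update handler might look like the following; the exact handler shape and the cron parsing via github.com/robfig/cron/v3 are illustrative assumptions, not the verbatim controller code (the real handling also accounts for deadlines and time zones):

```go
package controller

import (
	"time"

	"github.com/robfig/cron/v3"
	batchv1 "k8s.io/api/batch/v1"
	"k8s.io/client-go/util/workqueue"
)

// Controller is the same trimmed-down stand-in as before.
type Controller struct {
	queue workqueue.RateLimitingInterface
}

// updateCronJob reacts to CronJob update events. When the schedule itself
// changed, the key already sitting on the queue reflects the old schedule,
// so we enqueue the key again for the next run time of the new schedule,
// whether that lands earlier (t1 - Δt) or later (t1 + Δt) than before.
func (c *Controller) updateCronJob(old, cur interface{}) {
	oldCJ, okOld := old.(*batchv1.CronJob)
	newCJ, okNew := cur.(*batchv1.CronJob)
	if !okOld || !okNew {
		return
	}
	key := newCJ.Namespace + "/" + newCJ.Name
	if oldCJ.Spec.Schedule != newCJ.Spec.Schedule {
		sched, err := cron.ParseStandard(newCJ.Spec.Schedule)
		if err != nil {
			// An invalid schedule is surfaced by the sync function instead.
			return
		}
		now := time.Now()
		// Requeue for the new schedule's next run; a worker fired at the
		// old time will simply no-op and recompute its own requeue.
		c.queue.AddAfter(key, sched.Next(now).Sub(now))
		return
	}
	c.queue.Add(key)
}
```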
Let's dig a little into another set of challenges, introduced by caches. One of the ways we are able to increase the performance of the system is by leveraging the cache, and the classic problem of a stale cache can manifest here as well. The CronJob informer and the Job informer are the two caches the new CronJob controller uses, and it is easy to imagine that if one lags behind the other, it could cause some problems.

For example, the CronJob controller creates the Job for a given schedule and puts it in the CronJob status, in the active list. This creates multiple updates: one at the Job informer and one at the CronJob informer. As you can see in the picture, these updates go directly to the API server, and the informers then see them. Depending on the exact scenario, it is possible that one informer sees an update while the other is still a little slow and does not see it yet. The specific example illustrated here is that the CronJob informer sees the update first. Its status will reflect an active job, but the Job informer is still slow and has not seen the update yet. When the controller tries to use the lister and get the Job object from the cache, it will get a 404 Not Found error. If this case is not handled appropriately, the controller could make improper decisions because of this lagging-cache problem. For example, it could create a duplicate Job for the same schedule. The way we handle this in the current implementation of the new controller is that anytime a potential problem due to a lagging cache is identified, one that could cause behavioral problems, we actually go to the API server and perform a live GET call to fetch a fresh copy. In the example earlier, if the Job is not found in the informer, we make a live call against the API server to see whether the Job is actually missing or we are just hitting the stale-cache problem. That way, if it is found on the API server, we process it accordingly and do not create a duplicate. There are two instances where this could happen, and the current controller handles both.

The lagging cache is not the only problem that caching introduces. The other problem is actually quite easy to run into. As I mentioned earlier, the controller uses two informers, which cache objects. Now, these informers can be shared across all the controllers in the controller manager; hence, they are aptly named shared informers. A caveat to this is that objects pulled from this cache should never be mutated in place. It is actually one of the easiest mistakes to make while writing a controller, and it has made it into the list of community guidelines for writing controllers. There is a link to that page in the slides; you can read it in depth. The solution to this problem is actually very easy: anytime we want to mutate an object from the cache, just create a deep copy of the object before mutating it. That way, other controllers will not see what you have mutated unless you explicitly submit it to the API server.
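Here is a minimal sketch of both cache-related patterns; the helper names getJobFresh and markActive are illustrative, not the real controller functions. The first shows the live GET fallback for a lagging cache, the second the deep copy before mutation:

```go
package controller

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clientset "k8s.io/client-go/kubernetes"
	batchv1listers "k8s.io/client-go/listers/batch/v1"
)

// getJobFresh consults the informer cache first and, on a NotFound, double
// checks against the API server, so a lagging cache cannot trick the
// controller into, say, creating a duplicate Job for the same schedule.
func getJobFresh(lister batchv1listers.JobLister, client clientset.Interface, ns, name string) (*batchv1.Job, error) {
	job, err := lister.Jobs(ns).Get(name)
	if apierrors.IsNotFound(err) {
		// The cache may simply be behind; ask the API server for a live copy.
		return client.BatchV1().Jobs(ns).Get(context.TODO(), name, metav1.GetOptions{})
	}
	return job, err
}

// markActive shows the deep-copy rule: objects read from a shared informer
// cache must never be mutated in place, so we copy before touching status.
func markActive(cached *batchv1.CronJob, jobRef corev1.ObjectReference) *batchv1.CronJob {
	cj := cached.DeepCopy()
	cj.Status.Active = append(cj.Status.Active, jobRef)
	return cj // submit this copy back to the API server, e.g. via UpdateStatus
}
```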
So, what do we get after all of this? One should ask, after the entire re-architecture and overcoming all the challenges, what advantages should we expect? Theoretically speaking, we should get fewer calls to the API server. This should make the CronJob controller more performant, because less time is spent waiting for those network calls. We should also put less stress on the API server, so the overall system is more performant. It is also really easy to scale up the number of workers. So, assume the case where a user knows that they have a lot of CronJobs in their cluster and the default of one worker is not sufficient for them. The user can safely increase the number of workers to scale up the controller.

So, how do we go about testing this performance improvement? In order to see the improvements, we actually ran some stress tests. We used a VM with 128GB of RAM and 64 virtual CPUs, a really beefy machine, to create a single-node cluster. We wanted to be intelligent about what kinds of CronJobs we create, so that we don't overload the entire system. So, 20 CronJobs were created that were to be scheduled every minute. Additionally, to test the controller under duress, we would create another 2,100 CronJobs with a schedule of running every 20 hours. This means the controller still has to process all the CronJobs, but it does not have to create the Jobs or the Pods for them. This limits the number of Pods concurrently running in the system and avoids overloading other parts of the system, like the kubelet or the API server. We would additionally create batches of 1,000 CronJobs until the total reached 5,120 CronJobs. With this sample workload, we would switch between the controllers with the feature flag and compare the old controller's performance with the new one's.

The old controller would really start to show performance problems when adding CronJobs in the thousands. For every thousand additional CronJobs added, the old controller would require an additional 2 minutes of delay in scheduling those 20 jobs it's supposed to schedule every minute. That is, with 2,120 CronJobs, instead of creating the 20 jobs every minute, it would take the controller about 3 minutes to create those jobs. This got quite drastic when the workload was increased to 5,120 CronJobs: about 8 schedules were missed before the next 20 jobs were created, so it would take about 9 minutes for the old CronJob controller to perform a re-sync of the 20 CronJobs. This linear increase in time between consecutive job creations when adding more and more CronJobs shows the scaling problems of the old controller. Comparing this with the new controller, we did not really see any lag in job creation. We actually went ahead and topped it up with an additional 1,000 CronJobs, making the total 6,120, and still did not see any visible delay in scheduling those 20 CronJobs. Now, obviously this is not a great real-life model, but the sample workload actually shows how the new architecture is more performant under duress and solves some of the scaling problems of the old one.

Along with the architectural change, there are a few smaller additions going into the controller. We now have a new histogram metric that users can go look at. This metric shows the skew between when a job is supposed to be scheduled and when it is actually created. We have more metrics planned for this controller in the future. There are also minor optimizations. One notable example is computing the next schedule time. It was observed that the old controller would use a library function to collect all the schedule times between the last schedule and the most recent one and store them in a list. Given that we only need the most recent schedule, and that the granularity CronJobs support is one minute, it is actually very easy to replace this for loop and use math to calculate the most recent schedule. This optimization helped us save some memory.
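As an illustration of that arithmetic, here is a minimal sketch assuming the simplest case of a fixed-interval schedule; the real controller has to handle full cron expressions, so this shows only the idea, not the actual code:

```go
package controller

import "time"

// mostRecentSchedule returns the latest schedule time that is not after
// now, assuming a fixed-interval schedule that started at earliest. The
// old approach built a list of every schedule time in a loop; a little
// integer arithmetic gets the same answer in constant time and memory.
func mostRecentSchedule(earliest, now time.Time, period time.Duration) (time.Time, bool) {
	if now.Before(earliest) {
		return time.Time{}, false // the first schedule is still in the future
	}
	elapsed := now.Sub(earliest)
	n := elapsed / period // integer division: number of completed periods
	return earliest.Add(n * period), true
}
```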
With this new controller, we were able to drive the CronJob API to general availability in Kubernetes 1.21. Lastly, I would like to thank the SIG Apps community for giving me the opportunity to work on this controller. It was an amazing ride. Hopefully, this also makes the user experience of CronJobs better. If you have any questions, we can take them now. Also, feel free to reach out to us offline. Thank you.