Hi, my name is Kevin, and welcome to my KubeCon talk. Today we'll be talking about Kubernetes CronJobs. Does anyone actually use this thing in production? The short answer is yes: at Lyft we currently run some 500-plus cron jobs in our Kubernetes clusters, and in this talk I'll tell you a little bit about what that's been like.

Before we get too far into that, let me introduce myself. My name is Kevin Yang, and I work on the Compute Platform team at Lyft, where we build out the Kubernetes environment that powers all of our services and compute needs. This has been a multi-year journey, and there's a lot that we've learned along the way that we're excited to share with the community. If you're interested in getting in contact with me, my Twitter and email links are posted in the slides.

All right, let's get into it. So what is this talk? I'll tell you some stories about cron job failures we've experienced operating a cron platform on Kubernetes at Lyft. Along the way, I'll try to surface and poke holes at some of the flaws we've seen in CronJobs, from a technical perspective but, more importantly, from a user-experience perspective. We'll discuss how we smoothed out some of these rough edges to deliver a better experience to the engineers who use our platform, and finally we'll cover some of the broader lessons we learned operating such a platform on Kubernetes and what it all might mean for you.

So why should you care? Maybe you're like us and you run a Kubernetes platform at your company, with dozens, hundreds, or even thousands of developers running compute workloads on your clusters. Maybe you use CronJobs directly.
Maybe you run lots of cron jobs, just like we do. Or maybe you're just looking for information on distributed cron scheduling and repeated scheduled-task platforms, and you want to see what Kubernetes has to offer. Regardless, one of the main things you should care about is what the user experience looks like for the people who use Kubernetes CronJobs. In particular, we found at Lyft that CronJobs are one workload type in Kubernetes that hasn't gotten as much love as some of the others, and I think there's a lot we can learn from dissecting the issues and figuring out how we could make it better.

But first, let's start with some story time. It's generally good practice to have observability for your cron tasks, so that you're alerted when they fail. At Lyft, developers most commonly do this by emitting metrics in their application code and using our alerting system to page the on-call when issues happen.

So imagine you're a developer at Lyft and you've just been paged for a cron job failure, and it's your job to investigate what's going on. You know a thing or two about cron jobs and Kubernetes, so you use some tools to start inspecting the status of all these different objects. First you might do a `kubectl get cronjob` and maybe look at its events. From there you can see that the CronJob has spun up a Job object, and from the Job object you can see that it spun up a Pod object to actually run your application code. The problem is, you just got paged for a failure, and by the time you actually look at these Kubernetes objects, the Pod is already gone. Those of you with some experience working with Kubernetes know this is far too common, and a pretty big annoyance when trying to debug application issues on Kubernetes, because Pods are ephemeral: oftentimes you can't look at them at all because they've already been terminated. You can't see any Pod events, and you don't know what the status of your containers was.

So what can you do? Well, you could look at your application logs, if they're shipped somewhere to a log pipeline, but more often than not you're stuck, and you have to wait for the next time the cron runs to catch it in the act. Meanwhile, you might ask your ops team, and the platform engineers are probably scratching their heads as well; there's nothing they can do either. Or worse, the cron job failed too many times in a row, maybe overnight, and now that you've been paged in the morning you have to go into Kubernetes and explicitly tell the job to run again. This is a pretty annoying thing to have to do, and it's definitely something we've run into at Lyft several times.

Now here's a different scenario that happened once. We had an incident that took down our cron environment for several days and took a lot of time and engineering effort to debug. There's a lot of interesting stuff going on here that helped us learn what happens with CronJobs under the hood, and it will help showcase some of the drawbacks and technical issues that impact CronJob performance.

The incident started with a lot of user reports coming in: users reporting that their crons were intermittently failing to run. Anyone who has worked at a company large enough to handle support requests knows there are varying levels of detail in what users provide, but some users put a lot of effort into their own investigation, and that helped us eventually root-cause the issue. One user filled out the ticket form and attached some of the investigative work they'd done. Specifically, they showed us a chart of log volume: every time their cron runs, it logs some messages, and for the times it didn't run, no logs were emitted. So we saw a chart where several iterations of their cron job emitted logs, showing their application code was executing, but every so often there'd be a hole, a gap with no logs, indicating that their application code didn't run.

As platform engineers we asked: is this something that happens on all crons, or just a few? Sure enough, most of the crons on our platform were fine; they had no such issues. But there were some crons where we saw the same kind of log gaps showing application code not running. So what gives?

At this point we started looking into what the CronJob controller actually does under the hood, and one of the things we did was turn up the log level of the kube-controller-manager, which runs the CronJob controller. Some background on the CronJob controller: it runs a sync-the-world operation every 30 seconds. Essentially, there's one big loop in the code that lists all CronJobs every 30 seconds and iterates through them one by one, doing whatever reconciliation is needed, whether that's invoking the cron job, doing some bookkeeping, or whatnot.

When we looked at this loop by printing out the controller's logs, we noticed something interesting. In our environment we had about 200 to 300 cron jobs at the time. For the first 20 or so iterations of the loop, that is, the first 20 CronJobs processed, everything was quick: the time to process each one was really low, and the controller was behaving as expected. Where we started seeing issues was after the 20th iteration: for the remaining 200-plus crons in the list, processing was nearly 10 times slower.

So what might explain this? It was rate limiting. During a sync operation, the CronJob controller sends API requests to the API server using the Kubernetes client that the kube-controller-manager uses, and what we noticed was that client-side rate limiting was bogging down the CronJob controller. Sure enough, the defaults for that client's rate limiter lined up with the loop iteration times we were seeing. It turns out that when you run 200 or more cron jobs, those times add up: you start seeing loop iterations that take longer than 30 seconds, and hence cron jobs get backed up as the controller is throttled.

So what did this mean for our applications, our cron jobs? Well, a lot of our cron jobs had a field called `startingDeadlineSeconds` set. This field essentially tells the CronJob controller to stop trying to run a scheduled invocation once it has been delayed by more than that duration. So what we'd see is this: right before the cron was supposed to run, say at time t1, the CronJob controller processed that CronJob for the last time. Then the controller's loop took longer than 30 seconds, and by the time it came back around, the `startingDeadlineSeconds` window had passed. So the next time the controller inspects that CronJob and determines whether it has to run, it sees that the starting deadline seconds has already expired.
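To make this failure mode concrete, here's a small, self-contained simulation. This is not Lyft's or Kubernetes' actual code, and the QPS/burst numbers are illustrative stand-ins for the client's defaults, but it shows the same shape: a token-bucket rate limiter lets the first `burst` iterations through quickly, then throttles the rest until scheduled runs slip past `startingDeadlineSeconds`.

```python
# Illustrative simulation of client-side throttling in a sync-the-world loop.
# The qps/burst values are stand-ins, not the controller's actual defaults.

def loop_duration(num_cronjobs: int, qps: float, burst: int) -> float:
    """Seconds to issue one API call per CronJob through a token bucket.

    The first `burst` calls are absorbed instantly; after that, calls
    drain at `qps` requests per second.
    """
    if num_cronjobs <= burst:
        return 0.0
    return (num_cronjobs - burst) / qps

def misses_schedule(loop_seconds: float, starting_deadline_seconds: float) -> bool:
    """A run is skipped when the controller next looks at the CronJob
    after its startingDeadlineSeconds window has already passed."""
    return loop_seconds > starting_deadline_seconds

# ~20 cron jobs: the burst absorbs everything, so the loop is quick.
small = loop_duration(20, qps=5, burst=20)   # 0.0 seconds

# 250 cron jobs: the loop now takes (250 - 20) / 5 = 46 seconds,
# longer than the 30-second sync interval, so processing backs up.
large = loop_duration(250, qps=5, burst=20)  # 46.0 seconds

print(f"20 jobs:  loop takes {small:.0f}s")
print(f"250 jobs: loop takes {large:.0f}s")
print("missed with startingDeadlineSeconds=15:",
      misses_schedule(large, starting_deadline_seconds=15))
```

With numbers like these, any CronJob whose deadline window is shorter than the loop's overrun gets skipped silently, which matches the log gaps our users reported.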
So the controller doesn't invoke it, and that's why our applications, our cron jobs, sometimes saw that they had missed their schedule: deep down, the CronJob controller loop had taken too long and had missed a particular scheduled time.

What did these experiences tell us? Well, we learned that cron jobs can fail in many surprising ways we didn't know about before. The first is the "too many missed starts" condition: if a cron job has missed too many start times in a row, the Kubernetes machinery gives up entirely on trying to run it and requires human intervention to start the cron job again. Next, we saw that API client rate limiting can be a factor: the CronJob controller's sync-the-world operation requires a lot of client calls to the API server, and these can get rate-limited, slowing the speed at which the controller can process and sync CronJobs. Combined with `startingDeadlineSeconds`, this can lead to some disastrous effects where cron jobs get missed entirely.

Aside from the technical issues, we also saw that cron jobs in general are quite difficult to monitor, understand, and debug. There's not a lot of observability that ships with CronJobs by default, and you have to know a lot about the system and the underlying behaviors of Kubernetes to understand what's going on. As we saw through our incident, it took a lot of effort and learning from the Kubernetes team to understand what was happening and root-cause the incident. Further, CronJob leaks a lot of the abstractions of Pods and Jobs: there are a lot of knobs to configure, especially regarding retries and concurrency behavior, and there's a balance to strike in order not to surprise your developers and your users.

We also saw that failures are quite difficult to recover from. In the example of the on-call who got paged, it would be nice if they could just rerun the cron job when it failed, say because a service their application depends on was down. And from a platform engineering perspective, we really didn't have a great way of understanding the performance of the platform. We got a lot of user reports of issues, but it would be nice if we could alert on these ourselves and not have to rely on users to tell us something is broken.

So what did we do about this? One of the things we did was enhance the observability of cron jobs on our Kubernetes platform by introducing trace points at various stages of a cron's lifecycle: when the CronJob controller decides to invoke the cron, when the application code actually starts, when the application code finishes, whether it was retried, and whether it exited successfully. At each of these points we emit metrics, so we know details of the cron's performance, like the start delay: how long did the cron job take to actually start running application code from when it was expected to run? Or the runtime of the application container itself, and the exit codes. These are all things our developers want to know and want to be alerted on, but didn't want to have to write and maintain themselves. As a platform, we were able to have all these metrics built right in by default, with alerts created for any cron job onboarded onto our platform, and this allowed us to get rid of a lot of bespoke alarming and metrics code from our applications and services.

Next was disaster recovery. We saw earlier that when a cron job failed, oftentimes the on-call couldn't really do anything about it. So one thing we created was a "run cron" button, which essentially allows you to run a cron ad hoc. This is useful for recovering from failures, as well as for testing out new cron jobs and debugging existing ones: instead of having to wait around for the cron to run, you can just hit the button and immediately see the effects of your cron running and your application code doing its work.

Finally, we fixed the longstanding "too many missed starts" problem in our Kubernetes fork, so on-calls never have to deal with stuck crons again. Once again, this was a problem we'd seen a lot of times at Lyft, where after a cron has failed to start a hundred times in a row, Kubernetes stops trying to run it entirely. By fixing this in our fork, we never again found ourselves manually re-invoking crons to get them started.

So now you're probably thinking to yourself: that seems like a lot of work to get Kubernetes CronJobs to a usable state. What does this all mean for me? Should I still try to use them? Are they the only option? There's a lot of nuance to picking a technology like a distributed scheduled-task runner. Here are some things to consider before making the same decisions we did at Lyft. First, ask yourself: what do your users really need?
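To ground the trace-point idea from a moment ago, here's a rough sketch of how lifecycle timestamps turn into the start-delay, runtime, and success metrics described above. The names (`CronRun`, `emit`) are hypothetical stand-ins, not Lyft's actual code, and `emit` is a placeholder for a real metrics client.

```python
# Rough sketch of lifecycle trace-point metrics. Names are hypothetical.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class CronRun:
    scheduled: datetime   # when the schedule said it should run
    started: datetime     # when application code actually began
    finished: datetime    # when application code exited
    exit_code: int

def start_delay(run: CronRun) -> timedelta:
    """How late the cron started relative to its schedule."""
    return run.started - run.scheduled

def runtime(run: CronRun) -> timedelta:
    """Wall-clock time the application container ran."""
    return run.finished - run.started

def emit(metric: str, value) -> None:
    # Stand-in for a real metrics client (statsd, Prometheus, ...).
    print(f"{metric}={value}")

run = CronRun(
    scheduled=datetime(2021, 1, 1, 0, 0, 0),
    started=datetime(2021, 1, 1, 0, 0, 42),
    finished=datetime(2021, 1, 1, 0, 5, 42),
    exit_code=0,
)
emit("cron.start_delay_seconds", start_delay(run).total_seconds())  # 42.0
emit("cron.runtime_seconds", runtime(run).total_seconds())          # 300.0
emit("cron.success", int(run.exit_code == 0))                       # 1
```

The point is that once the platform records these timestamps at each trace point, every cron job gets the same metrics and alerts for free, with no per-application code.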
The primary feature you should have figured out for engineers using your platform is observability. Based on our experience talking to devs at Lyft, developers mostly want to know: did it run, did it run successfully, and how long did it take? At Lyft we built these metrics and alerts into our cron platform, so all cron jobs get them by default. These monitoring tools are essential for allowing engineers to self-serve and operate their workloads, no matter the platform. They let people have high confidence in the reliability of their crons without having to know Kubernetes as intimately as platform engineers do, and they really help scale the operational side of a large engineering org by making it easier for people to debug issues on their own.

Next, you need to be able to run a cron ad hoc, to debug or to recover from incidents. No one wants to sit around waiting to catch a cron in the act, and especially in today's microservices world, failures can happen, so we must be able to fail gracefully and make it easy to recover from failure. Because of this, it's essential that your cron platform has an ad-hoc invocation tool. That way, say a downstream service is down and causes your cron job to fail: at least the on-call is alerted, using the alerts we just mentioned, and can rerun the job without having to wait hours or even days for the machinery to invoke the cron again. This is absolutely essential for, say, once-a-week cron jobs that do things like generate reports for your company. An added benefit of a tool like this is that it allows engineers to develop new cron jobs and observe them running live in your staging and prod environments. As much as we try to make configuration simple, inevitably there will be some trial and error involved when deploying new code for the first time, and there's nothing worse than having to wait to see your new code take effect. So having a tool that can trigger new code and let you watch it work for the first time is really helpful for developing and debugging new crons.

So those are the user-facing features we focused on delivering with our cron platform built on Kubernetes CronJobs at Lyft. But Kubernetes CronJob may not be the right solution for you, depending on what your environment is like. If you're starting from scratch, how should you approach evaluating cron solutions for your company?

When evaluating cron platforms, the first thing to think about is the user experience. Devs really just want to write and ship code and be confident that if there's an issue, they'll be notified, and that when notified they can easily figure out what went wrong and recover. So the very first thing you should do when picking a cron platform is talk to your users: figure out what their workflows are like and what they need out of a cron environment, because what users want might differ a lot from what you might want as a platform engineer.

But as a platform engineer, what should that experience be like? The major thing platform engineers are concerned about is the performance of the platform. How easy is it to monitor the reliability of jobs on your cron solution? You want to look at things like failure rate, the time it takes to start running application code, and how well it scales as more and more cron jobs are added to your environment. These are all things to think about: are they built into your cron platform, or do you have to add something in order to have those metrics and be alerted on them?

Features aside, there are many hidden costs you might not think about when adopting a new tool like Kubernetes CronJobs. How much do you expect your devs to know about Kubernetes, and how much effort is it to train them? How much are you willing to invest in smoothing out some of the rough edges of Kubernetes?
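As one illustrative sketch of what monitoring the platform's performance can look like (the data shapes here are made up, not from a specific product), per-run results can be aggregated into the platform-level signals just mentioned, such as failure rate and a start-delay percentile:

```python
# Illustrative aggregation of per-run results into platform-level health
# signals (failure rate, start-delay percentile). Data shapes are made up.

def failure_rate(exit_codes: list) -> float:
    """Fraction of runs that exited non-zero."""
    return sum(1 for c in exit_codes if c != 0) / len(exit_codes)

def percentile(values: list, p: float) -> float:
    """Nearest-rank percentile, p in [0, 100]."""
    ordered = sorted(values)
    rank = max(0, min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1))))
    return ordered[rank]

# Pretend these were collected from the last ten runs across the platform.
exit_codes = [0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
start_delays = [1.2, 0.8, 35.0, 2.1, 1.0, 0.9, 41.5, 1.1, 1.3, 2.0]

print("failure rate:", failure_rate(exit_codes))         # 0.2
print("p90 start delay (s):", percentile(start_delays, 90))
```

Signals like these are what let the platform team alert on degradation themselves, instead of hearing about it from user tickets.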
For example, at Lyft we already had a lot of infrastructure and tooling in place for things like observability and reliability, so we only needed to plug some modifications into them in order to reap the benefits. And lastly, along the way there will definitely be a lot of incidents. How good is your learning culture, and how resilient is the business to the incidents that will inevitably happen?

Hopefully the picture is starting to get a bit clearer. As engineers, we can often get pulled deep into the nitty-gritty of technical complexity. Evaluating tools and technologies from a technical perspective is certainly valuable, but more often than not, the challenging part is bridging the gap between humans and technology. At the end of the day, there will be other human beings using the tools you build and offer, so think a lot about how humans interact with your systems.

To conclude: not all hope is lost. Recently a KEP was merged for graduating CronJobs to GA, and it's really exciting to see some progress being made in this area. I especially like how, this time, there's a lot more concern around observability and the performance scaling of cron jobs. So kudos to the Kubernetes team, and I'm definitely looking forward to playing around with the new CronJob API when it comes out.

But all things Kubernetes aside: talk to your users. Have real conversations with engineers at your company and ask them what they really need from your infrastructure. And lastly, Kubernetes is no silver bullet, but don't be afraid to get your hands dirty anyway and try to bring it closer to something that's usable in the real world.

And that's it. Thanks for taking the time to listen to my talk on Kubernetes CronJobs at Lyft. If you're interested in working on challenging problems related to running Kubernetes at scale, feel free to send me an email, and meanwhile check out our careers page at lyft.com/careers for current openings. Thanks!