I'm a platform engineer at Shopify, and today my talk is about running jobs at scale. Before I begin, I just wanted to say thanks to all the Garuko organizers. Let's give them a quick round of applause. I've heard of Garuko as a really great Ruby conference, and I've wanted to come and speak here since 2015. I applied that year, and I applied the next year too. In 2017 it didn't work out, but I was so happy to get that email from Joe this year. I also heard this is the last Garuko, so it feels very special to be here.

So, my talk: background jobs. Many of you here are Rails developers, and you've probably worked with libraries like ActiveJob and Sidekiq that let you define units of work to execute in the background. Those are usually units of work that you don't want your users to wait on during the web request: asynchronous things like sending emails and notifications, or exports and imports, longer-running things that you want to happen asynchronously. The definition of these jobs usually looks like this: there is a Ruby class of some name, and there is a perform method, which is the entry point that defines the logic the job does.

Let's jump to a more realistic example. In this job, we iterate over all products, all records in the database, and call some methods on them. In this example, it's sync and refresh: maybe sync all the products in your database with some other store, reconcile the data, and refresh the records. It's a very common pattern from what I've seen in jobs. And this works fairly well, especially when you have just a few records in the database. With hundreds of records, the job completes in a few seconds. Get to thousands of records, and it takes minutes. And when you get to millions of records, the jobs take days or even weeks. And here we come to the problem of long-running jobs.

Let me explain why long-running jobs are sometimes problematic. When you deploy a new revision of code and you want to roll it out, the idea usually is that you shut down the workers of the old revision, the processes running the old revision of your Rails app, and you start the same workers on the new revision, so that the new code gets to production. But think about how you do that if you have a job that still has two or three hours to run. You have to do something with it. One approach would be to wait for all the workers to complete their jobs, but then the rollout would take hours or even days if you have some really long jobs. So the approach that many libraries, like Sidekiq, take is to abort the job and push it back to the queue, so that it will be retried in the future by some other worker on the new revision. But then the existing progress gets lost.

And this gets even worse when you have frequent deploys. Here is a simple illustration with a timeline: some job starts running, then a deploy comes and you abort the job; the job starts again after the deploy, and it's aborted again. At Shopify we have so many developers that we typically deploy every 20 minutes during working hours. So during those working hours, no job that takes longer than 20 minutes would be able to succeed. Maybe at night or on the weekend there would be a window with no deploys, and the job might eventually complete. But the experience there wasn't that nice.
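To make the pattern concrete, here is a minimal sketch of the kind of job described above. The class name and the sync/refresh methods are illustrative names taken from the example in the talk, not actual Shopify code.

```ruby
# A minimal sketch of the long-running job pattern described above;
# ProductSyncJob and the sync!/refresh! methods are illustrative names.
class ProductSyncJob < ActiveJob::Base
  def perform
    # One long-running unit of work over the whole table. If the worker
    # is shut down mid-way (a deploy, an unhealthy cloud instance), the
    # job is retried from the very beginning and all progress is lost.
    Product.find_each do |product|
      product.sync!
      product.refresh!
    end
  end
end
```

Note that find_each batches the queries, but the job itself is still a single unit of work, which is exactly the problem the rest of the talk addresses.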
The next problem is capacity and worker starvation. As you get more long-running jobs, there is a higher probability that too many workers will be busy with those long-running jobs. And if you have some higher-priority job to process, something like a job that processes a payment, which is important to execute earlier rather than later, all the workers would be busy with long-running jobs. That becomes a problem because your customer has to wait for that payment processing or a checkout.

Long-running jobs are also trickier in cloud environments, because in those environments hardware is less predictable. Google or AWS may give you a notice that an instance will be shut down in a few minutes because it's not too healthy, and your application code, your logic, must be ready to handle those interruptions that come from cloud environments.

For us at Shopify, this started to become a very pressing problem, because we were getting too many long-running jobs. We could even find workers running jobs that had been going for weeks. At the same time, we were moving to the cloud, and we had to do something about this. So we started researching why we had so many jobs that take this long, and what we found is that it mostly happens because those jobs iterate over a long collection. For instance, Shopify is a commerce platform, so we have merchants on our platform, and we had jobs that iterate over all the products of every merchant. For a smaller merchant with fewer products, the job would complete quickly; for an enterprise merchant with millions of products, the job would take forever.

So we started thinking: what if jobs were interruptible and resumable? What if we could abort them on deploys, but somehow save the progress and then start the job later, from exactly the same point where it was stopped? We came to the idea of splitting the job definition into two parts: the collection to process, which can be small or can be millions of records in the database, and the work to be done on each record. In our previous example, the collection to process would be Product.all, all the records in the database, and the work to be done would be a method call on each product object.

This is how it started to look. We would include a module that provides this iteration feature. In this very simplified version, instead of having one perform method that does everything, we would have a method that defines the collection and a method that is called on every record in that collection, roughly the shape sketched below. By giving the job a bit more structure, we unlocked interruption and resumability: if we think of an ActiveRecord relation, or any collection of objects, as an enumerable collection, then we can keep a cursor, iterate, and persist the cursor across job interruptions, and eventually the iteration reaches the end of the collection and the job finishes. And this was not just for ActiveRecord relations. In fact, we could build on any enumerator, even custom ones; a CSV file could also be an enumerator.

When we started introducing this, we hadn't realized what kind of possibilities it would bring. For instance, we got progress tracking for free. We could also parallelize computations, because those units of work were smaller and well described. And we could start throttling those jobs and their iterations automatically, based on the load on the database.
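To make the two-part definition concrete, here is a sketch of what that structure can look like. The module and method names below are assumptions modeled on the description in the talk, not necessarily the exact API of the library Shopify built.

```ruby
# A sketch of the two-part job definition: a collection to process and
# the work to do on each record. JobIteration::Iteration and the method
# names are assumptions based on the talk, not a confirmed public API.
class ProductSyncJob < ActiveJob::Base
  include JobIteration::Iteration

  # Part 1: the collection. `cursor` is whatever position was persisted
  # when the job was last interrupted, or nil on the first run.
  def build_enumerator(cursor:)
    enumerator_builder.active_record_on_records(Product.all, cursor: cursor)
  end

  # Part 2: the work for a single record. Between calls, the framework
  # can checkpoint the cursor and safely interrupt the job on a deploy.
  def each_iteration(product)
    product.sync!
    product.refresh!
  end
end
```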
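And since the iteration is built on enumerators, a custom source such as a CSV file can plug into the same machinery. Here is one hypothetical way to write such an enumerator, using the row index as the cursor; the helper name and the [row, index] element shape are illustrative, not from the talk.

```ruby
require "csv"

# A hypothetical custom enumerator over a CSV file: each element is a
# [row, index] pair, and the row index is the cursor to persist.
def csv_row_enumerator(path, cursor: nil)
  first_unprocessed = cursor ? cursor + 1 : 0
  Enumerator.new do |yielder|
    CSV.foreach(path).each_with_index do |row, index|
      next if index < first_unprocessed # skip rows done before the interruption
      yielder.yield(row, index)         # index gets persisted as the cursor
    end
  end
end
```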
And for people who are responsible for the uptime of the infrastructure, like my team, this allowed us to make scale invisible to developers, even if the collection they want to iterate over has millions of records. It unlocked success in the cloud runtime and even gave us the opportunity to save money with short-lived cloud instances, which are cheaper but can disappear at any point, because now all our units of work were interruptible and the progress was saved. We're going to open source this very soon, and I'm also looking forward to chatting with any of you who have been solving problems related to background jobs; that's something my team works on. Thank you all very much.