Alrighty, so I'm Brendan Heywood from Catalyst IT. This is going to be a high-level talk about managing cron well, and particularly managing cron at scale. I've been working with Central Queensland University in Rockhampton for a number of years, and I'm going to take three examples from their higher-education Moodle. I'm choosing these three examples because they've got different load characteristics. The first one is the link-crawling robot: it's like Google, it goes and scrapes your pages, looks for broken, big, slow links, and reports on them. The second one is an assignment extension tool, where a student can say they're sick and want an extension, they upload a PDF, and there's a whole workflow process where it might get escalated up to a unit coordinator, and then a couple of days later, if it hasn't been actioned, it goes up to a faculty admin. And then there's the core forums sending emails.

But first things first: is your Moodle cron actually running well? Is it running at all? Moodle cron, especially if you've got lots of third-party plugins, can be broken in a whole bunch of ways, and you really should be monitoring it. Unfortunately there's no core tooling to monitor the status of health, so we wrote one: the heartbeat plugin. We wrote this a long time ago, and it's actually got deep compatibility back well before 2.7. It surfaces a whole bunch of problems which you can then go and diagnose more properly. I don't actually want this in core, though. I'd like to see a new API in core where each plugin can declare its own health checks; so, for instance, an LDAP auth plugin could say "can I bind?" or whatever, and then Moodle aggregates it all into an overall status of health.

Alrighty. Now that we know things might be broken, and you might have gotten an alert or something, we need to know what's actually going on under the hood. The first question you want to ask is: is it actually broken and not finishing, or is it just taking forever?
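The health-check API described here doesn't exist in core; a minimal sketch of the aggregation idea, in Python rather than Moodle's PHP, with all names invented for illustration:

```python
from dataclasses import dataclass
from enum import IntEnum

class Status(IntEnum):
    OK = 0
    WARNING = 1
    CRITICAL = 2

@dataclass
class CheckResult:
    plugin: str       # e.g. "auth_ldap"
    status: Status
    detail: str = ""  # e.g. "bind to ldap.example.edu failed"

def overall_health(results):
    """Aggregate each plugin's self-declared checks into one status:
    the overall health is simply the worst individual result."""
    if not results:
        return Status.OK
    return Status(max(r.status for r in results))
```

Each plugin would return its own list of `CheckResult`s (can I bind, can I reach my queue, and so on), and the aggregator just takes the worst.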
You can have some cron tasks which just absolutely take forever; they just grind. The traditional way you figure this out: you might go and find out what's locked, then find a process and work out if it's actually still running, or you might trawl through the cron logs. The cron logs can be a little misleading sometimes, especially when you're aggregating them across lots of different processes: they all get mashed together, and they're not structured like an Apache error log, so they're a little hard to parse. We wanted to make this really easy, so we made another tool, the lock statistics tool. It looks like this. It shows you what's running right now: we've got a task that's been running for seven minutes, and you can see it's kicked off a bunch of other stuff, so there are five things running in parallel. You don't want to record stats on everything, so you set a threshold: it compresses statistics for everything below it, but keeps a bit of detail on the more interesting long-running stuff.

I'll just point out the forum task at the bottom here. It took 15 minutes, and that's a bit slow, but in the grand scheme of things that's actually pretty good. At the start of this year, CQU experienced a big peak load at the start of semester, before the optimisation process that I'll talk about, and this task was taking four hours. It occasionally blew out to 10, and sometimes it was actually running for days; it was just absolutely choking. And we send a ton of email, like an awful lot. So the next question is: how do you then dig into it and figure out what's going on? Well, you want to profile it. The key thing we wanted to know was the time an email should have been sent versus when it actually was sent. One of the curious things we found is that emails are processed in order, so if you've got an email that should be sent now, you have to wait for the next cron process to tick off.
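The thresholding idea can be sketched like this; a toy Python version, not the real plugin's schema, with invented field names:

```python
from collections import defaultdict

class LockStats:
    """Record task run times. Runs shorter than `threshold` seconds are
    only counted in aggregate; longer ones each keep a detailed row."""

    def __init__(self, threshold=60):
        self.threshold = threshold
        self.summary = defaultdict(lambda: {"runs": 0, "total": 0.0})
        self.detail = []  # one row per interesting long-running task

    def record(self, task, duration):
        agg = self.summary[task]
        agg["runs"] += 1
        agg["total"] += duration
        if duration >= self.threshold:
            self.detail.append((task, duration))
```

Short, boring runs collapse into a counter and a total, so the detailed history only grows for the tasks worth investigating.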
So if cron takes four hours, the delay could be anywhere between zero and four hours. And if your surname starts with A, you'll get processed then, but if you're a Z, you've got to wait that zero to four hours and then at least another four hours on top. So it's not just a performance issue; it was almost an equity issue.

So yeah, profiling: you want to figure out what's going on and what's slow. I'm not going to go into this in too much detail, because you could do a whole session on it, but there's a built-in tool, XHProf (Tideways), which is great for isolating the slowest thing so you can then go and fix it. I'll just briefly mention a couple of things that we fixed; they're not terribly important. There was a Mustache template being rendered inside a loop. There was a capability check that was a little bit slow. And we optimised the way we actually queued up emails: we tested sendmail versus Postfix in a couple of different settings, and how we queued with the upstream relays. You don't need to worry about the details; it's more about the process of going through and shipping them off. There's generally not one thing that's the silver bullet; it's always the small incremental things.

So we sped this up. An interesting thing about the forum task is that it actually has slightly worse than linear performance: as you throw more work at it, it slows down. And what we experienced was that it got past that tipping point into a congested state and couldn't recover, so it was actually congested for a lot longer than it should have been after the load went back down. We definitely sped it up a fair bit, but I was unsatisfied. I wanted more: orders of magnitude more performance, not just shaving a percent off here or there.

Another thing about long-running tasks that people don't think about is outages.
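That congestion behaviour, a backlog that lingers long after the peak has passed, can be illustrated with a toy queue model; this is pure illustration, not the forum code, and it assumes a simple linear service rate:

```python
def simulate_backlog(arrivals, service_rate):
    """Toy queue: each tick `arrivals[t]` emails arrive and at most
    `service_rate` are sent; returns the backlog after every tick.
    (The real forum task is worse than linear, so it degrades faster.)"""
    backlog, history = 0, []
    for arriving in arrivals:
        backlog = max(0, backlog + arriving - service_rate)
        history.append(backlog)
    return history
```

With a three-tick peak at double the service rate, the backlog keeps the queue congested for many ticks after load returns to normal; make the service rate itself drop as the backlog grows and it may never drain.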
If you've got a long task, it could be running for four hours as you're approaching an outage, and one of three things is going to happen. Either it's going to overlap the outage: you could be changing the schema, you could mess up your database, bad things could happen, you just don't know. Or you could wait for cron to finish and push back your outage, which means you no longer have a deterministic, predictable outage window, and we want to keep our outage windows really short, if we have them at all. So the only sane thing is to turn cron off early, let it drain out, and then do the upgrade. And none of these three situations is good.

So when we architected the link-checking tool, we wanted to avoid this entirely. Now the link-checking tool is like Google: in CQU production it takes nine or 10 days to run. We obviously can't run that in a single process; that would just be insane. And that's actually a very small subset: it's about half a million links and about 100,000 pages. So what we do is break it up into lots of little short tasks, and this is configurable; we can change it on the fly. We could say do a whole chunk for an hour, but then as we approach an outage we can dial it back to one-minute sessions, and then we can just turn it off, it drains quickly, and afterwards we can get straight back into it. There are a couple of tracker issues here about speeding up fast shutdown, so that cron doesn't pick up another task: if you turn cron off, it shuts down as soon as it possibly can.

Another thing: if you've got a big computational migration as part of an upgrade, you might want to defer it until just after the upgrade, so that the actual outage window is as short as possible. One way to do this is with an ad hoc task. Ad hoc tasks are really cool; they're a very underutilised feature of the task system in Moodle.
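The time-budgeted chunking idea sketched in Python; the real tool is PHP, and the queue and function names here are invented:

```python
import time

def crawl_session(link_queue, check_link, budget_seconds):
    """Run one short crawling session: process links until the time
    budget is spent, leaving the rest for the next cron run. As an
    outage approaches, the budget can be dialled down on the fly."""
    deadline = time.monotonic() + budget_seconds
    processed = 0
    while link_queue and time.monotonic() < deadline:
        check_link(link_queue.pop(0))
        processed += 1
    return processed
```

Because each session checks the deadline between links, dropping the budget from an hour to one minute means the task drains within about a minute of the change, and cron can be switched off safely.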
You can schedule a task to run right now, which is sort of a poor man's thread, or you can schedule them for some future point in time. So you could create an ad hoc task and say run it as soon as possible: cron's not going to run during the outage, but as soon as cron's back it runs your task and you get on with it. That's good if you're happy with eventual consistency.

Alrighty, so: scaling. It's just a matter of inserting more money, isn't it? Surely. The main things to consider with cron are where cron runs and how many cron processes you run. As installations grow, there are two main architectures you see. People have, say, three front ends and run cron on all of them. But eventually the best way is to have a dedicated cron box: cron processes can consume a lot of resources, and they have a different load profile to your front ends, so you want to isolate them and not adversely affect your web front ends. For the same reason, we might route all of the API requests to a dedicated front end which is tuned differently, because we don't want to upset normal usage.

The other thing about ad hoc tasks, which is actually kind of cool, is that they're processed in parallel. If you queue up 10 ad hoc tasks and you've got a couple of different cron processes running, they'll pick them up as soon as they can, and you can run as many as you want in parallel. So how many cron processes do you actually want? (Excuse me, these slides are actually old, so sorry about that.) If you go back to that lock stats tool, you might see that in your system you've got a couple of custom tasks, and they might all be taking 10 minutes.
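The parallel pickup of ad hoc tasks works roughly like this; a Python sketch using threads to stand in for what are separate PHP CLI cron processes in real Moodle:

```python
import queue
import threading

def run_adhoc_tasks(tasks, workers=3):
    """Queue up ad hoc tasks and let several 'cron processes' (threads
    here) claim and drain them in parallel; returns collected results."""
    pending = queue.Queue()
    for task in tasks:
        pending.put(task)
    results, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                task = pending.get_nowait()  # claim the next free task
            except queue.Empty:
                return                        # nothing left: exit
            outcome = task()
            with lock:
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Ten queued tasks with three workers run three at a time; add more workers and throughput scales until you run out of tasks or hardware.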
If you fire up a single cron process, it gets locked running one of those 10-minute tasks; then the process that starts the next minute picks up another one, and so on. So they will all end up running in parallel, but there's this build-up time before they do. A good rule of thumb is: if you've got, say, five really heavy tasks that run for more than a minute, then you want five cron processes, plus an extra one to take care of all of the short-lived ones.

And finally, there's just one more thing, a task sharding API that I've been thinking about, which is a middle ground. The ad hoc task API is good for lots of little small things, but for something like the forum task, completely rewriting it to use ad hoc tasks might not be the best architecture. Instead, you take a single task that might take, say, 10 hours, and you just cut it in half, so you've got two cron processes working through it in parallel. And if you're not happy with that, you can cut it again: you can cut it arbitrarily many times and run four or eight or 16 or whatever in parallel, and scale up as much as you want. So that's me, thank you very much.
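The proposed sharding API isn't in Moodle; a minimal Python sketch of the idea, assuming the task's work is keyed by something like a user id:

```python
def shard(user_ids, shard_count, shard_id):
    """Deterministically split one big task's work into shards so that
    `shard_count` cron processes can each take one slice in parallel.
    Doubling shard_count halves each process's share of the work."""
    return [uid for uid in user_ids if uid % shard_count == shard_id]
```

Every shard is disjoint and together they cover the whole workload, so a 10-hour task cut into two shards runs in roughly five hours on two cron processes, and you can keep cutting: four, eight, 16 shards as needed.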