Okay, yeah. Welcome. So I'm just going to go over some of the experience that I had scaling Saylor Academy's installation of Moodle from basically the beginning up to over a million students now.

Okay, so the first question is probably: who am I? I'm John Aznieta. I'm the former director of technology at Saylor Academy. I worked there for 10 years and I led the technology operations for seven years.

Next question: what is Saylor Academy? No one seems to know. We're a nonprofit that was working to offer free and open online courses to everyone who wanted to learn, both in the US and around the world. The model is a little different from most places because it was open enrollment. Anyone can enroll, and they can enroll at any time. The courses were always open. We had no semesters, no terms. They didn't close whatsoever. It was totally self-paced. The student could go through as quickly or as slowly as they needed. There was no one telling them they needed to do anything by the end of the week, nothing like that. We also didn't have any teachers. I mean, the courses were developed by different professors who would all contribute, but there was no one professor who would make their course, put it up, and then go along and help all the students. We also had a global audience. The initial focus, like I said, was on the US, but then it expanded to a totally global audience. Also a little different from other nonprofits is that we had a sole trustee. That was great because we didn't have to worry about fundraising, but it also meant we had one source of income.

So basically the bottom line is it's kind of a unique use case, a unique workload. I think most LMSs, most systems, aren't developed with that kind of model in mind, even Moodle, but I think that just speaks to the extensibility of Moodle. So this is a map of the students. You can see we had students, at least self-reported, almost everywhere around the world.
The majority were in the US and now in India.

So, the catch. I feel like at every conference I come to, every MoodleMoot, there's always a single scaling presentation along the lines of "we have this many students," there's a big number, and it turns out it's multiple instances, or they had single semesters and they take students out, that kind of thing. So really it's about the context. Our context is that we recently had 1,371,000 students, and it's a single instance with a single database. But that's students over the last 10 years, because basically they're never removed. They can log in at any time that they want.

To give you some context about enrollments: the top enrolled course is ESL01, which had over 173,000 enrollments. So it's a big database table. Just some more context: we'd have about 15,000 to 20,000 sessions per day, 104 course completions per day, and generally between 100 and 400 concurrent users. So it was basically a really large student count and a lot of enrollments, with what I would call a moderate traffic load. I know there are installations that definitely have a lot more concurrent users, and I would actually love to talk to people who have more concurrent users than that about how they did their stuff.

So, the infrastructure. When we first migrated to this Moodle instance around 2014, 2015, the courses were originally on WordPress, with a separate e-portfolio system where people could keep track of their progress on their own, which we then had to sync up. But this was just a basic, highly available setup. We had three different Moodle servers in three different availability zones on AWS, an RDS instance, and then the Moodle data, which was stored on a shared network drive based on GlusterFS. Again, it was sharded across three different AZs. Yeah, and it worked. Discourse is a separate forum software. At the time the Moodle forum wasn't as social as we wanted, so we wanted to try out something different.
So we pulled that out and had the students chatting in Discourse.

So the first problem that we had is that I had to schedule long downtimes. They would often be two hours, and I would run up to that sometimes. Basically the upgrade process could take a long time. I would run my upgrade scripts, pull in the new plugins, and there would be a version issue. So here's what I did: I implemented a build system and a deploy system using Jenkins. All our plugins were listed in code, with the branches that they needed, so it was all version controlled. It would run a test, copy the production database, so real student data in a controlled environment, pull in the latest updates that we wanted, run a test, and if everything passed, it made a build artifact and put it on GitHub. Then I had a separate job that would deploy it. That actually saved a lot of time, and it brought our actual downtime to less than two minutes for updates. At that point I didn't have to schedule any downtime, actually. I would just run it, and it would be like 30 seconds. Because it was already tested, I knew it would work and I knew how long it would take. It was built; we just deployed.

So that brings me to the recommendations. The first one is: use a CI/CD system. DevOps has been a thing, it's cool. And this is what the infrastructure looks like after that. You can see we just added another service on the side, the Jenkins build server.

The next problem that we had is: why is logging in so slow? After a year or two, we noticed that there was some general slowness sometimes, especially during session creation, and we dug in a little bit. It turns out I had session handling done by the default drivers. So it was done through the Moodle data directory, so in that GlusterFS instance. Later I switched to the DB one, but these are very slow drivers. So we switched to Memcached and then later Redis. And that sped up logging in and sped up sessions tremendously.
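To make the session point concrete: a key-value session store does one keyed lookup per request and lets entries expire on their own, rather than reading files from a shared network drive. The following is only a minimal Python sketch of that idea, with an in-memory dict standing in for Redis. The class and method names are hypothetical and are not Moodle's session API.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class SessionStore:
    """Toy key-value session store with per-entry expiry (a stand-in for Redis)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        # The clock is injectable so expiry can be exercised without sleeping.
        self._clock = clock
        self._data: Dict[str, Tuple[float, Any]] = {}

    def put(self, session_id: str, payload: Any, ttl_seconds: float) -> None:
        """Store the payload together with the time at which it expires."""
        self._data[session_id] = (self._clock() + ttl_seconds, payload)

    def get(self, session_id: str) -> Optional[Any]:
        """Return the payload, or None if the session is unknown or expired."""
        entry = self._data.get(session_id)
        if entry is None:
            return None
        expires_at, payload = entry
        if self._clock() >= expires_at:
            # Expired sessions are dropped lazily on access.
            del self._data[session_id]
            return None
        return payload
```

Each read is a single dictionary lookup by session id, which is the property that makes a cache-backed session handler quick compared with a file on a network mount.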
So I'd recommend, number two, use caching. You can set up separate app caches and separate session caches. It really speeds up the Moodle system. And this is what the infrastructure looked like a little bit later when we went to Redis. We had a separate session cache that could be a little bit smaller, and then the app cache for caching the application files and things like that.

Okay. So, you know, we were working along and we had another day and another performance degradation. The problem is that we had some reports of slowdowns. We had some CPU spiking on multiple machines, and they would get taken off the load balancer. This would happen at regular intervals, and I was looking into what was going on. It turned out that the reengagement plugin was taking so long to complete, because we had so many students at that point, that the next scheduled run, where it would look over who had completed and needed to be sent a message, was starting before the previous one had finished. And it would just kind of spiral. It just kind of took over the whole server at that point.

Just a quick aside: what is the reengagement plugin? It was our initial foray into student interventions. It lets you remind students to complete different activities. We were using it to tell students who had failed a final exam that they could take it two more times, and then give them a little more information about what they could do, and maybe some next steps to practice so they could pass the next time.

So, the solution. First, as triage, I increased the time interval between the scheduled tasks so that each run could finish. Unfortunately that meant students wouldn't get an intervention email as frequently, but it still wasn't that bad. Later, though, I did have to remove the plugin entirely, just because we had too many students for it to handle in the background.
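The spiral described here is simple arithmetic: once a run takes longer than the interval between runs, copies of the task pile up. Below is a small Python sketch of that arithmetic, plus one common guard (skip a run while the previous one is still going). The guard is an illustration only; the fix described in the talk was lengthening the interval. The function names and the numbers in the usage note are hypothetical.

```python
import math
import threading


def overlapping_runs(run_minutes: float, interval_minutes: float) -> int:
    """Peak number of copies of a task alive at once.

    Assumes every run takes the same time; in practice overlapping runs
    compete for CPU, so real run times get longer and the pile-up is worse.
    """
    return max(1, math.ceil(run_minutes / interval_minutes))


_task_lock = threading.Lock()


def run_if_idle(task) -> bool:
    """Run the task only if no other run holds the lock; report whether it ran."""
    if not _task_lock.acquire(blocking=False):
        return False  # A previous run is still in progress, so skip this one.
    try:
        task()
        return True
    finally:
        _task_lock.release()
```

For example, a task that needs 25 minutes on a 10-minute schedule gives `overlapping_runs(25, 10) == 3`, while stretching the interval to 30 minutes brings it back to 1.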
But that brings me to the third recommendation: check your plugins. Test them out on something that has the same number of students and the same number of enrollments. Make sure a plugin isn't doing any sort of for loop that goes through all the students, all the users. I see that happen in some plugins, and for us, we couldn't use them at that point.

And this doesn't just apply to plugins. It also applies to some Moodle features that you may want to disable or not rely on. So one time, I had set up this API called the Program Partner API for some of our partners that sent students to us, so they could get information about their student group: who's completing, what their progress was, that kind of thing. I was working with them one time, I went over to meet with them, I was hanging out with the engineers, and they said, "Hey, this one call is taking a long time to complete. Do you know what's going on?" I did some research, and, the power of Moodle being open source, I could check the code and see what was happening. Basically they used a wildcard at the end to match up emails, and this one function would return all the users, go through all the users, do a permission check on each, and then send the information back, which was taking a long, long time. So I rewrote the function. Well, first I told them not to do that, and they weren't really using it anyway, so it was okay. I put a note in the documentation, and then I rewrote the function to check only the cohorts that they had access to first and then do a permission check, which brought the time down substantially. So yeah, check the features too, it's not just plugins.

So you might be thinking, what about the student interventions? You took out the reengagement plugin, what did you do? We actually instituted n8n, which is a free, open-ish, self-hosted, fair-code licensed workflow automation tool, similar to Airflow, that kind of thing.
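As an aside on the API fix mentioned a moment ago, the shape of the change can be sketched in Python: narrow the candidates to the caller's cohorts first, then run the expensive permission check only on those. This is an illustrative sketch, not the actual Moodle function. The `User` type, both function names, and the prefix match standing in for the trailing wildcard are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class User:
    id: int
    email: str


def search_all_users(users: Iterable[User], email_prefix: str,
                     can_view: Callable[[User], bool]) -> List[User]:
    """Original shape: match against every user, permission-check every hit."""
    return [u for u in users
            if u.email.startswith(email_prefix) and can_view(u)]


def search_in_cohorts(cohorts: Dict[str, List[User]], allowed: Iterable[str],
                      email_prefix: str,
                      can_view: Callable[[User], bool]) -> List[User]:
    """Rewritten shape: restrict to the caller's cohorts, then check permissions."""
    # Deduplicate by id in case a user belongs to more than one allowed cohort.
    candidates = {u.id: u for name in allowed for u in cohorts.get(name, [])}
    return [u for u in candidates.values()
            if u.email.startswith(email_prefix) and can_view(u)]
```

With a broad pattern, the first version runs the permission check once per user on the site, while the second runs it once per member of the partner's cohorts, which is the difference that matters at over a million users.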
We brought in n8n so that we could consolidate the student interventions and messaging into a single place, rather than having a course completion email going out through Moodle, or a failure email going out through the reengagement plugin, that kind of thing. They're all centralized. It also allowed us to have greater control over messages, more granular logic, and personalization. One thing that we did was include in the emails how many times they had taken the exam, and so how many attempts they had left. We could tell them exactly when they could retake the exam, because there was a two-week period where they couldn't; they had to wait. We found a lot of students would come back after 13 days or so, or almost 14, like an hour before it would reopen for them, and get a little confused. So we were able to do that by pulling it out into a separate system. I think another benefit of this method is that it reduces load on the learning environment, which is really what I was trying to do the whole time with Saylor: just trying to make it so that the learning environment itself was performant for the students.

And that brings me to the fourth recommendation, which is: pull out what you can. There may be other tools that are more suited to what you're trying to do for a feature. Definitely not to knock Moodle, it does a lot of stuff and does a lot of things well, but if you have a lot of students like we did, sometimes you might want more control, more customization.

This doesn't just apply to Moodle functions or features, either. We started to do this with resources and courses too. So I would recommend thinking about putting large resources into a CDN, because the shared network drive is where you upload course files to. If you have a course video or images, that's going into that Moodle data folder, and shared network drives are just slow. They're generally slow, especially for videos.
So that's a pretty big performance cost if someone's trying to watch a video from that. I think we have some examples. We did one project where we migrated our course videos from YouTube and put them into an S3 bucket behind CloudFront. Again, that's another service that's designed for caching stuff and bringing large files to students at the edge, and that will really help performance too. I also have another tip, because we also use the Moodle mobile app. It's not going to cache links to YouTube videos, but it's not just files you upload into courses that get cached: if you have included a link to an MP4 file, it's going to cache that as well, which we found really helpful.

This also applies to infrastructure. We had more traffic and more background processes lately, which was causing slowdowns sometimes and causing some resource usage. We had also implemented a new antivirus which used more RAM, so I had to bump up the instance sizes. And when that would spike and you had these scheduled tasks going through a lot of users, it would just slow down performance a bit, until the server maybe got pulled out of the load balancer, maybe not. So what I did was pull the Moodle cron out from each individual web server. I made a different administration tier. That way the cron job is being run on these other machines that students aren't connecting to, and the students don't have to have a performance issue. The cron job can spike the CPU, spike the RAM, do whatever it's doing. It doesn't matter, because the student's not on that server.

So this is what the infrastructure looks like at that point. We have the web tier that we originally had, and the new admin tier, where we could also do background tasks that take more resources, like course backups, that kind of thing. Then your session caching, application caching, shared drive, and database. And n8n, which we connected to a read replica, actually.
So that also didn't put any load onto the database for Moodle. And we put all of that, or the files at least, behind CloudFront.

So my last recommendation for you guys is to use and contribute to open source. We could not have done this without Moodle, without it being open source, without being able to go through the code and just see what it's doing in order to fix bugs or make an improvement. And I think if you use open source, you need to contribute back. So those are my recommendations. Got any questions?

Hello. First, thank you for a great session. My question is, if I recall correctly, you mentioned that when you were talking about the caching, you first started with Memcached, and then later you moved to Redis. What was the reason for that? Did you find some performance improvement when moving to Redis over Memcached, or what was the reason for the change?

I think, actually, I think it broke, and that's why I did it. Originally there wasn't support for Redis. And then I think I had some issue with Memcached and the PHP drivers, so I had to move over to Redis, and it worked a lot better. So at this point I recommend Redis.

Okay, thank you.

With perpetual courses, things can get tricky with course completion. How do you handle versioning?

Not well. That's something that we've actually been trying to work on, just telling the students that the course has been upgraded, that it's been changed. At least as far as exams go, the student still has that completion, and they can still look at their exam results and everything, so that's still there. And then a new version of the exam comes out. So they've still completed the exam, or they've completed the course. They still have the exam completion. There's just going to be a new exam that they didn't complete, but I assume at that point they're not going through the course.

Hello, thanks for the talk.
Some months ago, Anduin from Moodle HQ warned in a web session against using GlusterFS for Moodle data because of its bad performance with a large number of small files, which Moodle data obviously is. My question is, have you considered alternatives, and have you encountered performance problems with GlusterFS? And a second question: are you using the GlusterFS client on the web servers to access the data, or NFS to access the GlusterFS servers, or another method?

Yeah, so we were using the GlusterFS client, because these were Ubuntu boxes. So we're using that to mount. Yeah, I agree. I had a lot of performance issues, actually. At this point, that's where the bottleneck in that infrastructure is. So I started looking at alternatives and started to test those out. The next thing I wanted to test was EFS, to see if that had any better performance, maybe. Check out other things, something like that. But that's the next project, definitely.

Hi. You mentioned that you're running this on one database. If I may ask, which database are you using? Are you doing any replication or high availability? And how do you handle concurrent sessions?

So for the database, this is all in AWS, so the database was run through RDS. We've been using MySQL since 2015. Yeah, it's RDS, and it was the Multi-AZ setup, so we could do upgrades and there wouldn't really be any downtime with that. And we could scale it up. Does anyone have any AWS experience in here? Oh, okay. Yeah, so at the beginning it was basically T2 smalls, and then I brought it up. More recently, by the last slide, we're using T4 larges for the web servers. The database went up to large and extra large. And do you have any more detail on your question about the concurrent users? I'm not sure what you want to know about that.

I meant mainly load, because at times when so many users are connected at the same time, the database would malfunction because of the number of concurrent sessions.
In other applications, the concurrent sessions are handled on the web server and not directly on the database. And I was curious about your architecture: are you handling the connections directly to the database? And what's your mix?

Well, I guess I let PHP and Moodle handle that. Between the web server and the database, it sets up its connections. I didn't talk about tuning. Obviously we've done a lot of tuning and tweaking of different settings for that kind of thing, and off the top of my head I don't remember our settings for that kind of stuff. But I think most of the load for the students, at least for the concurrent users, is coming from those multiple front-end web servers that are running the Moodle application. That's all behind a load balancer, so it's going between each one, sharing the load. If one of them slows down or stops responding, all the students get sent to the other ones. That's probably where you're going to get more of a performance increase. We can talk a little more after if you have some more specific questions about that.

Hello, two questions. Have you considered the auto-scaling capabilities of AWS, based on your load, to save some money? And the second question: are you using any monitoring tools to monitor the bottlenecks or the issues that you've explained?

Yes. So we've scaled up a little bit. The next step would be scaling out, like we did with the admin tier. And then depending on... I was really looking at the amount of RAM, which I think would be the biggest issue, which is why I scaled up the instances. That's one of the benefits of using something like AWS, so we definitely did that. And then it would help... Actually, sometimes we'd run into that and we'd fix it. But I think you need to not rely solely on scaling up and brute force like that. You have to think about caching and all these different layers and where the bottleneck is.
I just have to ask: the core_user_get_users fix, was that contributed back to core?

It wasn't, because it was very specific to our use case. If anyone thinks that would be helpful, or useful as a separate thing, I'd be happy to, because it went through and looked at what cohorts that user that we had set up to access the API, what they had access to in this separate database that we had,