Great. Thanks everybody for coming to my talk, Scaling Through a Pandemic. It's great to see everyone in person after so many years, and I think everyone's really enjoying their time here, getting to chat and meet people face to face. So who am I? I'm Kim. I'm the technical director and co-founder at PreviousNext, and we make the Skipper hosting platform. Today I'm going to cover how we survived massive web traffic during the pandemic. Some of this talk is going to be about the Skipper hosting platform and AWS, but it's definitely not a sales pitch or a product promo. I'm hoping to share some of the lessons we learned through this that you might be able to apply to your own environments, wherever you're hosting. Hopefully there are some good tidbits in there for you.

Okay, so: "unprecedented". We heard that word quite a lot. The virus affected everybody in lots of different ways, and we were living in a pretty fast-changing environment. A lot of us were tuning in daily to government briefings, trying to find out how our lives would be impacted. Our government agency clients found themselves suddenly becoming the go-to source of information during the pandemic. People were visiting their sites for news, guidance, those kinds of things, in unprecedented numbers. All of a sudden, information and services from government became critical. What were the daily case numbers? What were the new lockdown rules? Where should I go to get vaccinated? Or, as a business, where do I go to get my QR code to allow people to check into my venue? And in a very short period of time, we saw a massive increase in web traffic. This is a graph of the number of HTTP requests, starting around mid-June 2021, I think it is. You can see we're trucking along, and then all of a sudden we get these massive traffic spikes.
At one point on this graph, we're looking at around 360 million requests in a day. And these are daily averages, too — this isn't the peak of traffic at one point, this is average traffic over a day. There were big spikes, especially around the morning briefings. And this continued well beyond this graph — it kept going. Meanwhile, we were able to keep error rates very, very low. This is a little bit of a flex, but we kept our error rates under 0.01% during this time — over exactly the same period as those traffic spikes. So that's something we're very proud of.

What my talk today is really about is giving you three key ways to help you scale through pandemic-type events. How do you build your infrastructure? How do you make your application scale during these kinds of massive traffic spikes?

Step one is about having a solid caching strategy. Caching is the first and best line of defence for your infrastructure. And funnily enough, I think the last time I spoke in Brisbane was in 2011 at Drupal Down Under, and I gave a talk on performance and scaling then too. So 11 years later, I'm still going.

Okay, so first of all, it's key to understand that caching occurs at multiple levels along the way as a request goes through. Firstly, the browser might not even need to make a request to the back end — if it's got a local cached copy, it'll use that instead. You could be behind a corporate proxy, or your ISP may have a reverse proxy; it can cache there. Then we get into the cloud infrastructure, where things like CDNs are doing a lot of the heavy lifting, caching for your Drupal site. And then your Drupal infrastructure itself has another level of caching: you could have in-memory caches, you've got things like dynamic page cache in Drupal. There's a bunch of caches all along the way.
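As a rough picture, a request can be answered at any of the layers just described, and the further left it's answered, the less work your infrastructure does:

```
browser cache
  → corporate proxy / ISP reverse proxy
    → CDN (e.g. CloudFront)
      → Drupal page cache / dynamic page cache
        → in-memory cache (e.g. Redis)
          → database
```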
The key is making sure that all of those are working efficiently, and you really want to keep your cache hit ratio as high as possible. That's essentially how much your cache is actually being used — how many requests are hitting your cache as opposed to going through to your back-end infrastructure. You want to keep this as high as you can. The recommended line you can see here is something we use for most sites: making sure sites are always above 75%. And during this same period, we were getting cache hit ratios above 99% for our infrastructure. So very, very high.

So how did we achieve this? Well, the first thing is to serve cached items for as long as possible — you want to reuse your cache items. Cache control headers are all set from Drupal, and those headers then get used by all the downstream caches along the way. We're looking at an example here of what a response from Drupal and CloudFront might be, showing the cache headers. We've got max-age, which is in seconds — that's basically telling the browser to cache for 600 seconds, which is 10 minutes. And then we've got another one, s-maxage, which tells the proxies how long to cache items for.

Why are these two numbers different? Because once content is in the browser cache, the browser isn't making requests back anymore. So if you want to control when your content gets refreshed or the cache expires, you can't really do that in users' browsers — you don't have access to their browsers. What we do instead is set a very high time-to-live in the proxy cache, because we do have control over that: we can programmatically invalidate caches in a proxy such as CloudFront. And these values aren't set in stone.
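Putting those two directives together, a response might carry headers like this (the s-maxage value here is illustrative — a 30-day proxy TTL — not the exact figure from the slides):

```http
HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Cache-Control: max-age=600, public, s-maxage=2592000
X-Cache: Hit from cloudfront
```

Browsers honour max-age (10 minutes here), while shared caches like CloudFront prefer s-maxage when it's present, which is what lets you keep the proxy TTL long and rely on invalidation instead.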
They could be higher or lower depending on your situation. You may have a greater tolerance, or greater confidence that you're able to invalidate everything you need to, and then you could set this way higher — 30 days, or probably up to a year if you're really super confident. It's also about how fresh your content team needs that content to be. Ten minutes is seen as a good point to keep the content fresh, while still making sure that when they make changes to content it gets refreshed in users' browsers.

Because we have this long-lived cache, we obviously need to be able to purge when content changes. When a content editor goes in and makes a change, we want to make sure that content is refreshed, and we're using the Purge module for this. The Purge module is like a framework, if you haven't used it before. It comes with a number of different plugin types: you can have queuers, you can have processors, and then plugins that do the actual invalidation.

Some important notes about using CloudFront as a CDN: it's quite expensive to do invalidations, so if you're doing invalidations all the time, the bills rack up really quickly. On top of that, there are limits on the number of invalidations you can do in a single API call, so you're restricted in how much you can purge at one time. And because we wanted fine-grained control over purging, we actually wrote our own custom queuer — that's the thing that queues items up to be invalidated. You should really customise your own: use your business requirements and how your website is structured to define how you actually want to invalidate things. For example, Service New South Wales has a notification block that appears on transaction pages, with special rules around how and when it appears, all throughout the site.
So if someone made a change to a notification, we needed custom logic to go and find all the pages that block appeared on, and make sure those were added to the invalidation rule. And you might have something else, like: whenever I make a change to a page in this section, make sure the landing page gets invalidated, because we know we're showing that content in a list on the landing page — those kinds of things. Basically, the rules around your site should influence how you build your custom queuer.

After that, we use what's called the late runtime processor. When a content editor makes a change, we return the response to the browser, and then, using the Symfony kernel terminate event, we run the purge process. That's something you can use if you're running PHP-FPM, because it supports sending a response before it's actually finished processing. If you're using mod_php on Apache, it doesn't work, because that has to send everything in one go. And then the CloudFront purge module is really just calling the CloudFront API through the PHP SDK. It's a pretty simple module that just makes a request — you obviously need to set up things like authentication — but essentially it's not super complex, and it works. You can plug it into a number of different Purge module setups.

So that's caching things for as long as possible, and being able to purge whenever content changes. Then there are a couple of things you want to do to protect your site if things start to go wrong. CloudFront has a somewhat hidden feature to serve stale content — content that has essentially expired in the cache.
Regardless of what your cache headers are set to, if a request comes in and CloudFront tries to refresh content from Drupal but gets an error, it will say: well, I already had a copy here that's old, but I'll serve that anyway. That's serving stale content, and it's better than your users getting an error. This was really important for the COVID pages — we really wanted people to be able to view COVID information rather than suddenly getting an error, even though we had very low error rates. This is all being very cautious. To do that, you need to set what's called a min TTL — a minimum time-to-live — in your CloudFront configuration, and then it will start serving stale content. This can be as little as one minute, something like that; it essentially just triggers the feature.

On top of that, we also configured CloudFront origin groups. Origin groups allow you to set up multiple origins; CloudFront will go through the list and try the first origin, and if it can't reach it, it'll try the second one. We used a nightly cron job that created a static copy of the COVID pages and stored them in an S3 bucket, and that was the second origin. So the first origin is Drupal, and if a request to it got an error or timed out, CloudFront would go and get a copy of that page out of the S3 bucket and return that. Again, that follows the principle that we would much rather serve stale content than an error — at least users are actually getting that content.

And then on top of that, static error pages. CloudFront can sometimes run into issues where it can't contact the back end, and the default error pages are pretty nasty if you've ever seen them. So we set up some nice-looking custom HTML error pages.
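As a rough sketch, the origin-group failover described above can be expressed in CloudFormation something like this — the origin IDs and status-code list are illustrative, not our exact configuration:

```yaml
# Fragment of an AWS::CloudFront::Distribution DistributionConfig.
OriginGroups:
  Quantity: 1
  Items:
    - Id: drupal-with-static-fallback
      FailoverCriteria:
        StatusCodes:            # fail over when the primary returns these
          Quantity: 4
          Items: [500, 502, 503, 504]
      Members:
        Quantity: 2
        Items:
          - OriginId: drupal-origin      # primary: the Drupal back end
          - OriginId: s3-static-copy     # secondary: nightly static snapshot
# Elsewhere in the config, a non-zero DefaultCacheBehavior MinTTL
# enables the serve-stale-on-error behaviour mentioned earlier.
```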
Again, they live in an S3 bucket, and you can configure CloudFront to use those error pages if there's an issue getting content from the back end.

To make our cache as efficient as possible, we want to reduce the variations in our cache key. What that means is: when CloudFront caches a page, it's caching by a combination of the URL and any headers, cookies, or query strings. What we try to do is only include the ones that are actually essential for Drupal. You might see things like Google Analytics query strings or Facebook query strings that get added on when a link is shared on one of those social networks — they're essentially just dividing your cache up into multiple entries. If we strip those out, we're reusing the same cache item, and we can maximise cache reuse. Things like session cookies are obviously important for content editors, so we allow those through — though when you're a content editor, you're not really looking at cached copies of pages anyway. Again, we're being very specific, only including those cookies; any other cookies that Google Analytics might set, we just ignore. They never go back to Drupal.

So now we've built up our caching strategy: we're caching things for as long as possible and getting really good cache reuse. How do we know we're still doing that? How do we know that some code that's just been shipped hasn't broken a page — hasn't caused it to become uncacheable? This happens quite a lot. Even things like contrib module updates can cause it — I think Admin Toolbar did that recently, right? All of a sudden, a page that was caching fine before isn't fine anymore. So first of all, we set up some automated testing for critical pages through our CI tools: for the ones that are really, really important, we make sure they're always cacheable.
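A minimal sketch of that kind of cacheability check — not our actual CI code — might pull the Cache-Control header for a critical page and fail the build when the page stops being cacheable:

```shell
# Reject responses that are private, uncacheable, or have a zero max-age.
check_cacheable() {
  case "$1" in
    *no-cache*|*no-store*|*private*|*max-age=0*) return 1 ;;
    *max-age=*) return 0 ;;
    *) return 1 ;;
  esac
}

# In a real CI job the header would come from something like:
#   header=$(curl -sI "$URL" | grep -i '^cache-control' | cut -d' ' -f2-)
check_cacheable "max-age=600, public, s-maxage=2592000" && echo "cacheable"
check_cacheable "private, no-cache" || echo "not cacheable"
```

Running a check like this against each critical page on every deploy is what catches the "a module update quietly made this page uncacheable" case early.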
CloudWatch dashboards were also really valuable in monitoring this over time. I don't know how many of you saw Carl's talk this morning on CloudWatch dashboards, but if you watch the video later on, you'll get a deeper dive into that. Essentially, we're ingesting all of the CloudFront logs — all the web traffic logs — into CloudWatch Logs, and then we can query that for any pages that aren't cacheable. That way we've got a tool to ask: how cacheable is the site, are there any issues? And you can do things like set up thresholds and alarms, so you get notifications if pages are no longer cacheable, or if the ratio of uncacheable pages on the site goes up.

And this is going further and further back in the stack now. Drupal, as many people would know, makes a lot of database queries. A common problem with high-traffic Drupal sites is database deadlocks from trying to read and write the cache tables in the database. We can sidestep this issue by using an external in-memory cache such as Redis, and configure Drupal to use those cache bins instead of the database. We're using ElastiCache Redis on AWS, so it's managed. It doesn't really have a nice way of auto-scaling — it's probably the one thing in the stack that doesn't auto-scale on demand — but it requires a lot of load before you start reaching the maximum capacity of Redis. It actually became the bottleneck in our load testing at one point, once we had gone through and fixed all the other issues.

Okay, so that's step one. Step two — and this is general infrastructure stuff — is making sure we can scale in and out on demand. We need auto-scaling infrastructure. The Skipper hosting platform is built around a managed Kubernetes cluster and other AWS managed services.
There are a couple of things we can do to manage auto-scaling with this setup. The first is part of Kubernetes itself: horizontal pod auto-scaling. If you're not familiar with the terms, a pod is just a wrapper around your compiled Docker images running on the Kubernetes cluster, and pods run inside a node, which on AWS is just an EC2 instance. So we've got a bunch of nodes with a bunch of pods running in them, and the horizontal pod autoscaler essentially monitors CPU and memory use. If it hits a certain threshold, it says: okay, we need to deploy more pods, or remove pods. The pods just go up and down — you can set a minimum and maximum number of pods in your deployment, and it will scale them in and out as you need.

All right, so that's Kubernetes scaling. On top of that, you also need what's called cluster auto-scaling. This is for when you run out of nodes: your pods run inside nodes, and when you run out of nodes, you need a way to tell AWS that you need more capacity. The cluster autoscaler can manage that for you. The nice thing is that things like EFS are automatically mounted, so you can add new instances and things just work. And in terms of things you get for free on AWS: we don't really need to worry about scaling the CloudFront service itself, or the Elastic Load Balancer, or even EFS — those are highly available and scalable by default, and we don't need to intervene with them at all.

And then, diving a little deeper into the stack: as part of this, you of course need to be using immutable artifacts — things that, when you deploy them, you don't need to do anything extra to add more or remove more.
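Going back to the horizontal pod autoscaler for a moment, a minimal manifest looks something like this — the deployment name, replica bounds, and threshold are illustrative, not our production values:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: drupal-fpm            # hypothetical deployment name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: drupal-fpm
  minReplicas: 4              # floor for normal traffic
  maxReplicas: 40             # ceiling for spike events
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add pods when average CPU passes 70%
```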
You've decoupled the state of the application from the application itself being deployed and running. Many people know the term "twelve-factor application" — it's essentially making sure that your configuration and your state are external to your running application. And Skipper is managing that for you: it makes sure the pod has access to configuration and anything else it needs, like database connections, and the file system is automatically mounted for you. So it's really a requirement for being able to autoscale that your application artifacts are immutable. You can do things like use read-only file systems to prevent yourself from actually making changes to the application. Imagine you shell into your pod, manually make a change, and then the pod gets destroyed and a new one created — it's kind of pointless, right? So when we actually deploy in Skipper, essentially all we're doing is compiling a Docker image, pushing it up to ECR, which is the container repository, and then the deployment just pulls that back in.

Okay, so PHP — a pretty simple one if people aren't familiar with how you can properly tune PHP resources. Each pod has a maximum amount of memory it can use, and in PHP you set the maximum memory per process — how much a single process can use. This is really about tuning, finding the sensible level of max memory. Total memory divided by the max PHP memory gives you the maximum number of PHP processes that can run inside your pod. So the idea is to reduce the cost of a single process — optimise it down, reduce your max process size — so you can run more processes at the same time. If you halve the max memory, you get double the processes, basically.
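The arithmetic is straightforward — a sketch with made-up numbers, not our actual limits:

```shell
# Pod memory budget divided by per-process PHP memory_limit gives the
# ceiling for the FPM pool's pm.max_children (numbers illustrative).
POD_MEMORY_MB=1024
PHP_MEMORY_LIMIT_MB=128
echo $((POD_MEMORY_MB / PHP_MEMORY_LIMIT_MB))        # 8 workers
# Halving the per-process limit doubles the worker count:
echo $((POD_MEMORY_MB / (PHP_MEMORY_LIMIT_MB / 2)))  # 16 workers
```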
So it's something to focus on. Okay, we're up to the third point for scaling: we need to protect the database. By that I mean the database has traditionally been the bottleneck of most Drupal sites in terms of scaling. As you know, there are many, many database queries going on in Drupal, so how do we prevent that from affecting the database?

First of all, we use Amazon Aurora because — according to the marketing documentation, at least — it provides five times the throughput of plain RDS MySQL. We'll take that and leave it there. It also has the concept of read replicas: you've got the master and then the read replicas. Those replicas are read-only, but you can use them for things like exporting reports that make big database queries, and because they're up to date, you can run your backups off them too, so you're not actually affecting your main production database. It also means that scaling up or making changes to that infrastructure is a lot smoother and quicker, because it promotes a replica to master and then swaps it over for you. So if you have to upgrade your database, it's not a stop-the-world event where you wait for it to spin up and then move on — you can do it in a pretty seamless way.

But more important is RDS Proxy. We had a massive improvement when we switched to using this, and the main reason is that Drupal doesn't use connection pooling. It needs to create a new connection to the database on every request that comes in, and creating database connections is very expensive.
It adds a significant amount of CPU and memory load on the database end, and when we were doing early load testing, it was actually creating those connections that caused the application to bottleneck, which would chain-react all the way up the stack. What RDS Proxy does is keep its own pool of connections alive. You basically say: I want to have 80% of all possible connections to the Aurora database, and it will keep those connections alive for you. And it's very good at creating new connections — it's very fast. So Drupal says, I need a new connection, and RDS Proxy says, yep, here's one, and then it just proxies straight through to the Aurora database.

That essentially made a massive difference, and this graph shows it against typical usage: you've got connections from the application going up and down, being created all the time, whereas over the same period, the number of actual connections from the proxy to the Aurora database is just a flat line. A massive, massive improvement. Once we employed this, the throughput on our load testing went way, way up — I'm not going to quote the actual numbers, but it was something like 10 times the throughput we could get. And that's when we ran into bottlenecks with things like Redis. It was very surprising to find something as efficient and purpose-built as Redis being the bottleneck from then on.

Okay, so this is moving into how you actually reach this state — how do you get there? Multiple rounds of load testing were really valuable in helping us get there. I think we've all been there in the past: someone says, oh, you know, do some load testing before you go to production. You go, yep, no worries, I'll run ab, Apache Bench.
That's fine, but you're basically just hitting the cache the whole time, and the cache is pretty good. What you need to do is make sure you're actually testing your infrastructure: bypass the cache, whether that's using a cookie or a session cookie or something like that, or doing POST requests — make sure you're load testing your actual infrastructure rather than just the CDN or cache. We used K6 for our load testing. It's an open source project; you can download it and run it on your own infrastructure, or there's a cloud service you can run it from. There are many options available, and it doesn't really matter which you use. The main thing is having the ability to measure what that load test is doing in terms of application performance. You can use something like New Relic to look at the application — what's going on, what code is actually running — and try to fix bottlenecks in your code, and we use CloudWatch dashboards to see the impact on all of the infrastructure. We try to get as many metrics as we can to see where things are going wrong. Again, I mentioned Kyle's talk, but this has been really, really useful, because we're able to say: there's another problem, let's dig into why that's slow. Doing that repeatedly — fix a problem, do more load testing — over a period of time gets you to a point where you're actually comfortable with the performance of your application. It doesn't come from just throwing it out there.

So, a quick recap. In the pandemic, government services became critical — everyone needed access to all that information — and that had a massive impact on web traffic on these properties. We were able to respond in three key ways.
First, having a solid caching strategy. Second, making sure you're running on auto-scaling infrastructure — in our case, Kubernetes on AWS. And third, making sure you've got an efficient and protected database, with an RDS proxy. Of course, all of this is provisioned and managed by Skipper. Thank you for your time — here's my contact info if you want to get in touch. We have a mic if anyone's got any questions; there's a couple here. If it gets too technical, I've got Nik on standby to answer questions.

You briefly talked about Drupal caching as well. What's your recommendation about BigPipe with this kind of stack? Do we need to enable it, or just keep it disabled?

I don't know — I don't think BigPipe solves the problem for everybody. I know there are a few issues around it. Actually, something I didn't mention: we had our own thing to set the s-maxage proxy caching header, and there's a contrib module — I think it's called Cache Control — that probably does that and more. I don't think we've used it, but that's something you could look at as well. Sorry — maybe Nik can answer questions about BigPipe. Who's next?

Hey Kim, thanks for that. Earlier on in your talk you talked about cache invalidation, and I think there was something about using some kind of custom module to help invalidate cache. Is that using Drupal cache tagging?

No. The custom code is really just a custom plugin for the Purge module. The Purge module has all these different types of plugins, and the custom code we wrote was a queuer plugin that just tells it what to queue — what to add invalidations for. So it's still using the rest of the Purge module. I think there is a tags-based module, but we found it was way too inefficient — it was doing way too many invalidations — and we wanted really fine-grained control over how to invalidate pages.

Hey Kim, thanks for that.
Question about the origin groups, when you had the failover to static snapshots: do you have any data on how often that was hit?

Do we have any data on how often that was hit? No — and on a lot of this stuff, as I mentioned before with things like the static error pages and failover, we were just being super, super cautious, so I don't think we ever actually used it. It was there to make sure we had a fallback if something went wrong. We wanted to make sure it couldn't fail, basically, so we had to have something in place to support that.

Question about content purging. What if you have a daily import process that updates thousands of records daily? I had a previous project where we had content purging, but it overloaded the purger just because there were so many records. How do you handle that situation?

I think I mentioned it before, but we do have content that gets changed on a schedule — there are nightly cron runs that update publish status and things like that. We also run a drush purge command line as well. And again, it's about the efficiency of your invalidation: you pay per invalidation, but you can do wildcard invalidations too. So it's going to be based on your custom setup, but if you know that every night you need to purge everything under reports/*, then that's a single invalidation — you're not paying for thousands of them — and you're fine.

Question about the PHP resources. You mentioned that you want to control memory to have more processes running inside the container. As part of that, are you running multiple processes inside one container?
No, it's more about the PHP-FPM process pool. We want to maximise the number of processes we have running in a single pod, and we can do that by setting the PHP max memory to a low enough level that the application isn't going to run out of memory, while still maximising the number of processes. So yeah, we're just running PHP-FPM. And we're out of time — okay, thank you, everybody.