Hey everybody, welcome to Drupal South Shorts 2021. Hope you're all having a great day. My name is Scott Massey and I am the Managing Director of International Regions for Pantheon, and it is my pleasure today to be officiating, or moderating, this session with two stalwarts of the Drupal community in Australia. Today we have Nathan Turbot, the tech lead, and Simon Hobbs, the solution architect at the Department of Customer Service for nsw.gov.au, to talk about something I've thought about probably every single day for the last 15 years, which is high availability and Drupal. They're gonna talk about that and share their thoughts, and I'll moderate the questions — I've been told to interrupt as needed. So feel free to post your questions and we'll fit them in during the talk, or we'll have time for questions after. With that, guys, take it away.

Thank you. Hi everyone, Simon Hobbs here. Thanks for joining me and letting us share some knowledge about how we're approaching high availability for nsw.gov.au. In this presentation I'll touch on the following things: I'll give you an overview of what we call the OneCX project, talk about the performance expectations on nsw.gov.au, and run you through briefly what our infrastructure looks like. Then I'm gonna delve into specific topics such as cache hit ratios, PHP footprint, database strategies, and also types of transactions — that is, cacheable versus non-cacheable types of traffic and how we go about approaching them.

So the OneCX program is our name for, effectively, our project to bring a whole bunch of websites into nsw.gov.au — to avoid duplicating content and avoid spinning up microsites when we don't need to. We wanna go from a fractured landscape to a consolidated landscape for our content. We're supportive of open source.
So we're building up some very nice open source practices. And some of the reasons why it's good to work for New South Wales — this is the semi-pitch, I guess — are that we're making an impact, especially around COVID at the moment, we have some challenging projects, and we've got a really nice flexible working environment and a lot of diversity in the workplace. Today here we've got myself and Nathan. We sit in two separate sections of the project: Nathan's technically in the platform operations team, and I work as a solutions architect in the OneCX delivery team. You can always find us on Drupal.org Slack in the channels I've listed below — if you're not already in the Australia/NZ channel, please join and we'll hit 500. Big disclaimer for today: we're sharing our experience for our particular project. It's not supposed to be a catch-all for every Drupal high availability project, so your mileage may vary. We're not telling you how to run your site.

Performance expectations. We've been pressure-tested by COVID. We went live with the new nsw.gov.au nearly two years ago now, and you can see from this graph that in about March the traffic started to hit us fairly hard — we thought at the time. But it wasn't until this year, when we were getting some cases of two million active daily users, that we really experienced traffic at its finest. This graph just shows the data aspect of that — pumping out a lot of data. I don't know exactly how the Alexa ranking works, but Alexa has been ranking us quite highly. You're seeing it start to drop off now as the impact of COVID becomes less. Probably the biggest impact for us is that we're unable to change our DNS easily anymore because AWS flags us as a top-1,000 site, so we have to go through extra hoops, which is fun.
But probably one of the biggest things we need to handle with our site is the diverse functionality. We're effectively a single Drupal instance serving whole-of-government for New South Wales. You can divide that between Service NSW and Customer Service NSW — we're not responsible for you changing your driver's licence and so forth. But that said, we do have a lot of data solutions on the site, COVID services, hundreds of forms when you look at some of the stuff we're doing with COVID, over 50 editorial groups with devolved authoring, and at least five development streams at any given time. So we have a bunch of stuff going on.

Some people say to me, "Oh, are you using GovCMS?" So it's worth clarifying how we differ from some of the other major projects. SDP in Victoria has a Drupal content instance with decoupled front-ends on top of it. GovCMS is just a bunch of Drupals — Drupal, Drupal, Drupal, Drupal, Drupal. We are much more similar to Mass.gov in our structure: a single Drupal site with the front-end built in. We do have a decoupled roadmap, but we don't have a distro at the moment — we are basically a single site.

Our architecture is roughly like this, and should be pretty familiar to anyone who has worked on a large Drupal site. We're running on the Skpr hosting platform, so most of our stuff is pretty vanilla AWS containers and so forth, with the Skpr orchestration sitting underneath it. The Drupal part is pretty obvious: we've got our PHP-FPM pool and NGINX pool, plus Aurora and Redis, which would be familiar to most people. We use CloudFront, and we've more recently implemented our API Gateway and Elasticsearch components, which are going to be handling some of our more chatty services. And then in blue, we have a lot of integration endpoints.
Whoops, I've just gone forward without meaning to — let me just go back again. We have a lot of integrations in the back-end. Obviously we have to send emails, but we also have a lot of CRM endpoints in the back-end, and we integrate with Sajari for the main site search.

One of the key goals we have as a platform team is to try and keep our cache hit ratio as high as possible — effectively we're aiming for 97%. Last year, for example, we weren't doing too badly, but we were getting a lot of drops when caches were cleared for whatever reason. We had to keep our cache lifetimes to 15 minutes pretty much universally to make sure that stuff got cleared. It wasn't until we implemented some specific clearing strategies, and basically kept content in cache, that we were able to get that consistent 95–97% cache hit rate. That's really important to us because we see ourselves as a public content site that needs to stay up and needs to be cacheable so that we can just serve it from cache — the odd cache miss is fine. And just to focus in on what that looks like: these little dips are more about stuff that happens overnight than anything you see during the day.

So our approach to content is to make it sticky, but maybe not too sticky. Currently we are doing a 24-hour cache for our Drupal content. We have CloudFront configured in a really transparent way, which is always good — it's respecting what's coming out of Drupal. We do have this little bit of logic, something Nathan implemented recently which is working quite well: we take what Drupal says, which is, say, 24 hours, and set the s-maxage for CloudFront to that 24 hours, but then we set the max-age for the browser to 10 minutes. That's based on the idea that most of our traffic is people visiting the site for under 10 minutes.
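That split can be expressed in a single Cache-Control header. The values below (24-hour CDN cache, 10-minute browser cache) are the ones from the talk; the exact header your site emits will depend on how Drupal and CloudFront are configured:

```http
Cache-Control: public, max-age=600, s-maxage=86400
```

CDNs like CloudFront honour `s-maxage` in preference to `max-age`, while browsers ignore `s-maxage`, so one response header gives the CDN a 24-hour TTL and the browser a 10-minute one.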
So for any of that dynamic Drupal content they get, if they come back the next day or a few hours later, that'll be clear. And that gives us the ability to clear some sensitive things if needed. One high availability strategy you can deploy is just to statically build your site from the Drupal content. We opted not to do that because we have a whole lot of pass-through — form data and things, the content editing experience, stuff that we want to be more dynamic than a static site. The goal here is to pretend we're a static site: in the CDN, to all browsers everywhere, it looks like we are for 24 hours. So anywhere we can, we're updating cache headers, we're updating max-ages, we're doing everything we can to make it look like we're a static site. In addition to that, if the Drupal instance returns an error, CloudFront will still serve the old cached page to the user — so from that perspective, we look static as well. That's the end goal. Yeah, that's good — that's not something I showed in our slides, using the stale CloudFront cache if we have a problem with content.

Just to briefly run down some of the strategies we use to maintain cacheable pages: search through your code base and look for all those max-age and set-max-age snippets of code, because it's quite common to see a custom block set its max-age to zero, and that will pollute the whole page's max-age. Obviously, look around for anything that's starting sessions, and never set cookies on your server side. And if you've got an HA site where you're trying to resolve caching issues, have a look at the code in this project called renderviz, because it basically demonstrates how to inspect the render array and pass classes through to the front end so you can visually mark up page components which are setting a max-age of zero or some other max-age.
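The inspection idea boils down to walking the render array and flagging anything that declares a max-age of zero. Here's a minimal sketch in Python — a nested dict stands in for a Drupal render array, and the function and data are invented for illustration (the real tooling, renderviz, does this in PHP against actual render arrays):

```python
def find_uncacheable(element, path=""):
    """Recursively collect paths of render-array-like elements
    that declare max-age 0 and would poison the page cache."""
    hits = []
    if isinstance(element, dict):
        cache = element.get("#cache", {})
        if cache.get("max-age") == 0:
            hits.append(path or "<root>")
        for key, child in element.items():
            if not key.startswith("#"):  # '#'-keys are properties, not children
                hits.extend(find_uncacheable(child, f"{path}/{key}"))
    return hits

# A page where one custom block poisons the whole page's max-age.
page = {
    "header": {"#cache": {"max-age": 3600}},
    "content": {
        "custom_block": {"#cache": {"max-age": 0}},  # the culprit
    },
}
print(find_uncacheable(page))  # ['/content/custom_block']
```

Once you can list the offenders like this, you know exactly which blocks to fix before they drag the page's cacheability down.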
Clearing the CDN: part of this caching not just being set-and-forget is actually having a strategy for proactively clearing the CDN. When we decided how to do this, we did look at the Purge module, and we determined that it was a lot of code for something we were going to want a lot of control over. So our setup is pretty simple — this diagram doesn't tell you anything interesting in the sense that it just shows that when content is updated, we connect to CloudFront with an IAM policy, so a service account, in response to, say, a content update via an entity update hook. We're calculating the content that is affected by a content change, and to do that we're mainly using the Entity Hierarchy and Entity Usage modules. So pretty much we just build up a list of pages that are affected by the change. For our site, pretty much everything is a node, and because everything's a node, those modules give us a lot of coverage in terms of the things that are affected.

Yeah, one addition I wanted to make: part of our strategy was about identifying key content — what mattered and what didn't. We were talking about the Purge module: Purge uses cache tags and says, if this tag clears, clear all this stuff. We want to be more deterministic than that. So it's things like: if you create a media release, then obviously the media release listing page needs to update, because there's a new one. But on the 20 other pages where there are related media releases, we made an informed decision that those can wait 24 hours. That doesn't matter — we're not gonna clear those. They can stay static, and after 24 hours that related content will update. It's not actually gonna affect the user. So it's about being very deterministic as to what you clear and when, to keep your cache hit ratio high.
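That "build up a list of affected pages" step can be sketched roughly like this. This is Python for illustration only — the real implementation is PHP inside their custom CDN module, driven by the Entity Hierarchy and Entity Usage modules, and the function and data shapes here are invented:

```python
def paths_to_invalidate(changed_path, parents_of, listings_for):
    """Given an updated node, collect its own path, its ancestors from
    the entity hierarchy, and any listing pages that must show the new
    item. Related-content blocks elsewhere are deliberately left to
    expire on their own 24-hour cache — the deterministic trade-off
    described in the talk."""
    paths = {changed_path}
    # Walk up the entity-hierarchy chain so section pages get cleared.
    node = changed_path
    while node in parents_of:
        node = parents_of[node]
        paths.add(node)
    # Listing pages that directly surface this content.
    paths.update(listings_for.get(changed_path, []))
    return sorted(paths)

# Hypothetical data: a new media release under /media.
parents = {"/media/release-42": "/media"}
listings = {"/media/release-42": ["/media/latest"]}
print(paths_to_invalidate("/media/release-42", parents, listings))
# ['/media', '/media/latest', '/media/release-42']
```

The key design choice is what's *not* in the returned list: pages that merely reference the changed node are allowed to serve stale content until their 24-hour TTL expires.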
We added on top of that a form for content editors to manually clear pages, so they can enter a URL into a form and it will pass an invalidation request straight through to CloudFront. That's a pretty well-liked form, because it gives them the ability to pick up stuff that's not being picked up automatically.

Cool — Nicholas in the crowd asked whether those CloudFront invalidations are expensive or not. Yeah, I cover that on this slide actually. The 24 hours is working really well for us, and we think we could extend it if we wanted to, but we don't feel that we need to — 24 hours is the sweet spot at the moment. We know that CloudFront, as opposed to some other services, doesn't have support for native cache tag invalidation, so when we were looking at how to approach it, obviously we considered that. We also had to look at the fact that after a thousand invalidations per month, CloudFront starts charging. At the moment we've found we're in a comfortable sweet spot, but because we effectively control that code with our own class — and we feel like 500 lines of code in our own class is a lot simpler than using the Purge module and all of its sub-modules — we're able to put things in a queue, potentially make that more efficient if we need to, and start to pare back some of the really chatty invalidations that we're doing. So we're gonna be monitoring that and making changes as we go.

Yeah, one of the things we do in that exact space, as I said: we have a queue, we put the invalidations in the queue, and then we clear them every three minutes, I think — I'd have to double-check. The reason for that is that if 50 content editors make 50 content changes all in the COVID section of the site, and one of them clears /covid/*, then we can make some assessments: okay, there are all these clears, but /covid/* covers 90% of them.
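That coverage check — spotting that a queued wildcard makes the per-page clears redundant — can be sketched as a simple filter over the queue. Python sketch only; the production logic is PHP in their CDN module, and the function name is invented:

```python
def dedupe_invalidations(paths):
    """Drop any queued path already covered by a queued wildcard, so one
    CloudFront invalidation for /covid/* replaces dozens of per-page
    ones. This keeps the monthly invalidation count (and the bill) down."""
    wildcards = [p[:-1] for p in paths if p.endswith("*")]  # '/covid/*' -> '/covid/'
    kept = []
    for p in paths:
        if p.endswith("*"):
            kept.append(p)  # always keep the wildcard itself
        elif not any(p.startswith(w) for w in wildcards):
            kept.append(p)  # keep only paths no wildcard covers
    return kept

queue = ["/covid/rules", "/covid/testing", "/covid/*", "/about"]
print(dedupe_invalidations(queue))  # ['/covid/*', '/about']
```

Run every few minutes over the queued clears, this is purely a cost-management device: CloudFront charges per invalidation path after the free tier, so collapsing covered paths directly saves money.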
Let's delete all the other clears and only send the /covid/* one, because we don't need to clear all the sub-pages separately. The queue management for that is purely about cost management. We could clear with every single save, but we'd be sending hundreds of changes a minute, whereas doing it this way, we can limit that scope.

PHP footprint. When we were looking at some performance issues, particularly around Redis and how much time Redis was spending serializing and deserializing the stuff we were giving it on a page load or a cache clear, we found that quite a lot of the footprint of PHP was around the plugin cache, and more particularly around block plugins — discoverable blocks. For me, this was the first time I'd experienced this, because I had not done a lot of Layout Builder sites, but Layout Builder has a block discovery mechanism that looks at all of your fields, all of the base fields, and all of your entities. As you can imagine, on a very large site we have a lot of entities, we have a lot of fields, and we have tech debt around fields. When we actually had a look at what was happening in hook_block_alter, we were able to prune out quite a lot of data that we just weren't using. An editor was never going to want to place the revision ID into the layout of a page, so we didn't need to make that block available to Layout Builder. So we do what we'd call fairly aggressive trimming, and that got our PHP footprint down much smaller, which helped reduce the issues we were having with Redis and pretty much allows us to run more PHP instances.

Just in terms of PHP: we set a goal — partly this goal was already set by PreviousNext, and I think it's a very good one — around keeping a really ambitious memory limit.
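In practice that's a one-line setting. The 128 MB figure comes up later in the talk; treat it as their example value rather than a universal recommendation:

```ini
; php.ini — apply the same limit in every environment, local included,
; so memory bloat is caught before it reaches production.
memory_limit = 128M
```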
Like, you can go with 256 megabytes for PHP, but you're just going to fill that space, and you won't know when people in your team are filling it. So be ambitious: set a low limit, apply it to all environments, apply it locally — obviously a local developer can override it locally, but it's a great default to start with. And remember that abstraction has consequences: when you add a module that's going to look at all of your fields and then do something in response, that can have consequences. Look out for fields combined with Layout Builder — I'm sure there are other examples, but that's the one that hit us — and use hook_block_alter to inspect that cached data. I want to reiterate on the memory limit: set the 128 MB locally, set it everywhere, because otherwise you'll find, like we did, that something lands on prod and things start to break, because the environments you tested it on all had higher memory limits. So make sure you do that.

Hey, we had a question about the 500 lines of code. Is that a custom module that handles the cache tags, or does it just do the invalidation on CloudFront? So the module — we call it the CDN module — does the clears: it sends the invalidations, but it does it based on a whole lot of rules like entity hierarchy, entity usage, URLs and things like that. It's building that list of things to clear and then putting it in a queue. Gotcha. The balance for us is that a lot of that is business-logic specific. We looked at reusing a couple of modules, and we found it was just a bit easier to implement our own CloudFront invalidation class — the reusability wasn't quite there; there were just some assumptions in the modules we looked at in contrib. And because the weighting was more on the business-rule side for us, we ended up going with custom code.
I'd like to think that in six months it would be possible to share that and maybe abstract some of it out onto Drupal.org. But the space has already got modules for this, so there's definitely stuff out there that people can use.

On to our database strategy. The main thing I want to talk about with the database is RDS Aurora. There are a lot of things you can say about databases — we've had issues with adding indexes to things and whatnot — but I want to talk about the bottlenecks we were finding with Aurora by default. This was mainly Nick Schuch and Nathan working on it as a result of DDoS testing that we did. We have a company who works with us who have approvals with AWS to spin up many services, many containers, to start hitting our site from everywhere — which is otherwise really hard to arrange — so we can actually look at what would happen in a proper DDoS situation. A big bottleneck turned out to be Aurora. We found that our writer was spiking under CPU load, and once we started getting deadlocks, it was game over for us. The reader is auto-scaling, but switching to the reader is a manual process and it's read-only. So the writer was where it was at, and it wasn't handling it. There was a quote from Acquia I found on the AWS RDS Proxy page which I think summed it up: the lack of native support for connection pooling in Drupal means we're doing a very expensive operation by making a new connection to RDS every time we wanna connect. Once Nick and Nathan implemented RDS Proxy, you can see from the graph on the left that the mountain on that graph is the CPU spiking under load, and at the point where they turned on connection pooling, it all smoothed out again.
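Why pooling helps can be shown with a toy sketch. RDS Proxy itself is an AWS-managed service; this Python class is only an illustration of the principle — reuse a handful of established connections instead of paying the connection-setup cost on every request:

```python
class ConnectionPool:
    """Toy pool: hand back an existing connection instead of opening a
    new one per request, mimicking what RDS Proxy does in front of Aurora."""

    def __init__(self):
        self.opened = 0   # count of expensive connection handshakes
        self.idle = []    # connections available for reuse

    def acquire(self):
        if self.idle:
            return self.idle.pop()   # reuse: free
        self.opened += 1             # the costly part Aurora's CPU was burning on
        return object()              # stand-in for a real DB connection

    def release(self, conn):
        self.idle.append(conn)

pool = ConnectionPool()
for _ in range(1000):        # 1000 sequential requests...
    conn = pool.acquire()
    pool.release(conn)
print(pool.opened)           # ...but only 1 real connection ever opened
```

Without the pool, those 1000 requests would mean 1000 handshakes — which is exactly the CPU load that moved off the Aurora writer and onto the proxy in their graphs.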
So with that solution we're finding RDS is handling our traffic quite well, and other bottlenecks in the app are turning up instead. But maybe I'm not very good at explaining that subject, so I'll give Nathan a chance to add anything. Yeah, it's purely about the fact that every connection to the database was expensive — it was chewing up CPU. We don't want Aurora to spend time making connections; we want it to spend time doing inserts and selects. So we offloaded that into RDS Proxy. As you can see, the CPU dropped straight down, but our number of connections shot up in the second graph — that's because RDS Proxy was sitting there, taking that connection load off the server and letting Aurora do the queries. From our initial testing — I'm gonna use two numbers, but they'll be different for everyone — we went from 2,000 queries per second to 22,000 queries per second just by changing over to the proxy. Well, that's before the server fell over.

The last section of the presentation is about making objective distinctions between different types of traffic to your site. This is where everyone's experience will differ. Because we have to handle such a diverse range of things on nsw.gov.au, this is a really important area for us to have repeatable solutions for. I generally delineate the types of traffic as: really static, cacheable stuff, which is just, say, viewing a page; searching and listing, which is really dynamic, really chatty, but still cacheable under the right conditions; and then things like form submissions, which have sensitivity attached as well as being non-cacheable — you can't cache that POST request. So generally speaking, this is our breakdown of those different types of transactions.
And the real problematic one for us — I'm not really talking about personalization here at the moment, well, I sort of am — is that the kind of traffic that's gonna go through to Drupal is gonna be that form traffic, and where possible, we're offloading that to third-party services. To give you a scenario: a regional health district wants to bring content onto nsw.gov.au, and they have an application form that support workers use to apply for some sort of rebate. Our first question is: what are they currently doing? Is there something they're already using that we can simply embed in the front end, so we don't have to add it into Drupal? It seems counter-intuitive to them, because they generally feel like they're coming onto our platform so they're gonna do things our way — and yet when they say they use MailChimp for some subscription thing, we're saying: that's great, we'll just keep using that. That's good for everyone. So we do a lot of that, where we take anything chatty and really try to push it to the front end. That also applies to our personalization strategy, which is emerging: it's about getting Drupal to serve something by default and then personalizing on top of that. So yeah: protect Drupal from chatty services. Be firm about caching. Avoid personalized content from Drupal — you can't have PHP looking at a user's lat/long and then returning related results in PHP. You can use a lat/long of Sydney, or some generic one that's reusable, because that's effectively cacheable: everyone who comes from Sydney to see results for locations of clinics in Sydney is gonna get the same content. But really be looking out for that personalized content — on a site like nsw.gov.au there just is no way we can do it.
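The "generic lat/long" idea can be sketched as bucketing coordinates so every visitor in a city produces the same — and therefore cacheable — request. This Python sketch is an illustration of the principle, not the site's actual code:

```python
def cache_bucket(lat, lon, precision=1):
    """Round coordinates so nearby users share one cacheable query key.
    One decimal place of latitude is roughly city-scale (~11 km)."""
    return (round(lat, precision), round(lon, precision))

# Two users in different Sydney suburbs produce the same key, so the
# second one is served the first one's cached response.
a = cache_bucket(-33.8688, 151.2093)
b = cache_bucket(-33.8915, 151.2005)
assert a == b
print(a)  # (-33.9, 151.2)
```

With millions of users, the difference between "one cached response per city" and "one uncacheable response per user" is the difference between the CDN absorbing the traffic and Drupal falling over.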
We can't throw resources at it to make that work. So Drupal is your default content to the browser, and you have a front-end strategy for the personalization of that content. We're offloading our search and anything chatty to NoSQL — that's our Elasticsearch strategy — and additionally making sure that Drupal's not in the critical path for it. Even though we can create a view that talks to Elasticsearch, generally speaking we don't want a monolithic PHP application sitting in the middle of that critical path, that hot path. And keep an eye out for things like facets, personalized results, and free-text searches when features are being built out.

Yeah, I'll just reiterate on moving Drupal out of the critical path, using our setup as an example. You've got a highly available CloudFront CDN sitting at one end. You've got highly available, scalable Elasticsearch sitting at the other end. Please don't put a Drupal application running on PHP instances in the middle to talk between those two, because that's where it's going to fall over. You want CloudFront going straight to Elasticsearch and back to the user, avoiding the load — even just in terms of the latency you get by passing through Drupal, you're much better off going straight to ES. In the cases where we do do it, they're only situations where the response would cache in CloudFront as well — if we do it from Drupal at the moment, it only works because that response is going to cache; otherwise it wouldn't work.

Just to summarise our NoSQL strategy: what we've decided is that Elasticsearch is really for listing content and for things like auto-completes and related news — that type of thing — and not for indexing full pages of content.
Just to clarify, we have a separate strategy around providing a full page via a GraphQL endpoint, so we're really focusing our dynamic, chatty search queries on hitting Elasticsearch for a list of things, where you want it to happen really, really fast and you want your payloads to be really small. A lot of the solutions we have to build are dynamic listings of things. An example: Roads and Maritime have a search for marine notices around waterways, and they have a CSV file. What we're looking for here is that the CSV file isn't managed in Drupal — they already manage a spreadsheet, or some other JSON endpoint, or they have an API or a CKAN dataset — and we don't want to re-create that in Drupal and put Drupal in the middle of that process, from an editing point of view and a complexity point of view. This project, Data Pipelines, is one we've sponsored, and we got great support from Lee at PreviousNext and other developers working on it — shout-out to Ken in our team, for example. So what happens is that we can define a validation in YAML, and an editor uploads a CSV file or points to an API endpoint; that gets validated, and then it gets pushed into Elasticsearch as an indexed dataset. And then the front end can just knock themselves out — it becomes a front-end problem. Any front-end developer is simply getting a JSON feed from Elasticsearch that they can manipulate in any way. I'm just gonna be quick because we're running out of time and we've still got a few more slides. The point of Data Pipelines here is, as we said, get Drupal out of the critical path — go straight from browser to Elasticsearch.
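To give a feel for the "validation in YAML" step, here is a hypothetical dataset definition. The field names and keys below are invented for illustration — consult the Data Pipelines module's own documentation for the real schema; only the shape of the idea (declare columns and validation up front, editors supply the data, output lands in Elasticsearch) comes from the talk:

```yaml
# Hypothetical sketch only — not the module's actual schema.
id: marine_notices
label: 'Marine notices'
destination: elasticsearch
fields:
  waterway:
    type: string
    required: true
  notice_date:
    type: date
    format: 'Y-m-d'
```

Because the definition lives in code and the data lives in Elasticsearch, Drupal validates on the way in but sits nowhere in the read path.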
We don't want to rely on the other party's API, because occasionally their API goes down, and more often than not, if we send two million people to an API on a certain date, that API will not cope with the traffic. But also, from a security perspective, we want Data Pipelines to sit in the middle so that we can remove cross-site scripting issues and things, and have a consistent way for our React apps to connect to and use the data — so all the developers know that if I build a React app and connect to our ES, I can use this data pipeline to do A, B and C, and there aren't 16 different APIs they have to deal with.

I think this is the final slide. This should be a no-brainer based on what I've already said: you're serving a page from Drupal, and now we have a strategy around dynamic components — building out React widgets that can, say, talk to Elasticsearch to fill in that part of the page. Yeah, and we do have a small strategy around this: when we render a page, we render it with default content, as I said earlier. We even render the React components in some cases as a basic display, and then we progressively enhance them with React and JavaScript. The point of that is that if there's an IE bug — not pointing anything out specifically — that means the JavaScript won't load, the general public will still get content and can still interact with the site. It's not completely broken, as such.

So that's all the slides I've got for you. Thank you for your time, and I'll hand over to Scott to see if there are any questions. Yeah, please put any questions you have in the live Q&A or in the forum and I'll pick them up and ask. That was awesome, guys.
Maybe I was just curious — you sort of alluded to this, but the infrastructure is the primary thing you wanna keep track of and keep an eye on. How are you guys handling observability for the lesser-known culprits, like the APIs you mentioned that may not be down but may just be slow enough that they appear down to the user, or DNS, or those unusual suspects? So from that point of view, across our infrastructure we've put New Relic in a whole lot of places. It's pretty common for me to jump into New Relic Browser and ask: have we introduced any JavaScript issues? Are there any pages responding slowly? And then we can follow up with those APIs. There have been a few times, to be honest, where the endpoint owner has said, yeah, it'll be fine — and then once we put it live, it's not as fine as they expected. So then we'll have to come through and route it back to using Data Pipelines, or store and cache it in different ways. But we just use New Relic for live monitoring, then checking and updating.

All right. So Lee asks: do you have any statistics on how often content editors use the manual purge form? We do, but I don't know them. We put in logging on when things are cleared, so we can see if something's being cleared once a day, pull that out and go: actually, that's probably a problem, let's fix it. I haven't looked at the logging, so I don't know off the top of my head. Yeah, I'd have to look at the logging, but I do spend a lot of my time — because I'm in the team where a lot of the training and a lot of the editing happens — talking to the key power users. They're usually the ones alerting me that they're having to clear a page, and then I can route that into tickets. So it's working pretty well, but mainly it's me talking to them about an issue, and then I'll go through and clear that page for them.
So I'd have to look up the exact statistics, but so far it's going quite well, because they have an organic, just-in-time way of doing it if they need to. Yeah, the just-in-time form was put in place purely for that. We know we want to automate everything — we don't want users going in and clearing CDN URLs — but we also know that we're not perfect and our algorithm may not catch an edge case. That's what that's for. Time for one more quick one. I think we're good. All right. Well, if there's no more questions, guys, we appreciate the session — that was awesome. I think you'll make the slides and stuff available. And yeah, thanks a lot. And to everyone who came to visit, enjoy the rest of the day. Thanks everyone.