So our presenter coming up now is Nick Santamaria. We agreed to introduce him as an all-round good guy, and I'll let him take it from here.

Awesome, thanks for that glowing endorsement. Welcome, everyone. My name is Nick, I'm a DevOps engineer at the Victorian Department of Premier and Cabinet, and I'm working on the Single Digital Presence platform, aka SDP. At SDP we're hosting 23 production Drupal sites, 35 Node.js applications, and anywhere from three to five environments for each project. We've reached this scale in a very short period of time: our first SDP sites launched about two years ago, and now we're at around the 60 mark. That huge growth has presented a lot of challenges for maintenance, operations and governance. If you're here in this talk, I'm assuming you're in a similar position: a developer or sysops person maintaining maybe 10 projects or more, and hitting a lot of issues and frustrations around efficiency.

There's a lot I want to cover, but before I do I just want to briefly frame the problems and frame the discussion. Every 15-minute job you have to do on a single Drupal site becomes an entire day at SDP with 23 sites. In a team of 20 people, you need to be able to distribute knowledge efficiently, and that means if you build a snowflake there's a significant additional burden in spreading that information across everyone on your team. And the poor ops person who gets pinged at 3am now has additional cognitive load to figure out what's going on, when they already haven't had enough sleep.

The cruft that builds up on older projects is like an anchor that slows your ship down to a crawl. This technical debt comes in the form of modules that haven't been updated, old Drupal patterns that haven't been modernized, and ancillary stuff: maybe your CI config hasn't been kept up to date, your local dev stacks have fallen by the wayside, your automated tests haven't been maintained. The standard tools in the Drupal ecosystem work pretty well for one site, but coming back to that idea that one 15-minute job becomes a day, the same thing happens with looking for logs or looking at metrics. If you're working from a one-site-at-a-time view, that becomes very inefficient.

So let's get down to some solutions. Automated patching is the number one, most bang-for-buck thing you can implement to improve your developers' efficiency and stop wasting time. I've seen agencies cut their fortnightly patching workload down by 75%. Where they would spend maybe half a day every two weeks running composer updates, spinning up a PR environment and testing it, they now come into work, there's a PR environment and a pull request ready to review, they do a quick smoke test, and boom, it's done. It means their critical security releases go out in a timely manner, and their developers are way happier not having to do so much monotonous grunt work.

To make this work, you need to adopt a few key elements. You need to use loose constraints, for example a caret constraint like "drupal/core: ^9" rather than pinning an exact version, so that you don't get stuck on old, vulnerable, unsupported versions and then have to make a really big jump at some point in the future.
If you're used to pinning versions, or tightly controlling exactly what goes out, automated patching isn't going to work very well for you. You can also minimize the risk of these loose constraints by updating and deploying frequently. That ensures the delta, the difference between the last deployed release and the next one, is as small as possible, which makes tracking down any regressions very fast. I'd suggest following the Drupal core release window, which is every fortnight on a Thursday morning in Australia.

Automated tests: got to have them. You don't want to be manually testing every feature for every patch. Automate that stuff, see a green build, and be good to go. That said, there are often business stakeholders who want to be able to manually test that things are still okay, and pull request environments are absolutely invaluable for that.

Now let's talk about different ways you can do this. I've seen two approaches; we'll call them centralized and decentralized. For decentralized I'm not going to delve too much into the details, because Kim Pepper did an amazing lightning talk at the Drupal meetups about this exact topic and PreviousNext's system for it. But the guts of it is that you've got a CircleCI build that runs on a schedule and does your composer update, drush updatedb and config export, then pushes up a pull request, and that pull request fires off their usual automated test suite. It's a great system that I think you should all go and check out; the YouTube video is linked on the slide.

The other approach, which we're using at SDP, is a centralized automated update system, and we're using Ansible to orchestrate it. We have an inventory file which gives us a full manifest of all of our projects, with parameters that define any special behaviour or placeholders that need to be replaced. Ansible then runs more or less those same tasks, composer update and config export, pushes up a pull request, and CircleCI runs our tests. There's a rough sketch of what that can look like below.

There are a few trade-offs with a centralized system versus a decentralized one. With centralized, we have one upgrade script and one project for it, so we write it once, maintain it in one spot, and it runs everywhere. It also lets us centrally manage our ancillary tooling, so Dockerfiles, local dev stack, Lagoon config, CircleCI build: those things are all really tightly managed, and for us that's key, because most of our sites are more or less the same codebase, just managed in different repos. And because it runs as a single Ansible script, you get a nice report of all the successes and failures, so it's really easy to spot when something hasn't worked. A scheduled CircleCI build, on the other hand, doesn't report anywhere when it fails unless you set that up specifically, so that can be a little tricky to track down.

The downsides: while you're getting it up and running, which is where we're still at, it lends itself to a manual process, where someone runs the Ansible script, it takes quite a long time, and they're off doing something else while it runs. Not the most efficient system, but we do hope to get it into a pipeline soon. And because there's now one system orchestrating all of this activity, it's a more complex system, and more complex systems are more prone to breaking.
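To make the centralized approach concrete, here's a minimal sketch of what such an inventory and update play could look like. To be clear, this is not SDP's actual code: the repo URLs, group names, flags and branch names are all hypothetical, and the drush steps assume each project has a working local environment to run against.

```yaml
# inventory.yml: the project manifest, one entry per site, with
# per-project parameters for any special behaviour (all hypothetical)
all:
  children:
    drupal_sites:
      hosts:
        site-alpha:
          repo: git@github.com:example/site-alpha.git
        site-bravo:
          repo: git@github.com:example/site-bravo.git
          composer_flags: "--with-all-dependencies"
```

```yaml
# update.yml: run with `ansible-playbook -i inventory.yml update.yml`
- hosts: drupal_sites
  connection: local   # each "host" is just a project entry; work happens on the runner
  tasks:
    - name: Check out the project
      ansible.builtin.git:
        repo: "{{ repo }}"
        dest: "/tmp/{{ inventory_hostname }}"

    - name: Update dependencies within the loose composer constraints
      ansible.builtin.command:
        cmd: "composer update {{ composer_flags | default('') }}"
        chdir: "/tmp/{{ inventory_hostname }}"

    - name: Run database updates, export config, and push a PR branch
      ansible.builtin.shell: |
        drush updatedb -y
        drush config:export -y
        git checkout -b automated-updates
        git commit -am "Automated dependency updates"
        git push --force -u origin automated-updates
      args:
        chdir: "/tmp/{{ inventory_hostname }}"
```

From there, CI on each repository picks up the pushed branch and runs its usual test suite, and the play's recap gives you the per-site success and failure report mentioned above.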
Cool, so next we're going to move on to another topic, which I've called configuration as code. You're probably all doing this to some degree; Drupal's config management system is quite nice, and it's great because all of your config changes go through the same code review and Git workflow as your code does. However, these days a lot of hosting platforms give you the option to configure environment variables per project or per environment. That's great, because you don't have to make Config Split do this kind of work; you can just read an environment variable in settings.php and it works regardless of where it's running. The problem is that it breaks the pattern of storing your configuration in code.

At SDP we've recently been trying to solve this problem. The issue we had was that application config was stored in a variety of systems, and trying to figure out where a value came from, or why something was being overridden, was a nightmare. It also made platform-wide changes pretty much impossible: there was no consistency in where things were set, the manual process was slow and error-prone, and there was no simple way to verify that a value was set correctly.

We recently built a new in-house system to address a lot of these challenges, which we've called the configuration management database, or CMDB. In this system, all of our environment variables are stored in a YAML manifest in the CMDB project, and we have an Ansible play that synchronizes those YAML values with the Lagoon API, so Lagoon sets them all in the running containers. The sync runs when a pull request to the CMDB is merged, and it also runs on a nightly schedule, just to ensure that our desired state is the reality and we don't have configuration drift going on. There's a simplified sketch of the idea below.

This has given us a huge number of benefits. We now know exactly where values are coming from, and auditing is simple. We can make a sweeping platform change with the click of a merge button on a pull request. Because the desired state is being actively enforced, we don't have to worry about configuration drift, things getting added and forgotten about and nobody knowing whether they're still needed. And because we're using Git, we get all the benefits of that: a full history and audit trail for every change that was made.

There are a couple of downsides to this system. We occasionally have an awkward deployment, where a deploy relies on a variable being set or having a particular value and we can't change it in advance, but we deal with those as they come up and it's been fine so far. There are a couple of technical limitations in YAML and in Lagoon itself. And it's quite a complex system, so it's now another system we have to maintain, but at least that sits in the DevOps space rather than with our developers, who can be providing value to the business instead of dealing with stuff they don't need to.
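The talk doesn't show the manifest format, so here's a hedged guess at the shape such a system could take. Every name and key below is invented for illustration, and the real Lagoon API call is elided; the actual CMDB format will differ.

```yaml
# cmdb/variables.yml: hypothetical manifest, one block per project
projects:
  site-alpha:
    global:                        # applies to every environment
      SMTP_HOST: smtp.example.com
      FEATURE_SEARCH: "true"
    environments:
      production:
        CACHE_TTL: "3600"
      develop:
        CACHE_TTL: "0"             # no caching on dev environments
```

```yaml
# sync.yml: sketch of the nightly enforcement loop; a debug task
# stands in for the real Lagoon API call
- hosts: localhost
  vars_files:
    - cmdb/variables.yml
  tasks:
    - name: Push each global variable for site-alpha to the platform API
      ansible.builtin.debug:
        msg: "would set {{ item.key }}={{ item.value }} on site-alpha"
      loop: "{{ projects['site-alpha'].global | dict2items }}"
```

Because the play runs on merge and again nightly, whatever is in Git wins: anything set by hand on the platform gets stomped back to the manifest's value, which is exactly the drift protection described above.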
There are also a bunch of other ways you can achieve a similar goal. I love Terraform; it's a declarative language, and you could use it to create environment variables in Kubernetes, AWS Parameter Store or CircleCI. You could also use a GitOps model with Helm, or even just plain YAML manifests, and do a kubectl apply when you deploy. My main recommendation is just to avoid a hierarchical system where stuff is inherited and overridden; if something is needlessly complex, it's going to be a nightmare to maintain.

All right, so monitoring is another big problem when running at scale. With a handful of projects you can overlook inefficient processes like shelling onto a server to tail logs or run top; we've all done that. But as the remit of your responsibilities grows, you can't afford to be wasting time when you need to debug something and you need that vital information. You can't have it fragmented across dozens of systems, and your team needs to be able to search it easily to find patterns and gain insights into how your applications are operating and how your users are behaving.

These are the things I think you need to do when monitoring a fleet of applications. You must push logs to a central log aggregator. There are heaps of managed services out there to do this: Papertrail, Sumo Logic, Elastic Cloud, more than I can name. You need to give your team every opportunity to find patterns and problems, so dashboards, dashboards, dashboards. If you've got a metric and you can put it somewhere it might correlate with some other piece of information, that's a chance for your team to find a problem. Putting metrics and logs on the same page can lead to some really interesting discoveries: CPU usage spikes, you hover over that, and it shows you all the logs from that time. Oh, there was some Drush command that ran, or there was a heap of requests to this one particular URL.

You cannot afford false positive alerts when you're running a fleet of sites. If your Slack channel is full of unnecessary noise, or your PagerDuty is going off for false positives, you lose the ability to respond quickly to real problems, because you start ignoring things, or assuming things aren't urgent when they actually are.

And finally, monitoring can be expensive. Data retention and bandwidth can really add up, especially when you're running dozens of sites. Things to think about here: some information is important to have in real time and other stuff is not, so find the right storage for each use case. Use a tool like Fluentd to sort your logs into high priority and low priority, and filter out the unnecessary stuff that doesn't really matter whether you store it or not. And use your cloud provider's budgeting tools to set cost thresholds and alert when you're getting close, because I've seen someone enable a cool monitoring feature in AWS and all of a sudden it's costing $500 a month.

So, a few tools I'd recommend you look into. CloudWatch on AWS is a great all-round tool for logs, metrics, dashboards and alerts. It has anomaly detection powered by machine learning, a powerful log searching interface, and it's pretty affordable because you pay as you go. A sketch of the kind of alarm tuning that keeps the noise down follows below.
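To illustrate the false-positive point, here's a minimal CloudFormation-style sketch of a CloudWatch alarm tuned to fire only when three out of five consecutive periods breach, which dampens one-off spikes. The metric, names and thresholds are invented, and the metric dimensions are omitted for brevity.

```yaml
# alarm.yml: hypothetical CloudFormation sketch of a low-noise alarm
Resources:
  AppErrorRateAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: site-alpha-5xx-errors
      Namespace: AWS/ApplicationELB
      MetricName: HTTPCode_Target_5XX_Count   # dimensions omitted for brevity
      Statistic: Sum
      Period: 300                    # evaluate in 5-minute buckets
      EvaluationPeriods: 5
      DatapointsToAlarm: 3           # 3 of 5 buckets must breach before paging
      Threshold: 50
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching # quiet periods never page anyone
      AlarmActions:
        - !Ref AlertTopic            # hypothetical SNS topic wired to PagerDuty

  AlertTopic:
    Type: AWS::SNS::Topic
```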
Datadog centralizes information from more sources than you can count; if you're putting data somewhere, you can probably pull it into Datadog. It is a little expensive, though, so you're paying for that power. I'm sure you've all heard of New Relic; it still rates a mention because, in terms of application performance monitoring, I don't think it gets beaten by many. And if you want to roll your own stack, Grafana, Prometheus and Loki are the ones to look at; the level of customization you can do there is insane. It's a Drupal conference, so I'll also name-drop the Prometheus Exporter module, because that is pretty neat.

This slide shows a CloudWatch dashboard of the kind you should aspire to: a whole bunch of metrics, errors, exceptions, all at your developers' fingertips. They go to one spot and that's everything they need. At SDP we're currently modernizing our observability stack, so this will be a one-stop shop for our developers, and even our content team, to come in and see what people are doing, and again connect dots we might not otherwise have been able to see.

All right, how are we going for time? Oh, one minute. Okay. So I'm just going to quickly talk about access control, which is another issue at scale. Admin accounts: if you have 50 sites, how are your administrators logging on when they need to debug a problem? You need a federated login system, so LDAP, SAML or OAuth2 can do this. That gives you your two-factor authentication, and it means that when someone leaves, you don't have to go and remove their account from 50 sites. What else have we got? Oh yeah, for your hosting platform: again, federated logins if you can, RBAC to lock down platform-level things like deployments and SSH access, and pull your audit logs from your hosting platform. And I think this is the final one: HashiCorp Boundary. I'm very excited about this project because it's basically a better bastion host, but it does more than just SSH. Check out that project if you're interested.

Yeah, I think that's all I've got for the moment, and I think we're just about on time. Cool, we are into the extension time, but I'll see if we've got any questions. Yeah, there were a couple. There's a little live Q&A; if you can't see the questions, I can read them out. Yeah, can you read them out? Because I can't see them.

First one came in from Carl: how do you handle secrets or tokens to prevent the plain-text values from being committed to Git? Yeah, great question. So we're using AWS KMS to encrypt and decrypt the values that are in the YAML, via an Ansible filter. Basically, we commit the encrypted value, and when Ansible runs, it passes that encrypted value through the KMS filter. That means all we need to do is give CircleCI, or whoever's running Ansible, AWS credentials that have access to do that. There's a small sketch of that pattern below.
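Here's a sketch of the pattern described in that answer. The kms_decrypt filter is a hypothetical stand-in for whatever filter plugin SDP actually uses, and the ciphertext is shortened; the point is that only ciphertext ever lands in Git.

```yaml
# cmdb/variables.yml: secrets are committed as KMS ciphertext, never plain text
projects:
  site-alpha:
    global:
      SMTP_HOST: smtp.example.com        # non-secret, committed in the clear
      API_KEY: "AQICAHhw...=="           # KMS ciphertext (shortened), safe to commit
```

```yaml
# In the sync play the value is decrypted in flight. kms_decrypt is a
# hypothetical custom filter plugin; the runner only needs IAM permission
# to call kms:Decrypt on the relevant key.
- name: Decrypt the secret before pushing it to the platform API
  ansible.builtin.set_fact:
    api_key_plain: "{{ projects['site-alpha'].global.API_KEY | kms_decrypt }}"
  no_log: true   # keep the decrypted value out of Ansible's output
```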
Next one off the list: are you storing data as well as code and config within your change-management storage, anonymized or otherwise, data and/or schema? So, in the configuration management database, are we storing any data? Right, storing data in a version control system: we're not doing that. I've seen some other organizations that have some sort of site-install system where their content is stored as code, but we're not doing that, and I don't have any advice around it.

Cool, and that's the questions, so on that note we'll wrap it up. That's it, thanks. Beautiful, thanks for joining, everyone really enjoyed it.