Hello everyone. Thanks for coming to this talk. I'm Ben Chad, one of the Senior Technical Account Managers at Acquia, and I'm here to talk to you about the experience we had earlier in the year doing enterprise hosting and consultancy for the Commonwealth Games. The title of my talk is Success at Scale: Preparing Drupal 8 for Large Events, and my plan for today is really just to share some stories about getting Drupal 8 ready for large-scale hosting: things we did that worked well, and some things we might do differently in the future. Please do feel free to interrupt with questions at any time; I'm very happy to make this a conversation. I'm just mindful that this is a topic that's really dear to my heart, and I've spoken about it so much internally at Acquia that I might sometimes forget I'm talking to people who aren't familiar with all the details. So thanks for having me at this DrupalCamp. I'm really excited to be here; it's always good to get an excuse to come and visit Singapore.

I've already introduced myself and my role, but just to give you a bit of background on me: I've been working with Drupal since 2010. If you think back to the Drupal timeline at that point, that had me dabbling in Drupal 6, and the site I worked on was a lot of fun. It was getting promoted by Stephen Fry all the time. I don't know if you're aware of him, but my first experience with Drupal was high profile. He's a comedian in the UK, and back in 2010 he had something like 3 million Twitter followers. So whenever he would tweet about this particular site I was working on, you'd get a stampede; his media team would tell you to count on something like 650,000 Drupal hits in an hour if he tweeted about your site. That was always a bit of fun. We made every mistake in the book with that site, so it was nice to have a chance to fail quickly and have a do-over with future iterations of what we were doing. Drupal 7 is when I first got my hands properly dirty working with Drupal, and I've never looked back. I joined Acquia as a member of the UK team in 2015, I've been working as a technical account manager ever since, and I'll tell you a little more about that on the next slide. I also had the good luck to relocate with the company back home to Sydney in 2017.

So, as I said before, I'm here to chat about the experience we had hosting the 2018 Commonwealth Games. Now, I don't want to talk about me and Acquia too much, but just so you know the viewpoint I'm coming from, I'll tell you a little about my job role, because that reflects the support we were trying to give the Commonwealth Games. What we contributed was enterprise hosting and strategic Drupal consultancy. At Acquia we don't do site builds (or we can, but we generally have customers bring their own sites), and we partner with them on the delivering-at-scale side of things. With the Commonwealth Games, the engagement was primarily driven out of my department, and I actually had the good luck to have a bit of a team working with me on it. If you're not familiar with what a technical account manager does, we're effectively business-focused Drupal consultants and we're motivated by customer success: we're just trying to do everything possible to help the customer achieve their digital goals.
We're there to maximise the value of Acquia products and services, act as a bit of a gatekeeper, and ensure the customer is using the product in the right way. The key headline I give in the very first meeting or two with a customer is something like: I want to support, guide and protect the investment you're making in your digital platform.

So the Commonwealth Games Organising Committee, known as GOLDOC, built and operated the site. They did the initial build with Sapient Razorfish and then took it in-house and managed it themselves, but TAMs were there to act as a voice of reason, and we spent 28 days with the customer around the event. Now, that's going to sound excessive, but when I start showing you the traffic levels they had to deal with, it ended up being about right.

I would think that in this part of the world most of you are familiar with what the Commonwealth Games is. Whenever I was talking to people at headquarters, Acquia being a US company, I had to go right back to the start and say things like: the Commonwealth Games is like the Queen's Olympics. It has 70 teams made up from 52 member states, so you get situations in the Commonwealth Games where the United Kingdom splits out and competes as four different teams, and so on. Six and a half thousand athletes and officials. But where things really start getting interesting to me is when you think about what the global coverage is like, and what that translates to by the time you get to projected page views, infrastructure, and the impact of that. As a strategic consultant, as soon as I see things like a 1.5 billion global audience I'm starting to get a bit sweaty, and I'm telling my wife, oh, I've just been handed this new assignment at work and it's going to be highly visible and a bit tricky. When you hear that you're going to have 100 million page views during Games time, 100 million page views over what's roughly a 10 to 11 day period, that's quite a lot to handle. I can remember, from my pure developer days, that handing my manager a unit of work, a piece of functionality that worked and did the job, was a very different problem from handing over a unit of work that had been load tested and proven to hold up. So we knew we had a lot of due diligence to do to get the customer ready for this event.

So, the success story. Here's the bit that I love, and it was great to be able to say this with a sigh of relief after the event. In an 11 day period, what we were able to deliver out of Drupal, a results site built in Python, and Acquia hosting was 3.9 billion requests from 8.4 million unique visitors at the Acquia-managed edge. We exceeded the projections on the Drupal traffic we'd sized capacity for: we served out 110 million plus Drupal page views, and the only way we were able to achieve that was by being really aggressive with our caching layers. When I think about the story I'm going to tell over the next couple of slides, it's mainly about caching and how we made caching fly. Operationally it went off really well. We didn't have too many glitches at the time, there was zero downtime, and I've got some of the Acquia support team sitting here with me today: they didn't see this customer in a critical capacity during the event at all, there were no tickets for that whatsoever. It was really good to be able to say at the end that there was one very happy customer.
They felt validated in their choice of Drupal, and to us that's a great story; that's what we're trying to achieve.

So I won't show you the full video here, but a colleague of mine put together this cute little visualisation of what the traffic looked like. It's a bit like playing Tetris. (The music's coming through anyway, I can't stop it.) On the right-hand side there are script names, each of those balls coming across represents a request to one of them, and the Tetris-style paddle represents the web server, or the CDN, knocking a request off. Now, Sean must have started putting this data together at the start of the day, because it looks quite tame at the moment when you think about it that way. If I fast forward a couple of minutes it starts getting a lot hairier, and if you were playing that Tetris paddle you'd be moving really, really hard to serve all those requests. And then if I jump forward to the right place, a little bit later in the morning, it just takes off; it's on fire. We were keeping an eye on concurrent requests as they came in on custom dashboards we had set up, and around critical events in the Commonwealth Games, when there was high media interest, that's what it really looked like. If you're not prepared for that, you don't achieve uptime at that level; you just fall over.

So if I start thinking about the path to success, this is not meant to be an Acquia infomercial at all, please don't think that it is. I'm just listing out what they used and why. We did make a decision to use best-of-breed software and infrastructure, and obviously, because of where I come from, that influenced what was used. For Drupal development they started off with Acquia BLT and Acquia Lightning. Can I just get a quick show of hands, who is familiar with BLT and has downloaded and used it? Yeah, not many. It's worth checking out next time you start on a Drupal project. It's short for Build and Launch Tool, and it's just a good framework, especially when you're working in teams, for setting up dev consistency and enforcing practices around what your delivery pipeline should look like. So if you've got that recalcitrant developer in your team who is always refusing to write proper commit messages, BLT will enforce that you can't do that. It does other things like automatic code linting and checking that you're pushing code up through the stages of the delivery pipeline properly, which is good. They used Lightning as the base Drupal distribution. We find that's really good in media-heavy situations, and we have dev teams reporting back to Acquia that it saves them about 20% of the time on a project if they use Lightning as the base distribution.

In terms of CDN, we're big fans of Cloudflare at Acquia. It's fantastic. I've never worked with a piece of a hosting stack that is so nimble and lets you roll changes out so quickly. If you ever have a chance to go and play with Cloudflare's free tier, I highly, highly recommend it. If I had to name one thing which was make or break in the success of this delivery, it was Cloudflare, without a shadow of a doubt. We could not have achieved that level of scale without them. They used Acquia Cloud continuous delivery for the dev pipeline, so if you haven't kicked the tyres on Acquia Cloud in the last 18 months or so, we can now do on-demand environments, effectively an environment per pull request, and that was a big win for them.
Then of course Acquia Cloud Enterprise for highly available and scalable Drupal 8 hosting. One core value that we took into the engagement was being pragmatic over textbook perfect. If you think about it, a site like the Commonwealth Games only has to work for a two-week period. We're not building a digital platform that has to last five years to get a return on investment; it's just there, and it has to operate in a keep-alive mentality for two weeks. Now, that doesn't mean cut corners, but it does mean that as a developer you don't have to invest in the future so much. If you've got a design decision you can take as a quick win, and you know it only has to work for a short period of time, fantastic, do it. And obviously, when you're up in the range of roughly 10 million Drupal bootstraps a day: if it's cacheable, cache it. Don't make your infrastructure do that work again if you don't have to.

So what did caching look like for Gold Coast 2018? Well, to be honest, caching is easy. Caching is really easy, because as soon as Drupal generates a page view you run it through Varnish on the way out, and Varnish will just suck it up and happily say, yes, I'll keep hold of that forever and I will serve it out indefinitely for you. So what's the actual hard part? Why don't we do it all the time? Purging. Purging is really hard. Now, to make life even tougher, think about all the different caching layers you have in place. You've got the opcode cache and APCu and memcache and Varnish and the Drupal cache and CDNs and all this sort of ugly stuff. Where is markup sitting? Markup is potentially sitting in up to three places: you've got your Drupal cache, and then, the more important ones from an infrastructure point of view, you've got Varnish and then your CDN.

So certainly we had two caching layers upstream of Drupal. Acquia Cloud always has two Varnish nodes sitting out in front of it. You can hack things up so that they're active-active if you really want to; the Commonwealth Games didn't feel the need to do that, so we had them in an active-passive configuration, all traffic running through one Varnish instance at a time, and we would only use the second one in a failover scenario. But the job of Cloudflare as a CDN was to take a Ferrari and make it fly. We just couldn't have achieved the concurrency needed for that level of hosting without a CDN sitting out in front to act as an amplifier and get this content out globally.

So, all right: I've got highly available Drupal, I've got Varnish in front of it, I've got a global content delivery network sitting out in front of that. How do I react and purge when editorial make content changes? Well, thankfully, Drupal 8 is the first version of Drupal that is awesome at purging, and that ended up being enough; using the power of Drupal 8 was good enough in our case. In Drupal 8 and Varnish they had content TTLs of 30 days, which was great. When you remember that the Commonwealth Games was a two-week event, that basically meant that once a page had been generated and was sitting in Varnish, barring something that for some reason triggered a cache clear, it never had to be generated again, which was great. They obviously had cache tags, and that helps with, in fact almost completely solves, the question of how we get content out of Drupal's page cache and out of Varnish when we need to. Unfortunately, although Cloudflare theoretically say that they support cache tags, they don't do a great job of it.
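Just to put a number on that 30-day TTL for a second: it's really just Drupal 8's page cache maximum age, which flows through to the Cache-Control max-age header that Varnish and the CDN respect. Here's a minimal sketch of setting it programmatically; in practice you'd normally manage this through exported configuration or drush rather than code like this, but the config object and key are Drupal core's own.

```php
<?php

// Minimal sketch: push Drupal 8's page cache max-age up to 30 days so that
// Varnish and the CDN receive a long Cache-Control: max-age value.
// The config object (system.performance) and key (cache.page.max_age) are
// Drupal core's; running this from an update hook or `drush php-eval`
// is just one way to apply it.
$thirty_days = 30 * 24 * 60 * 60; // 2,592,000 seconds.

\Drupal::configFactory()
  ->getEditable('system.performance')
  ->set('cache.page.max_age', $thirty_days)
  ->save();
```

With cache tags doing the invalidation work in Drupal and Varnish, a long max-age like that is just a backstop. Anyway, back to the cache tag problem.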
I don't know how many of you have actually looked at the tags that come out of Drupal 8, but the list of tags can be immense. Especially if you've got very complex entities on a page, you've got a really long list there. Now, Cloudflare cannot support cache tag headers any longer than 255 bytes. Acquia only discovered this in anger while we were hosting the Australian Open earlier in the year, and what I did with the Commonwealth Games was very much a replica of that successful Australian Open tennis tournament engagement. When you take the 255-byte payload that Cloudflare can support, and you've hashed your tags down to 4-byte tags and space-delimited them, you've only got room for about 50 tags. In the Drupal 8 world it's very, very easy to get a page that will spit out more than 50 tags at you. So for all intents and purposes, that meant Cloudflare and tag purging weren't compatible.

The solution was quite simple again: we just set 60-second content TTLs at Cloudflare. For our purposes we didn't have to do anything more than that, because it meant that when an editorial change was made, we only had to wait a minute for it to work its way up through the system. If editorial really minded about that 60-second delay, we could always manually purge the page for them. But we weren't doing any live results with Drupal or anything like that; it was horses for courses, and Drupal had been ruled out as the tool for results. So that ended up being sufficient, rather than building a complex sequential purging mechanism on top.

Now you might think, how does that work? How do we still survive on TTLs at the edge that are only one minute long? It still got us the magical nirvana of a 99.9% cache hit rate, and we were ecstatic when we realised we were hitting that number at an operational level. A rule of thumb that the TAMs at Acquia tend to throw about is that a 99% cache hit rate is achievable: as long as you put some method in and are mindful about what you're doing, you can get there. But add the extra nine, and 99.9 is hard. So all we did was the achievable thing at two layers and put them together: two layers each contributing a 99% cache hit rate combine to give you that 99.9-plus rate. To prove that, I actually had to go back to my previous career before Drupal and Acquia. I come from a mathematics background, and I got out the old first-year probability rules. If you do the maths and look at complementary events, the probability of a miss all the way through is the product of the misses at each layer: 0.01 times 0.01 is 0.0001, a 0.01% chance of a request falling through to the origin, which comfortably clears the 99.9% target. Which was great.

So we knew what our caching solution was going to look like, but we still needed to do a regular DevOps dance. What do I mean by that? Well, every day we were looking to see what URLs Cloudflare was pulling from our origin. Why did we want to look at that? Outlier counts represented infrastructure risk. If we knew that for some reason the /news page was hitting our Drupal back end ten times a second, something had gone wrong in the caching layers, something needed to be reset, and we needed to work out why that page wasn't being cached.
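The raw material for that check was Cloudflare's per-request logs, which arrive as big JSON objects rather than tidy one-line entries (more on that shortly). A rough sketch of the kind of tally we were running each day is below; it assumes newline-delimited JSON on stdin and Cloudflare Logpull-style field names (ClientRequestURI, CacheCacheStatus), and it is illustrative rather than the actual tooling we used.

```php
<?php

// Rough sketch, not the real tool: read newline-delimited JSON request logs
// from stdin and count, per URI, how many requests were not served from the
// CDN cache, i.e. were pulled back from the Drupal origin. Field names
// follow Cloudflare's Logpull conventions; adjust to whatever your log
// export actually emits.
$origin_pulls = [];

while (($line = fgets(STDIN)) !== false) {
  $entry = json_decode($line, true);
  if (!is_array($entry)) {
    continue; // Skip anything that isn't valid JSON.
  }
  $status = $entry['CacheCacheStatus'] ?? 'unknown';
  if ($status !== 'hit') {
    $uri = $entry['ClientRequestURI'] ?? '(unknown)';
    $origin_pulls[$uri] = ($origin_pulls[$uri] ?? 0) + 1;
  }
}

// Worst offenders first: anything near the top of this list that should
// have been cacheable became a conversation with the developers that day.
arsort($origin_pulls);
foreach (array_slice($origin_pulls, 0, 20, true) as $uri => $count) {
  printf("%6d  %s\n", $count, $uri);
}
```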
At the start of the event we were going through this cycle up to two or three times a day: the TAMs would come up with the list of the worst offenders, we'd analyse why they weren't being cached, and then we'd pass it back to the Commonwealth Games developers and they would help us sort it out. It was usually files being put in funny locations that needed proper caching headers. The original developers, and I'm not quite sure how to word this politely, didn't always have great code hygiene, so they were putting assets in weird places. And when you think about the .htaccess directives that tell Varnish when to cache certain things, those directives are going to be based on file type, or on location in the docroot, or that sort of thing, so we kept having to adjust those. We also found a lot of instances of endpoints being accessed by POST requests instead of GET. If you've ever done any high-performance Drupal before, you'll know that's one of the easiest wins you can make: POST requests generally aren't cacheable by upstream proxies, so if you've got a view that uses a lot of POST requests, say for pagination, replace them with GET and the impact on your infrastructure reduces quite a bit. We identified a self-denial-of-service before it truly ramped up, and I'll talk about that in a couple of slides. And we also wrote a tool called Cloudflare CLI, which I'm pretty sure one of my colleagues, Sean Hamlin, who may be known to some of you here, has made available as an open source project; if he hasn't done that on GitHub by now, I know he's planning to really soon. That was a game changer for us. If you've ever worked with Cloudflare logs, they aren't the standard one-line-means-one-message thing we're used to in the DevOps world; they return massive JSON blocks per request. So the first part of Sean's tool was effectively a filter and formatter which extracted the important payload from the JSON and turned it into syslog format, and then he did a whole stack of fantastic analysis on top that let us identify what those worst offenders were.

So, cache tags. I fell in love with them during this engagement. They're awesome, but they're not perfect. They've almost gone too far in the other direction from what Drupal 7 was. Drupal 7 was very naive about what sequential or dependent purging meant; Drupal 8 can sometimes be a bit too aggressive about it. If you've purged a tag that is listed everywhere, you are going to clear pretty much all of your caches and get cache stampede effects going on. The way this presented for the Commonwealth Games was on the front page and on landing pages: they had lists of nodes that people would go to, top news stories and things like that, and those were maintained by entity queues. As soon as a content manager went in and did a bit of drag-and-dropping on their entity queues, that would start sending purge requests through. Drupal would see that this entity queue was being updated and would dutifully identify that everything with the node_list tag should be purged. Well, on an informational site, virtually everything had the node_list tag. So it didn't take very much drag-and-drop activity at all for us to pretty much lose the entire site out of warm cache, just from a bit of reordering of news stories. So the pragmatic solution was to blacklist the node_list tag; there's a sketch of what that can look like just below.
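As a hedged sketch only: this assumes the contrib Purge module stack, whose core-tags queuer keeps a configurable blacklist of cache tag prefixes that never generate external purges. It's one way to get the effect described, not necessarily the exact implementation the Games used.

```php
<?php

// Hedged sketch: stop 'node_list' invalidations from flooding the purge
// queue, so that entity queue reordering no longer flushes the whole site
// out of Varnish and the CDN. Assumes the contrib purge and
// purge_queuer_coretags modules, whose settings hold a blacklist of cache
// tag prefixes to ignore.
$config = \Drupal::configFactory()
  ->getEditable('purge_queuer_coretags.settings');

$blacklist = $config->get('blacklist') ?: [];
if (!in_array('node_list', $blacklist, true)) {
  $blacklist[] = 'node_list';
  $config->set('blacklist', $blacklist)->save();
}
```

The trade-off is exactly the one that comes next: once a tag is blacklisted, the pages that depended on it have to be purged some other way.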
Drupal 8 does have a mechanism, in its deep, dark edges, where you can tell it that when certain tags are invalidated, it shouldn't bother going and removing them from the upstream proxies. Now, what happens then? The organisation you're working with has to make a decision: how do you deal with dependent content purging at that stage? For the Commonwealth Games it wasn't a big deal, really, because the pages they were updating with those entity queues didn't change very often at all, and they were quite happy to deal with that manually. Remember, we were only there for a two-week event. We just had to be pragmatic rather than flashy, and doing the update manually was the way to go.

So, a few war stories and interesting things that happened. The self-denial-of-service is almost my most amusing one. I said that Drupal wasn't delivering results; the Commonwealth Games had another partner, Atos, who specialise in exactly that. It's a product they travel around with and tout at every major event like this. But for some reason, which I didn't really understand, results borrowed markup from the WWW Drupal site. Now, if you're a CORS wizard, you might be able to see what's coming next. We certainly hadn't predicted it before we went into the event with the Commonwealth Games, partly because we didn't have any visibility into how results was built. But yes, they were borrowing markup from WWW, and for strange reasons that markup was being delivered as a JSON blob coming out of Drupal, back to the results site. What happens then is that, because this is a cross-domain request, browsers get a bit touchy about security. Without getting too deep into it, the CORS spec says that results has to make what's known as a preflight request back to WWW and ask, am I allowed to make use of this resource? And that preflight is effectively uncacheable by the shared caches in between: it can give permission, but only as a one-time thing.

Now, if you've got a really good memory and you think back to those numbers I put up at the start: over 11 days at the Acquia-managed edge, the combined total coming out of the CDN was about four billion requests, with results contributing the bulk of that, whereas Drupal only contributed about 110 million. What that meant was that for billions and billions of page views, even when the page itself was cached at the CDN, the browser was sending an uncacheable preflight request directly at the Drupal infrastructure. That was just a risk we couldn't accept, and we had to find a way of mitigating it with the Atos team, and I'm really glad that we did. But it was something we only saw in the ramp-up period to the Games, say five days ahead of the event, rather than catching it with plenty of lead time.

I'm sure that many people in this room were kept awake at odd hours dealing with Drupalgeddon 2.0. I wasn't part of Acquia for Drupalgeddon 1.0, and I know that Acquia took some criticism for a feeling that some of its employees had advance notice of what was going on. I can honestly tell you that with 2.0 we had no internal knowledge of what was going to happen; I had exactly the same information that everybody in this room did.
And given the nature of the event, that obviously scared us. The Commonwealth Games had gotten themselves into a bit of a tangle of technical debt: they had stopped upgrading their Drupal too many releases prior, and ahead of the announcement, the pre-warning of Drupalgeddon 2.0, they had made a decision to accept the technical risk of running on an outdated version of Drupal. The thought of that was something else that kept me awake at night. And then, when we knew what was coming, I had to escalate it quite high into senior management at the Commonwealth Games and just say: you can't go into this event with an outdated version of Drupal. Your developers might be saying that the combination of BLT and Lightning and Drupal 8.3 was making it impossible to get that upgrade done, but you need to find whatever resource is needed and pay whatever is needed to get this upgrade unblocked. And yeah, we finally got there, which was good. I remember we all had to turn up at the operations centre at something silly like 3 in the morning to ensure readiness for that upgrade going out.

Then there was an interesting moment with regionalised live broadcast feeds. Now, the folk at the Commonwealth Games were normally a pretty honest bunch, but one of the product owners told a little lie. He assured his managers that he had built in a capability where, if people visited /live, depending on where they were in the world, they would be sent to the local broadcaster's live feed for the event. So if you're in Australia you'd get Channel 7, if you're in the UK you'd get the BBC, and so on. And I remember one of the developers coming up to me in a mad panic about three hours before the opening ceremony, saying: our product owner deliberately hid a requirement from us and never told us about these live broadcast feeds, and Channel 7 and all the other executives are upstairs demanding to know where it is. And I was like, well, okay. We've got country headers out of Varnish, so we can tell you where a user is located. But you've got three hours to do this, and you haven't thought about cacheability, have you?

Luckily, there was a win out of this. I said to the guy, look, as an Acquia TAM I'm not allowed to deliver you code, because if I do, I've got to own it and be responsible for it from then on. But if you leave me alone for an hour, I'm going to write it up, and then you're going to look over my shoulder, see it on my screen, and go and use it. And I was lucky, actually, because I'm nothing special with Vary headers, but I'd solved a similar problem for another customer recently, so I knew the method and knew how to do it. If you've never used a Vary header before in your life, Vary headers are magical, magical things. They tell proxies like Varnish how to look up a cached object using a key which is more complicated than just the URL. So the Vary header in this instance was telling Varnish how to find a cached object based on the combination of URL and country code. But there is one thing to be really mindful of when you're doing that in a Drupal world: while things like Varnish and Cloudflare, or any well-behaved CDN, are definitely going to obey a Vary directive, Drupal's page cache has no notion of that at all.
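What we ended up with was shaped roughly like the sketch below: cacheable redirects issued by Apache, keyed on a country header, with a Vary header so the proxies store one redirect per country. This is an illustrative .htaccess fragment rather than the actual Games code; it assumes Cloudflare's CF-IPCountry header is available at the web server, and the country codes and target paths are made up for the example.

```apache
# Illustrative only: redirect /live to a per-country landing page based on
# Cloudflare's CF-IPCountry header, without ever bootstrapping Drupal.
<IfModule mod_rewrite.c>
  RewriteEngine On

  RewriteCond %{HTTP:CF-IPCountry} ^AU$ [NC]
  RewriteRule ^live/?$ /live/au [R=302,L]

  RewriteCond %{HTTP:CF-IPCountry} ^GB$ [NC]
  RewriteRule ^live/?$ /live/gb [R=302,L]

  # Everyone else gets a generic international page.
  RewriteRule ^live/?$ /live/international [R=302,L]
</IfModule>

# Tell Varnish and the CDN to keep one copy of the /live redirect per
# country, rather than serving everyone whichever answer got cached first.
<IfModule mod_headers.c>
  <If "%{REQUEST_URI} =~ m#^/live/?$#">
    Header always merge Vary "CF-IPCountry"
  </If>
</IfModule>
```

However you express it, the trick is the same: the redirect itself is cacheable, and the Vary header gives the proxies a compound cache key of URL plus country.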
Drupal's page cache can only vary on URL, essentially. So we were lucky that we could achieve this effectively with cacheable, Vary-able Apache redirects. Someone would visit /live, and then, without ever falling back to Drupal, Apache would redirect them off to /live/au or /live/gb or wherever, and we'd still achieve that cacheability, but have it vary by country.

The other thing, and I won't go into massive detail about it, is that we did start getting issues with the development pipeline close to the Games. We all love config management, but these guys took config management to a completely new level, and we had issues with our build containers and why they wouldn't complete a build of the site before timing out. We had to have a serious conversation with them: you can't have 1,700 YAML files, okay? If you're going to do config management, you've got to tame it down a little more than that. Again, that close to the event we didn't come up with a long-term strategy around it. But as you're doing Drupal builds, just be mindful that config management is great, but you want to make sure your configuration isn't being split out into an unmanageable number of files.

I'd love to open it up to questions and hear some feedback from you. Just a quick shout-out first: we are hiring in Southeast Asia at the moment. We've got open headcount for a solutions architect up here, which is a fun role; if you'd like to know more about that, grab me today. My conscience is clear, I can go back and tell people that I made that little infomercial to come and work with us and have fun travelling around. But yeah, please, any questions about the event?

It was roughly a nine-month project, so they did it pretty quickly. I think there was always some minimal presence for the 2018 Games after the 2014 Games had finished, and they had a rough-and-ready Drupal 7 site in play, but there was never any performance tuning or anything like that done on it. They took on Acquia technical account management in July of last year, and one of the first things we did with them was help plan out this Drupal 8 rebuild. They would have pushed that live in November of last year, and at that point it was just a matter of keeping it in operation. The strategic consultancy side of things, making sure they could keep uptime on the infrastructure they had, was really a three or four month engagement.

Yes, the whole Drupal component was. Anything under WWW was hosted on Acquia Cloud. The results site was hosted on a custom Atos solution; I think, to be honest, they were even running their own bare metal underneath it. But we are effectively a reseller of Cloudflare, and we managed the Cloudflare side of things for both of those sites, so the non-Drupal site as well.

We did. Load testing, as a consultant, is something I have a bit of a guilty conscience about, because we always badger our customers into doing it, but what that effectively means is that they load test without understanding the hypotheses they're testing for, or without quite knowing how to interpret the results. So I think they did do load testing.
I don't think it was entirely representative of what was going to happen come Games time, but the load testing they'd done was certainly enough to convince me that nothing had been mis-engineered to the point where it wasn't going to perform. And if additional infrastructure was needed for a two-week period, it wasn't going to cost that much: against the value of the contract they had anyway, if we'd had to double the size of their databases and multiply their web servers by four, the additional spend wouldn't have broken the bank.

The solution was to effectively take CORS out of the equation, and this is where Cloudflare was beautiful. Let me get this straight: the resource was sitting on WWW, and it was being requested from results. I had to ask the team who owned results to request the markup from the results domain instead, and then in Cloudflare I set up a page rule so that whenever there was a request for that endpoint, it was rerouted across to WWW. So as far as the browser was concerned there was no CORS activity going on at all, but we'd just done a bit of intelligent rerouting at the CDN level, which let them get the resource from its original home. That sort of thing is one of the reasons I love Cloudflare; it's fantastic that it lets you do that. You hear all sorts of stories with other CDNs, ones that start with A, where to get that functionality enabled you have to jump through a lot of hoops.

Very little. The only integrations they had all happened at the front end anyway, so Drupal didn't have to worry about managing that data, nor was it really hit from an infrastructure point of view.

Yes, we did, but I guess that was more of a CDN-type concern. We always knew that sort of thing would happen anyway, and from an Acquia point of view that's part of the value proposition of taking on a CDN, where you can minimise it. For the Commonwealth Games, without wanting to generalise, in, say, Africa a lot of the content is going to be downloaded over slow 3G connections on mobile phones, and there are all sorts of smarts in Cloudflare where that can be minimised. So that's a win for sure.

Right, we did, but only for disaster recovery purposes. I've been given the wrap-up signal now, but feel free to catch me in the tea break and I'll tell you how that works. Okay, thank you very much.