Okay, welcome to Handling Powerball Night with Patch, and in a way also with Pantheon and Fastly. Tonight we're going to be talking about an extreme traffic spike that happened to Patch.com, which is one of the largest Drupal sites in the world, and how we worked over the course of a couple of days to build infrastructure that would be able to handle the next spike, despite not knowing exactly how big it would be. So let me introduce the two presenters. I'll go ahead and start. I'm David Strauss. I'm CTO and co-founder at Pantheon. We run Drupal infrastructure for development, testing, and deployment of projects, including Patch.com. And this is Abe Brewster, who is the CTO at Patch. Would you mind introducing yourself?

Sure. Those microphones work as well, so I'm going to use those. This is my first DrupalCon ever. I've been working in Drupal for about six or eight years, but under the radar. And it may well be my last if this talk goes poorly, but anyway, let me know when you want me to just start talking. Yeah, we should just bring you up here to kick things off.

So I actually have paper notes. You don't know me, and if you do know Patch, you probably know Patch as one of the more colossal failures on the internet. I want to take one moment to say that rumors of our death are exaggerated, and second, to introduce you to the problems of Patch and articulate them, to create the foundation for the story of these four days that we're going to talk about. I'm under no illusions as to who you want to hear from: it's this guy over here. So I'm going to keep my remarks as quick as possible and then hand it back to David.

Patch is a hyper-local news platform serving about 1,000 towns across the country. It was originally owned by AOL, spun off under shareholder pressure in 2014, and has been running on its own steam since. To move very quickly through the architecture change: it was moved off of AOL's infrastructure, which was running at $300,000 a month, to an infrastructure spend of about $30,000 on AWS, partially thanks to Matt right there, who worked on it in Calgary with the Hale Global team, and then we moved it over onto Pantheon. And we walked straight into what we call the hyper-local paradox at Patch. And I just have to hit a button.

So these are typical stories you might see on Patch which illustrate this paradox, which is that if I were to smell smoke in this room right now and I yelled "fire," that would be extremely valuable information to everybody in this room, probably pretty valuable to people down the hall, and not too valuable for people down the street. And that's the problem with Patch. We have information like this: you know, returning a Chihuahua to its owner, that is extremely valuable information to that group of people. Whether the schools are closed in Ridgefield is irrelevant to the people in Westport. So with decreasing traffic, Ridgefield, Westport, the people who own this Chihuahua, we have increasing value of our information. What that means is that we have to serve an enormous histogram of content daily across our network. So we serve on the order of, I think I have a next slide. I have to define how big we are, but I just hit the wrong button. I'm hitting buttons, David. There. That's today. We're running 7,600 concurrents, and on Chartbeat there are about 200 pages' worth of URLs being served at any given moment on patch.com.
It's got an enormous footprint that has to be served at very high availability. And our database is big, as Matt knows and as David knows all too well. We have about 15 million stories, articles, nodes in our database, 6 million users, 44 million images, something like that, that we are keeping track of and serving at any given time, at very high availability.

So we built this architecture, what Jesse at Pantheon called the decapitated Drupal architecture, which means we have Drupal on the front and Drupal on the back and a service layer separating them. It allows our front end to do a lot of the processing and minimally sentient stuff that we need to do, while serving content that is cached at large scale across the Varnish infrastructure in between the front and the back. What this has allowed Patch to do, if I get the right button, is start jumping the fence in certain cases to serve stories of national import. And this is not the only one. We do really well on election nights. We do all sorts of things. I mean, two nights ago, a Listeria story shot us to 13 or 14,000 concurrents. We had no idea that was coming. And we have the hardest working editorial group in show business, really, deploying all these articles and hyper-localizing them across our entire network of sites.

But picking up these national news stories comes with a price, which is that you start running into scale problems pretty quickly. And so this is the January 13th story. But on January 9th, it was a Saturday night like any other Saturday night. I was hanging out, having dinner, whatever, and I got a ping from one of our QA people saying, you know, by the way, we're running a little hot. And I said, why the heck are we running hot? He said, well, we're running at 10,000 concurrents on a Saturday night. That doesn't make any sense to us. And I said, okay, so we are. And I watched, and I watched, and suddenly it was 15,000. And then it was 20,000. And it was 30,000. It was surreal. It was going up like mad. And at 40,000, this is over the course of like 15 minutes. And at 50,000 concurrents, we're all giving each other high fives, excellent. And then Patch went dark.

And David, and I think Ari, were the recipients of one of those arm's-length bug reports from me, which no one really likes to get. But we recovered 20 minutes later. That's the last surviving screenshot from that night. You may notice that the meter has gone around. Which was, yeah, we didn't know. There were certain points when Chartbeat's going like this, and all of a sudden it just drops. Chartbeat would just vanish, and then it would come back, because it didn't know it had to reallocate servers to us.

And so we started dealing with, I didn't know how to read this. This was New Relic. I would like to point out that our transactions run 150 milliseconds normally. So we're coming out of there at normal speed: we bootstrap into our content at 150 milliseconds. Our servers are fine, everything's fine, except our throughput goes to nowhere. Right? And so we didn't know quite how to interpret this. We had two clues. One was that. And the other is that in our infrastructure, in this decapitated structure, we have our main services box, and it talks to our front end to help it clear its own Varnish cache very programmatically.
And very granularly, because we can't just blow the entire cache for Connecticut when a new Connecticut story comes up. We have to be very careful about where we do it. And so that box was having no trouble talking to Sailthru and other providers, but it couldn't reach the front end at all. And all the timeouts were at 499 milliseconds, which is our timeout to connect. So we got some sense that, guess what, we couldn't get to the front end. Couldn't even get there. It wasn't that the front end couldn't serve it. And that was basically the one fact that I managed to get into my bug report to David. The other fact, or two, was that no one had won Powerball, and it was going to happen again for certain the next Wednesday or Thursday night. So we were basically on the clock. And I think this is probably where I end and the person you want to hear from starts talking. So anyway, over to David.

So when you build out a site, you typically start with a basic architecture where you have browsers connecting to the application server. And then when you're ready to scale out for bigger amounts of traffic, you add in a load balancer that balances your requests between multiple application servers, multiple backends. This allows you to evenly distribute the traffic to those bottlenecked boxes, and you do it with something that has a very light touch, like a load balancer, which doesn't have to do much with the traffic other than send it to the right backend. The problem is that this can be a bottleneck in itself. And if you're processing enough traffic, like a national news story, adding a load balancer isn't enough to actually scale your traffic up for that sort of environment.

And indeed, when we started looking at the actual logs of where things had peaked out, it wasn't the application servers that were falling over. We saw network traffic basically spike up and then hover around this jittery edge that you start seeing when you're working with TCP traffic and you've maxed out the connection. Even if you have gigabit or two-gigabit connections, eventually it maxes out. And that's sort of the story of that night: once you start having that kind of national news story, you really can max out that sort of connection, and then a load balancer's not enough.

So this is what happened as we were working on that. Abe already told you about the downtime once they reached about 50,000 concurrents, which is about five times the baseline traffic that would typically be expected. And then around Sunday is when he filed the report with us. And when he says arm's length, he doesn't mean like a separation; he means the text was that long. And then once we actually got back to the office, I think, I don't know if it was a holiday Monday or whatever reason, but we started really diving into the technical issue on Tuesday, finding out that it was a load balancer bottleneck, and looking at our options. And we knew that we had to implement something by Wednesday because Thursday was the next Powerball drawing. And I also want to emphasize here that we had an unknown expected number of concurrents, because just because the infrastructure died at 50,000 doesn't mean you can say, well, if I handle double that, it's going to be fine. So eventually you have to take paths to even more scale: web scale. I don't know, how many people have seen this cartoon? Okay, great.
For those who haven't seen it, it's a cartoon mocking the term "web scale" because of how a lot of people put it in place when it's premature, or they're using not-very-reliable technologies that just run really fast. But this was actually an incident of genuinely requiring web scale to deliver this sort of infrastructure. And we went through a bunch of options internally at Pantheon.

The first option we considered, something we could graft on top of our existing infrastructure, was round-robin DNS, where we deploy a few more load balancers, we put all of them into DNS, and upstream our DNS will send out a fresh IP in the rotation each time a new request comes in. But there's a problem with this approach, because you don't have any guaranteed balance. Even if you have all those IPs rotating, a service provider like Comcast could very well get one result from you, cache it, and not necessarily follow the same structure for rotating it, causing everyone from a major ISP or region to be accessing the same load balancer all at the same time. And given that DNS caching is often provider by provider, that lumping can be really, really non-uniform. (There's a toy illustration of that lumpiness below.) So we decided against hacking this approach into place. It's not an approach that we even really support on the platform anyway; we would have had to go around the side to implement it.

The next option would be to pick up the red phone to our service provider and say, let's buy a gigantic load balancer. Do you have any 10-gigabit F5s lying around? And the answer is usually no, by the way, because 10-gigabit F5s start at $30,000 apiece, and if you want to implement them redundantly, you need at least two. So they don't typically just sit on the shelves, and often, red phone notwithstanding, they take weeks to deploy if you want to call up a service provider, have them put together a proposal, and then end up with it in a rack of hardware. And then there's the whole time for configuring it, and questions about how well it'll handle other load profiles like HTTPS termination. And ultimately it still concentrates the traffic all into one data center, which has its own kind of issues with being able to scale out the traffic and performance.

The third option we looked at was skipping the load balancer and going right to Pantheon's edge, because we deploy Varnish nodes on the Pantheon edge for every site; each site comes with a set of three nodes that both handle Varnish and then route back to backend containers. And unless someone's implemented HTTPS, we typically send the traffic directly there. Each of those boxes has multi-gigabit capacity, but at the same time it could be lumpy with the distribution. And ultimately this wasn't going to get us through more than just Powerball night, because while patch.com is currently HTTP-based, they've been deploying more HTTPS things and eventually will need to go more and more HTTPS. So a solution here that doesn't account for the next few months of roadmap, and can only be used in this sort of emergency circumstance, is not a very appealing option. And it would have even been a regression for that night, because even today patch.com can be accessed over HTTPS, it just redirects you, and we wouldn't have been able to handle the redirects with this approach. So we sort of ruled that out too. So now we're heading to CDN territory.
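Before leaving the rejected options behind, here is a toy Python sketch of the round-robin DNS concern raised a moment ago, not anything Pantheon actually ran: each resolver caches a single answer for its whole user base, so a rotation that looks balanced in DNS can still dump one giant ISP onto one balancer. The ISP names, user counts, and IPs are invented for illustration.

```python
import random
from collections import Counter

# Toy illustration: three load balancer IPs in a round-robin DNS rotation,
# and a handful of ISP resolvers that each cache ONE answer for the whole
# TTL window. Resolver populations are deliberately uneven, the way real
# ISPs are. All values are made up.
BALANCER_IPS = ["198.51.100.1", "198.51.100.2", "198.51.100.3"]
RESOLVER_USERS = {
    "big-cable-isp": 600_000,   # one cached answer covers all of these users
    "regional-isp": 150_000,
    "mobile-carrier": 400_000,
    "small-office": 5_000,
}

random.seed(1)
load = Counter()
for resolver, users in RESOLVER_USERS.items():
    cached_ip = random.choice(BALANCER_IPS)  # whichever answer the resolver happened to cache
    load[cached_ip] += users

for ip, users in sorted(load.items()):
    print(f"{ip}: {users:,} users behind it")
# Typical output: one balancer ends up carrying the giant ISP's entire user
# base while another sits nearly idle. "Balanced" in DNS, lumpy in reality.
```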
The initial CDN architecture that we looked at was the standard one, where user agents get routed to a point of presence that is geographically local to them, and if that point of presence misses its cache, it hits back to the Pantheon origin. This would meet the use case, and we would have been perfectly happy implementing it, but it isn't as good as you can get, because you can also add in an origin shield. This is supported by many major CDNs, including Cloudflare and Fastly. What this basically does is give you a few layers of caching. We have the local point of presence, which stores content geographically local to the user who's accessing it. Then you have an origin shield, which basically forms a backstop right before the CDN actually connects to the origin systems. What this does is aggregate all the misses from all the different points of presence around the world into this origin shield, and then if the origin shield has the content, according to the rules for the CDN, it doesn't have to reach the origin at all. And the great thing about this is that you typically can configure the caching lifetimes and other configuration consistently across the whole CDN, including the origin shield and points of presence, without a lot of extra effort. So you almost get this additional cache hit for free. It would probably have been fine to send all the points of presence to the origin, but with a national news story you're working with quite a few points of presence that are going to be getting traffic coming in, and each one would have to miss on each piece of content before it would be locally cached.

I'd like to add just one thing to this, too, which is that at 11 o'clock the numbers are announced. So you have to be able to purge these kinds of things as well. It's not that they're just going to sit out there. And the other 4,000 stories on Patch, or 3,000 stories on Patch that day, you get about 1,500 new ones in a day, are all being updated at the same time as well. So everything is volatile in this situation. Yeah, and so we needed to accommodate that particular thing in the architecture as well.

So we went into CDN selection and our priorities for this sort of implementation. Now, if you have much more lead time your priorities might be a little different, but in our case, we needed to find a CDN that's ready to handle top-100 levels of traffic. There are some really interesting CDNs out there that are kind of newer on the scene, like KeyCDN, that have a nice feature set, but I don't know if I would want to throw a top-100 website at them overnight. That would probably blow their capacity planning. We needed Drupal compatibility, in the sense that we needed to be able to specify the rules that Drupal needs to operate properly across the CDN. We needed to be able to do this in a reasonable amount of time. And I say a reasonable amount of time because when you work with a provider like, say, Akamai, which has a very fast CDN and great coverage, their rule sets basically require that you work through them to implement updated rules. And that makes it very hard to implement something like Drupal, which has complicated cache management rules, because the round trip through their professional services team to implement rules and then refine them means the minimum implementation for that sort of thing would be weeks, and we didn't have weeks.
We needed to have HTTPS support all the way through, running on the apex domain. An apex domain means something like example.com, whereas a third-level domain would be www.example.com. Some CDNs only offer HTTPS for third-level domains, because you can do things with CNAMEs there, and different CDNs have different approaches for supporting the apex domain. And because patch.com literally runs as patch.com, not www.patch.com, this was something we wanted going forward, so that we could select a CDN not only for the next couple of weeks but ideally for the next few years.

The 24-hour turnaround is a big deal. I mentioned a bit about how that impacts things like Drupal compatibility and capacity planning, but we didn't want to be the biggest customer of the CDN we were taking on. We wanted to be just another customer.

And we also wanted to have an eye toward the future for cache tagging and invalidation. How many of you are building sites on Drupal 8 at this point? That's like 20%. I mean, that'll continue to grow, but Drupal 8 can now tag content so that rather than invalidating by URL patterns, you can send surrogate keys down to your edge cache, and generally an edge cache supports the ability to invoke clearing of a specific key. What that means is that every time, say, a node appears, as the node view or in a block on a page or in a listing on a page, you can tag the page with that node ID, and then whenever you update the node, you can invalidate every page that has anything derived from that node. It allows very precise cache invalidation, and it doesn't mean you put just one tag on a page. If you include five nodes on a page, you would put five tags on the page, so that if any of those nodes is invalidated, you can invalidate the edge content.

And we also wanted, as additional insurance, I've talked about security in terms of defense in depth, and I think there's such a thing from a reliability perspective as engineering in depth. What that means is: don't just have a plan A for your engineering in terms of your customer experience, have a plan B as well. And CDNs, including Fastly and Cloudflare, have options for falling back to their stale or cached content if the origin or backbone is non-responsive. What that means is that you can retain a copy of content on the CDN for a considerably longer amount of time than you would serve it normally, and then rather than giving your users an error or a blank page or something worse, you end up serving them, at worst, stale content. And that's a much, much better thing for your traffic, ad revenue, and brand.

So what we ended up doing was a sort of clever approach to minimize the amount of integration we would have to do deeply with the CDN initially. Rather than try to move all the cache invalidation logic to the CDN right away, by which I mean within the short period when we were implementing this, we kept the long TTLs and invalidation on Pantheon's edge with its Varnish cluster. And then on Fastly we implemented a 15-second TTL for the content, but we told it to retain content for a full day just in case. What this meant is that if we invalidated something at Pantheon's edge, it would be fresh to CDN users within 15 seconds. And that was a small enough window of time that even updating things like a Powerball results story wasn't going to be that dramatically affected.
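To make the last two ideas concrete, here is a minimal sketch of how a site could express a similar policy from the origin side: a 15-second CDN TTL, a one-day stale copy as the "engineering in depth" fallback, and surrogate keys for tag-based purging. In the actual setup described in this talk, the 15-second TTL and the one-day retention were configured on the Fastly side rather than sent as headers, so treat this as an illustrative alternative; it assumes a CDN that honors Surrogate-Control and Surrogate-Key the way Fastly documents, and the node IDs and numbers are invented.

```python
# A minimal WSGI sketch (not Patch's code) of the caching policy described
# above, expressed as origin response headers instead of CDN-side config.
# Assumes a CDN such as Fastly that honors Surrogate-Control / Surrogate-Key.
def app(environ, start_response):
    body = b"<html>...story page rendered by the CMS...</html>"
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        # Browsers and shared proxies: keep it effectively uncached.
        ("Cache-Control", "public, max-age=0"),
        # CDN only: 15 seconds of freshness, but keep a stale copy for a day
        # and serve it if the origin is unreachable ("engineering in depth").
        ("Surrogate-Control", "max-age=15, stale-if-error=86400"),
        # Cache tags: every node rendered on this page, so updating any one
        # of them can purge this URL by key instead of by URL pattern.
        # The key names are hypothetical.
        ("Surrogate-Key", "node/8675309 node/24601 front/ridgefield-ct"),
    ]
    start_response("200 OK", headers)
    return [body]

if __name__ == "__main__":
    from wsgiref.simple_server import make_server
    make_server("127.0.0.1", 8000, app).serve_forever()
```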
And then the one-day retention meant that even if there were, say, an hour or two of inability to access the origin or a backbone connection, the CDN would be able to use its extra copy of the content and serve that up, so that users would still see something. Fortunately we didn't have to fall back on that, but it's nice to have it there.

So how do you move to a CDN, especially when you have a really limited amount of time? In our case, one of the first things we did was work to get the CDN to terminate HTTPS, because we wanted to make sure we could handle the redirect and would be ready for that future traffic later. It tends to be kind of hard to throw HTTPS into the mix if you don't account for it a little early in the process, because sometimes you end up with aliased pages for the HTTP versus HTTPS versions, and it's really just better to get the handling in there even if the only purpose of that handling is to redirect to the non-encrypted copy. The reverse is true if you go HTTPS-only. It makes sense to handle both cases and make sure the right things are in place early.

Then we made rules to properly handle Drupal. This typically covers things like pass-through for active sessions. If you've ever looked at a Varnish VCL file, you see things like: if there's a session cookie, then pass rather than try to hit the cache. That's pretty much required for any sort of content management system. And keying on HTTP versus HTTPS, which means that when we return a different response for one versus the other, it doesn't get aliased. You can get into a redirect loop if you don't distinguish the scheme when you cache, because if you're redirecting from one to the other or back, you definitely don't want to load the HTTP page and then be sent a redirect to the HTTP page that was cached from the last time you handled an HTTPS response. So those were table stakes for just getting it to work and not be broken.

Then we created a test domain that routes through the CDN. On a service like Pantheon, you just add an extra domain on the dashboard, and then if it's a subdomain, you just tell your CDN about it. What that does is allow you to inform other people in the organization: hey, here's something to hammer on, and this should work for you. They can do things like log in, log out, view pages, make sure pages cache. You can make sure that retention times are working properly. And you can also make sure that all of the interactive things, for once you're logged in and, say, editing content and then viewing it, that all the pass-through happens right. Also, if you're doing redirects between HTTP and HTTPS, make sure that all works correctly and that you don't get cache aliasing, which is what I just mentioned, and validate all of that through your test domain. (There's a rough sketch of these spot checks below.) In Patch's case, there's a lot of custom infrastructure in terms of the headless Drupal and various other services implemented through Drupal, and those all interact, even through Varnish themselves, although not through the CDN. So we wanted to validate that all of that was still working too.

And then it's really cut-over time once everything looks great. And the best way to cut over, because DNS can be finicky, especially with long cache times, is to drive that TTL way, way down.
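As a rough illustration of the test-domain spot checks just described, here is the kind of quick script you might run before the DNS cut-over that follows. The domain, the cookie name, and the reliance on the Age header are assumptions (header behavior varies by provider); this is not Patch's actual test suite.

```python
# Rough spot checks against a CDN-routed test domain (hypothetical name).
# The Age header is standard; other cache headers vary by provider, so
# treat the exact expectations here as assumptions.
import requests

TEST = "https://cdn-test.example.com/some-cached-story"

# 1. Anonymous traffic should be served from cache: on the second request
#    the Age header should be > 0 (the object was not re-fetched from origin).
first = requests.get(TEST)
second = requests.get(TEST)
print("first Age:", first.headers.get("Age"), "| second Age:", second.headers.get("Age"))

# 2. A session cookie should pass through to the origin rather than hit the
#    cache, so a logged-in-looking request must not get the anonymous copy.
#    Drupal-style cookie name shown; adjust for the real session cookie.
with_cookie = requests.get(TEST, cookies={"SESSabc123": "fake-session-id"})
print("cookied request Age:", with_cookie.headers.get("Age", "(absent, i.e. passed to origin)"))

# 3. The redirect between HTTP and HTTPS (whichever direction the site uses)
#    must not be cache-aliased: the plain-HTTP URL should return a redirect,
#    never a 200 that was cached under the other scheme.
plain = requests.get(TEST.replace("https://", "http://"), allow_redirects=False)
print("http:// status:", plain.status_code, "->", plain.headers.get("Location"))
```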
I know that a lot of DNS systems charge by the number of requests, and when the TTLs drop, you pay a proportional amount more, but for the transition time it's totally worth it. So you reduce it: maybe you had a 15-minute TTL before, really reduce it to something like 15 seconds, something where the cut-over can be pretty reliable and pretty quick. Not everyone running a recursive resolver downstream is going to totally honor your cache times, but they tend to be close. And then you want to wait out the old TTL. In fact, you can start this process well before you're ready to actually cut over, because this step doesn't actually cut anything over. It may reduce performance a tiny, tiny bit for hit rates on DNS caches, but if you have a 15-minute TTL, you do this change at least 15 minutes before you're ready to actually cut over, because you want all the downstream caches to be flushed and everyone to have those short-lived copies of that DNS record. And then, once it's a super short time, you can actually retarget to the CDN or alternative infrastructure, and that should roll out really quickly, because if you have, say, a 15-second TTL on your DNS records, then it should only take 15 to 30 seconds after making the change before you start seeing the traffic really flow to the new IPs. And then once you're happy with all of that, you can validate the production functionality and return DNS to the original TTL.

The other purpose of the short TTL is that it also makes a mistake less costly. Let's say you get the configuration wrong or something was not set up properly in the CDN and you need to go back to your original origin: you can just toggle it back, and it takes just as little time to toggle things back. You really don't want to be running a high-traffic site, get the configuration wrong, and have to wait 15 minutes before you can possibly fix it. But once it's validated, you can bump the TTL back up. In this sort of structure, you've only spent an hour or two at the low TTL, and then your performance and cost return to exactly where they were before, but you've mitigated the risk of the transition.

So I will hand this back to Abe to talk about what it was like after we did all of this and we were into the next Powerball drawing night.

I'd be remiss not to thank the Pantheon crew again for everything they did on those two days, my bug reports notwithstanding. We all put in some long hours and days to get this right, because Patch, in a hyper-local sense, is edge cases upon edge cases upon edge cases, because if we're in Joliet, Illinois and we're in Greenwich, Connecticut, those things are about as similar as chalk and cheese. So nothing can be a one-size-fits-all solution. So we have these things that Pantheon helped us out with.

So at 10:45, we were running about 19,000 concurrents, which was warm, but we were sort of saying, well, what happened? What if they had a war and nobody came? What if we had a Powerball night and nobody came to Patch? Or maybe the Google juice was weak that night. Exactly, exactly. So we were sitting there, this is at 10:45, watching, watching, watching. And that's at 11. That jumped to 51,000. So we broke the previous record, and you can see that we broke the needles off the scale at 11:01. That's 11:04, three minutes later, when we jumped another 20,000 concurrents. And by the way, we're invalidating stories all over the place still.
I mean, people are invalidating, we're sending out breaking news alerts, and when a breaking news alert goes out of Patch, say we send one to Ridgefield, Connecticut, it puts a breaking news story on the front page of Ridgefield, Connecticut. So it invalidates that page. So every single time we're doing this, we're invalidating across 1,000 different domains, sites, or sort of URLs, all through this period. That's 11:05, one minute later, and we've jumped another 20,000 concurrents. There's 11:06, which is 110,000 concurrents. This is what we saw on Fastly, by the way, which is that Fastly was taking 81% of our traffic. And that still means we were seeing five or six thousand hits coming right through Fastly to the Pantheon edge. And that's what our throughput was at the time, about 5,000, right to the metal on the front end.

This actually understates the Fastly hit rate a bit, because the way the origin shield works is that if you miss the point of presence and then hit the origin shield, it counts as one hit and one miss, because each hit and miss in each area is accumulated independently and then all rolled up. So I think we were probably in excess of 90% of traffic really being absorbed by Fastly. Right, that would make sense, though.

And that's the screenshot of record that we have, which is Google Analytics saying we had 250,000 and Chartbeat saying we had 117,000. It went higher, but we didn't catch it. And I think that's, is that our last slide or not, David? I'm going to just check. That's our last slide, so we made it. Thanks to these guys, and thanks to the engineers at Patch too, all six of them. And the Fastly team, for turning this around in 24 hours. Yes, and Fastly. Yes.

And it also shows why it matters not to assume that, hey, this is where we maxed out before, let's implement double the infrastructure and hope things work out. Because we would have had to implement triple the infrastructure, with good balancing between it, in order to sustain this traffic with the architectures we were using before. I agree. Yeah, we can open it up for questions. Questions, and there may be one more slide there. It's just the, oh, this one. So yeah, questions please.

So this is for Patch. How did you guys get ranked number one? Is that just happenstance, Google blesses you and away it goes?

For one thing, one of the nice things is that Patch is local. We are hyper-local in a thousand towns across the country, and when the winner is in a Patch town, we're going to have the story probably first anyway. And Google's going to privilege that story over a lot of other things. We do have pretty decent SEO on the site. We are still doing a spectacular number of things wrong, which gives us a lot of room to improve. But we get blessed here and there. We got blessed on a couple of the primaries. And you just never know why it happens sometimes. But a lot of it has to do with being out of the gate early with a story that's a good story. And one of the nice things about Patch stories is that, if they're really a local story, they're generally a really good story, and they're unique. So that gives us a bit of an edge there. All right, thanks. Sure, sure.

And this question is for Pantheon. First, that's an impressive piece of engineering in an incredibly short timeframe. My site is slightly different in that I don't know when that massive traffic spike is coming. What would you do differently?
Would the same application architectures work? Well, if you're worried about a traffic spike, then you need to be thinking about exactly how far up it could go. Part of what we do at Pantheon is, by putting a cluster of Varnish boxes in front of every single site, we can actually handle top-100-site-style traffic on a typical day without any issues. And top-100-site-style traffic on a typical day is an enormous amount of traffic for a normal site. I would also recommend looking at your options for throwing a CDN in front. You want to look at the costs and the implementation times. Some CDNs, like Cloudflare, start free, and you can actually get a lot of their benefits without having to spend anything. Fastly starts at paid rates but offers more customization of the caching there. Okay.

I would also add: we never know when we're going to get a spike either. Last night I had to roll out code, I had a window of about an hour to roll out some code, and suddenly the northern lights were visible in Connecticut, and I was rolling out code on Pantheon with 8,000 concurrents on our site. Just had no choice. Okay, thank you. But, to that point, what helped there was having Fastly in play, because I could then open up their TTL for another minute or two just to give myself the breathing room I needed to get the code out the door, and then close it again.

Yeah, I think that's the other benefit CDNs will give you: if you've got 500s on your site, it'll still be serving traffic, so you can pull that traffic so you're not serving those 500s. But another thing I want to mention really quickly, from what we saw at Sony, where we had a lot of hyper-local, sort of hyper-local on the global stage, is that these CDNs really helped serve traffic much faster because they're pulling from Germany or Austria or Japan. We saw a big boost even in organic search, because our site was delivering a lot faster just by having content coming from closer points of presence. And then there are some VPNs out there that you can subscribe to, and you can proxy your traffic through to those other locations, whether it's around the country, to test these things out before you go live with them too.

Hi, I think on one of the slides you said you have a very long TTL at origin. Or sort of. It's not that long. Okay, so. Five minutes. Five minutes, okay. And then you said you're invalidating all the time. We blow through the five minutes. Our editors are very impatient. Right. So if they make a change, they want to see it immediately. So. I'm very familiar with that sort of thing. It's only a long TTL compared to the 15 seconds in Fastly. And we also stagger our TTLs; we don't just set a global TTL. In Drupal speak, we start with a global TTL, and then anybody who's dealing with content anywhere down the execution chain can change that TTL as it's going out the door. So if we get a 404, we set that TTL to something very long, because we don't want to serve that 404 again. Or in certain cases, if you go to patch.com, say patch.com/Connecticut/Greenwich or whatever, there are 300 pages of content on patch.com/Connecticut/Greenwich: page two, page three, page 300. We put a much longer TTL on pages three through X. Things like that. Sorry, is that within Drupal? Do you have some kind of module where you're setting custom TTLs by path, or are you doing this?
It's generally procedural, and yes, we set the global TTL and then on hook_exit we set that into the headers.

The other thing I wanted to mention here is that these short TTLs can be amazing for vacuuming up traffic. Even a one-second TTL on, say, a Varnish box, if you're able to handle the traffic through there, is the difference between capping a given URL at 60 origin requests per minute versus a completely unlimited number. And so at 15 seconds, we were getting at most four requests per minute from the origin shield for any specific URL that could hit the cache. And that makes all the difference in the world. And even that tiny amount of caching, I think at the peak it was hitting something like a 96% hit rate in Fastly, maybe even 98%. And that's even including the caveat I mentioned before, that if a request misses the local point of presence and then hits the origin shield, it counts as both a hit and a miss. It was hitting points of presence at that rate just because everyone was viewing the same thing. So if your traffic spikes happen to be laser-focused on particular assets on the site, a short cache time can do wonders.

I think also we found at Patch that if you really need to get traffic to your site, what you do is bury a billion dollars somewhere in the country, and then tell everyone that they wasted their time trying to get it.

And last question: as far as the invalidation goes, are you doing any sort of integration between Drupal and Varnish and Fastly?

There's integration between Drupal and Pantheon's Varnish edge. We have an API on the platform for doing invalidations. There isn't any integration with Fastly right now, and that's part of the story of why we chose a super short cache lifetime, because even a really impatient editor can generally wait 15 seconds for something to actually get fresh. The actual work to do an integration with Fastly should happen eventually, and I've certainly recommended it to Patch, especially because of the cache key support. But as a stopgap, especially if you're in a rush, a really short cache lifetime deployed in a very robust way can do wonders for a site without you having to worry too much about stale items and explicit invalidation.

Not that I prefer one over the other, but what were your reasons to choose Fastly over Cloudflare in just a few days?

Well, there was definitely a difference in how they could handle the case of an issue connecting to the origin or over a backbone, because when Fastly goes into grace mode, it's very transparent to end users, even though it shows up in the aggregate results in Fastly's dashboard and logs. So we can know it's happening even if site visitors don't, which is really nice from a brand perspective. Cloudflare can keep the site online, but they put branding on the page that says that basically Cloudflare is keeping the site online, which was a little less desirable. And also, in terms of really granular cache invalidation and cache key management, Fastly has been in that game for a long time. There are some newer cache invalidation features on Cloudflare that have launched, I think, in the last six months; we actually have a Cloudflare person here in the audience. Oh, okay. And an ex-Pantheon person, too. And so there's starting to be more parity between the two services: Fastly is gaining the things that Cloudflare once had as fairly unique attributes, and Cloudflare is gaining the distinctions that Fastly had.
Sometimes it comes down to exactly how much granularity you need, because you certainly can't inject VCL into Cloudflare.

Second question. We're currently using Akamai. Can we achieve the same results with the low TTL on Akamai that you got from, say, Fastly? I mean, you can. There's nothing you really can't do with Akamai if you throw enough money and time at them. Thank you. But the one thing I would caution is that I see some people aspirationally use Akamai when they're really a middle-tier customer, and then you just get the dregs of the resources from Akamai. You will work with their worst engineers and their worst implementation people, because unless you're throwing a lot of money at Akamai, you don't really matter to them very much. How do you really feel about that? So, I think they're a great product at the very high end; I think that choosing them at the middle tier gives you less value. Yeah, unfortunately, we are a video shop, so that web bucket is a very small portion of the whole contract. Okay, but thank you.

So, assuming the Fastly integration and Drupal work together, is there any reason to keep the Varnish at the infrastructure end? That's a good question. Well, the answer right now is that only a handful of sites on Pantheon are going through Fastly or any other CDN. Most sites are relying on the Pantheon edge for their scaling, and most sites can actually rely on that; it's fine for them. Patch is actually the first case, and to date the only case, of someone actually saturating the bandwidth and edge resources of Pantheon. Yeah, we didn't mention that Powerball itself went down during that time, and we figured it out.

I would also say that we are actually already doing that in certain cases. We have a lot of internal, sort of microservice architecture now, so we have three or four sites sitting behind our front end. Some of them are using Pantheon's edge, and some of them are using Fastly's edge, and we're invalidating Fastly and invalidating Pantheon in the two different cases. So we do some tricky stuff there.

This is a question for Pantheon: do you have an option available for customers to turn off Varnish? Do you have such customers? You can always send headers to Pantheon's edge that cause it to not cache, and then rely exclusively on what's upstream, toward the CDNs. And generally you can configure CDNs in various ways, where you configure how they make their caching decisions, and choose a setup where you can indicate to the CDN that you want it to cache even though you don't tell Pantheon it can cache. We actually have one huge website, which I don't want to name, that is doing that exact model on Pantheon: they basically tell Pantheon's edge to just pass everything through, they use a CDN for their real caching, and they do deep integration with the CDN for invalidation.

And what is the reason they chose it this way, to completely skip it? I think if you're willing to do all the deep integration with the CDN to handle your invalidation there, and you're willing to pay for a CDN, and you have access to it and people qualified to configure it, then it's technically a more scalable solution. And I wouldn't recommend trying to deeply integrate with both Pantheon's cache and a CDN cache; you should probably pick one or the other.
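For a sense of what "deep integration with the CDN for invalidation" can look like in practice, here is a hedged sketch of purging by surrogate key against Fastly's purge API. The service ID, token, and key are placeholders, and, as noted above, this is not something Patch had wired up at the time of the talk; the 15-second TTL stood in for it.

```python
# Sketch of purge-by-key against Fastly's API (placeholder credentials).
import requests

FASTLY_API = "https://api.fastly.com"
SERVICE_ID = "YOUR_SERVICE_ID"      # placeholder
API_TOKEN = "YOUR_FASTLY_TOKEN"     # placeholder

def purge_key(surrogate_key: str, soft: bool = True) -> None:
    """Invalidate every cached object tagged with this surrogate key."""
    headers = {"Fastly-Key": API_TOKEN}
    if soft:
        # Soft purge marks objects stale instead of evicting them, so they
        # can still be served as stale content if the origin misbehaves.
        headers["Fastly-Soft-Purge"] = "1"
    resp = requests.post(
        f"{FASTLY_API}/service/{SERVICE_ID}/purge/{surrogate_key}",
        headers=headers,
        timeout=10,
    )
    resp.raise_for_status()

# For example, after saving a node, purge every page that rendered it
# (hypothetical key, matching the tagging sketch earlier):
# purge_key("node/8675309")
```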
You should either do what Patch does right now, which is CDN caching only for very short periods of time, or do what that other site does, with Pantheon's edge caching either for very short periods of time or not at all, and then control the cache in the CDN.

And one last question. Do you guys offer custom Varnish configurations? We don't. Our typical approach is actually very similar to the direction that Fastly's been moving with configuration, which is putting more of it into standardized in-band headers, where you specify as much as possible using the standard HTTP spec, and then from there with some extensions to what the standard HTTP spec specifies, and then on the page you can inform the cache how to behave. So for example, on Pantheon, you can tell the edge cache to key on various cookies for that content, and then the cache will know that you will deliver different pieces of content depending on the cookie the person comes in with. But that's all handled in-band. And then for people who come to us and ask, how can I handle this custom VCL case, we sort of roll back and say, well, what are you actually trying to accomplish in terms of your caching strategy? And we will see if we can map that to what we have already. And if not, we will consider: is there a way we can allow customers to configure this in-band to satisfy this use case without introducing custom VCL? Then again, if you need custom VCL, that's something Fastly provides. So that would be the other kind of extended advice we would have: you can always disable any caching at Pantheon's edge, then implement something like Fastly and throw whatever VCL you want in there.

I would add to that, too, that this is a good example of what you talked about with the engineering in depth, because even if you put a five-second TTL on Pantheon's edge, you'd have a fallback if something went askew on the outside end. So.

I might have missed part of this, but you were talking about how it might take a long time to get your configuration up to the CDN to allow a certain path, uncached, or to configure SSL, if you're going through SSL, with Akamai. How were you able to do this in two days, if that's the case?
So you could do it self-service with Cloudflare in two days, and with Fastly, you basically just email or call them and say, I have to turn this around really quickly, and they have people ready to do things like deploying certificates and readying their infrastructure.

Okay, and then how would the solution change if you were doing most of the work in e-commerce, where the content is always changing, you know, you're doing most checkout processes on Black Friday? Well, for most e-commerce sites with a checkout process, I'll hit one unique thing for e-commerce in a moment, but mostly it's the same as the logged-in user case, where you still want to be caching all of the static assets like JavaScript, CSS, images, et cetera, and then only generating the actual page content on demand. Drupal 8.1 has introduced some things like BigPipe that allow you to progressively render and deliver the page to the browser to speed up rendering for authenticated user cases, which would be relevant to that.

The other thing I'll mention for e-commerce is that when you implement a CDN, you're basically implementing a man in the middle, and you need to make sure you don't miss any compliance goals you have. Let's say you need to meet PCI: if you're sending the credit card over that website and submitting it to your application, that will go in the clear over the CDN, because they have to terminate the encryption to do their caching and then re-encrypt back to you. And you need to make sure they are re-encrypting back to you and that the CDN you're using is actually PCI compliant. Which both Cloudflare and Fastly are, by the way.

My question was related to that as well. How does the HTTPS termination work? Where does it happen? It happens at the CDN, at the point of presence. So the communication between the CDN and Pantheon is unencrypted? It is encrypted. Both Cloudflare and Fastly allow you to encrypt to the origin. It uses TLS. Anyone, any more? Microphone please, just for recording purposes.

So this handles the anonymous user case very well, but what percentage of your traffic is authenticated, and what are your plans for scaling that? On the front end of Patch, there is no authenticated traffic. We have an authoring box where we deal with our authenticated traffic, and a completely different architecture. So that's the short answer to that question. So is it just your editors that are authenticated? They connect to a different instance, don't they? Pardon? They connect to the back-end Drupal, right? Yes, they connect to the back-end services box, never through the front. And there are about 40,000 active users currently authoring stuff, 400,000 that we consider unsuspicious, out of our 6 million total user objects.

The one thing I'll also say is that even in cases that might seem to require authenticated users, you can certainly play games where, as far as Drupal and the CDN are concerned, it's anonymous. Like, if you just want to offer commenting on your website and you implement something like Facebook comments, Disqus, or something similar, that doesn't require users to be authenticated to Drupal. And so you can still hit all these edge caches while still providing some community interaction.

Okay, I think we'll wrap it up. Thanks for coming. Oh, thank you. Just a last question. Would it be possible in an architecture like this, the authenticated user cache case is driving me to this question, would it be possible to configure a way through the CDN and the Pantheon edge to another caching instance, like, for example, Authcache or something like that, for caching authenticated users? Sorry? Which cache? Authcache. Authcache? Yeah. Yeah, that gets into some pretty complicated use cases. And you can do things with patterns like edge-side includes, where you can integrate some cached pieces of content into an overall uncached page. I haven't worked with Authcache in a while, and I'll say why: I haven't found that many cases where it really solved the problem, and it definitely complicates the architecture of the site. No question that it's complicated, but is it possible to configure a real separate way to cache authenticated traffic, separate from the unauthenticated? I mean, you can set cache rules to cache whatever you want, but authenticated traffic would be distinguished by at least the session.
And there's often not a lot of benefit in caching that is keyed by the session, because you typically get low cache hit rates. OK, thank you. Sure, any time.