So hi, everyone. Thanks for picking our session this morning. This is Tag, You're It! Keep Your Reverse Proxy Cache Up to Date. You've probably seen some of these questions from clients when they can't see that their content is actually updated, even though they've just made that update. So we're your session presenters. I'm Christian, and here are a couple of little pictures of me. I live in New York City's theater district. This is me flying in Fifi, one of the last of its kind of a particular type of aircraft. And here I am hiking on top of the state of Vermont's tallest mountain, Mount Mansfield. I live in Hoboken, New Jersey, across the river from Manhattan, with one of the most fabulous views of Manhattan. I was recently introduced to four-wheeling on ATVs and had so much fun that now I want to learn how to drive a race car. And in the lower corner there is my roommate, my co-conspirator, my partner in crime: Harvey, the shelter dog. Christian and I are from Columbia University, one of the world's great universities, located in New York City. It was founded in 1754, originally called King's College. Then we had a little revolution, and the school was renamed Columbia, meaning the new America. Columbia has about 29,000 students, 18 schools, and boasts the largest number of Nobel laureates of any university, at 82. Christian and I actually work for the law school, Columbia Law School, which has expertise in many areas of the law. Of particular interest to this group might be our intellectual property faculty and our digital technology law faculty. Pictured is Professor Tim Wu, whom some of you may know as having coined the term net neutrality. Another member of our faculty who may be familiar to you is Eben Moglen, who has been a very dynamic advocate for the open source movement. Now I'll tell you a little bit about what we're covering today.
The complex caching issues that we faced at Columbia Law School and how we solved them; why it's critical for your site to reflect updated, published information as soon as possible; current invalidation techniques and why they don't always work for all of us; as well as developments in building smarter caches in Drupal 8. When I arrived at Columbia, I was faced with some challenges as far as the web goes. We had a legacy system that had more than 10,000 pages of unstructured content. To even bring it into some sort of usable site, we were looking at about two years' worth of development to deliver services that were sorely overdue. The problem was that our legacy system really had no future: there was no community, there was no more core development being done by the vendor, and there was no stakeholder support. So we started to look for a new content management system, and we had to establish guiding principles for what it was going to be. The first was that we were looking for an open source solution. The solution had to be scalable, but also infinitely flexible for the vast number of customizations that we needed to be able to do. And we were looking for something that was supported by a community that was as vibrant and as innovative as our own. We discovered Drupal. A little bit about our production environment. We have about 5,500 authenticated users. We have 250 active editors of very robust public sites. We also have about 3,700 total unique visitors during peak periods. Right now would be one of them for us, because we just started school. All right, so when it comes to our content — and we're not completely migrated from the legacy CMS yet — here's what anonymous users have access to. This is just anonymous users; we have a lot more content that authenticated users can access. But for anonymous users, we have about 10,000 entities. And roughly half of them are extremely volatile, meaning that the data changes very often.
And at times, we don't know exactly when it changes. We have a lot of editors, and the central team isn't always able to keep up with them. About 4,700 menu links belong to about 115 custom menus; we use custom menus to permission out menu administration to different editors. A little more than 11,000 custom path aliases, and 1,000 redirects using the Redirect module — so these are interactively configured, not in .htaccess or in code. About 150 contexts right now — we use the Context module — and 120 views, not that many right now, but we're working up. The same thing with taxonomy terms: we haven't really converted most of our big article libraries yet, and so our vocabularies and terms in taxonomy are pretty low at this point. So, our infrastructure. When we decided to make this changeover, it was actually just before Drupal 7 was released, and we decided not to go with Drupal 6 for various reasons. But the thing that we were somewhat concerned about with Drupal 7 was performance. So we decided still to go with Drupal 7, but we definitely had to do our research and our homework. Right now we currently have 190 total enabled modules: about 134 from contrib, and 56 that have been built by us. And here's just some alphabet soup of the technology that we use to run the site. So when someone like me — I'm not a developer, I'm a communications director — gets presented with this, I really have no idea what needs to happen, except that we have to fix it. So how do we deal with all of this? To say it quickly: we had to take a look at the actual performance implications. Because our team and the school really rely on having content be up to date at all times, we had to make sure that we could actually do that. So when we looked at performance in general, when we heard that there might be more concerns with Drupal 7 than Drupal 6, we made sure that we took a look at other organizations' infrastructure.
So of course, one of the things that we added into the mix was reverse proxy cache serving. Unfortunately, it also brings some problems with updated content. So let me first just go over how normal page serving works, which many of you know. Normally, you would have your Drupal web app servers, probably with a load balancer in front of them. And all requests are processed directly by those Drupal servers, even for pages that haven't changed. Of course, that's not very performant. In this case, an editor updates, for example, an event node, and that event node gets processed by Drupal. And a visitor, let's say, requests that same event page. Even though it may have been requested 150 times in the last minute, Drupal still processes that request. So that's not very performant, because the same thing happens over and over again with the same response. With reverse proxy caching, what we do is add another tier here: the reverse proxy cache server, which may be Varnish, or Squid, or Nginx — and you may even want to look at some work that's been done with Apache recently. In this case, the editor still updates an event, and that passes through the reverse proxy cache because it's a POST HTTP request. And normally, when you configure this sort of setup, the update somehow triggers the cache server to expire the previous version of the event page. So now when a visitor requests that same particular event page, the request is routed to the reverse proxy cache server. And because the cache server had already received a request to invalidate that page from its cache, it now requests the new updated page from the Drupal servers, stores it in its cache, and then directs the response back to the anonymous visitor. But sometimes this doesn't always work, right?
Because sometimes we're not actually talking about the one event node that was updated. Maybe it was embedded content. Or maybe it was a rendered token. Or maybe it wasn't even anything that had to do with a particular piece of content — maybe it was more like an attribute, a sort of environmental attribute. In this case, a visitor requests a different page that happens to embed the event in some way. It goes to the reverse proxy cache server. But this time the cache server doesn't know to expire that page or to request a new version of it, because it never received a request from Drupal saying to invalidate anything with that event on it. So it's just doing its job, and it serves right back the exact page that it had, even though the event that may be included somewhere on it has been updated. Some of you might be saying to yourselves: so what? Do we really have to worry about these edge cases? Can't clients just wait 15, 20 minutes for content to expire by itself? And the answer is no. And the reason why is that whether you're a communications professional or a web developer, you're working for an organization or an entity that has a brand. And brand management is everybody's responsibility. Any piece of content, any interaction that you have with a client or a customer, is an exchange — a contract, if you will. You're setting up expectations about what you're going to deliver to that person. So when you have public-facing information and it's not up to date, you're diminishing the value of your brand. Every touch point that you have with a customer should be a positive experience, so that they trust they're going to be getting what they're purchasing from you. Brand stakeholders — the people who pay our salaries — also want to know that updates have been confirmed, even for pieces of content that some might consider routine.
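To make that failure mode concrete, here's a toy sketch — hypothetical page names and data, nothing from our actual stack — of a plain URL-keyed reverse proxy cache. Invalidating the event's own URL on save does nothing for another page that happens to embed the same event:

```python
# Toy URL-keyed reverse proxy cache (illustrative only; pages and data
# are made up). The cache knows nothing about what a page contains.
EVENTS = {42: "Tax Law Conference"}   # stand-in for Drupal's database
cache = {}                            # URL -> rendered HTML

def render(url):
    # Stand-in for Drupal rendering; the home page embeds event 42.
    title = EVENTS[42]
    if url == "/events/42":
        return f"<h1>{title}</h1>"
    return f"<h1>Home</h1><div class='embed'>{title}</div>"

def get(url):
    if url not in cache:              # miss: ask Drupal, store the result
        cache[url] = render(url)
    return cache[url]                 # hit: Drupal is never consulted

get("/events/42")
get("/")                              # home page cached with the old title

EVENTS[42] = "Tax Law Conference (RESCHEDULED)"
cache.pop("/events/42", None)         # path-based invalidation on node save

fresh = get("/events/42")             # re-rendered: shows the new title
stale = get("/")                      # still served from cache: old title
```

The event page itself comes back fresh, but the home page keeps serving the old title until its TTL runs out — exactly the edge case described above.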
But so you say: OK, that all makes sense now. Aren't there contrib modules that already do this? Well, there are, to an extent. There's Cache Expiration — the Expire module — right? It actually does a really great job. And you pair it up with the Purge module, and also the Varnish HTTP Accelerator Integration module. So Expire tries to predict the expired paths — the ones that are likely to be expired by an event such as an update, a delete, or an insert hook. And it works well for most cache expiration and invalidation events. I'm using the two terms interchangeably, although I do realize that there's a difference; for this purpose, I'll use them interchangeably. Unfortunately, this doesn't really work for complex scenarios, right? It's impossible to predict every single path that a piece of content may appear on. You can do a pretty good job with it, but for embedded content, for rendered entity tokens, for environmental attributes, there's really no way of figuring that out. And Purge, just to explain, sends a PURGE HTTP request over to Varnish, or Nginx, or whatever reverse proxy cache server you're using. Varnish HTTP Accelerator Integration communicates instead over an admin socket, and it can implement core's cache API and act as a sort of pseudo page cache, even though it doesn't really cache anything that Drupal can actually pull up. But it's only as smart as core, and the contrib modules that enhance core, when expiring page caches. So how have other web teams traditionally coped with this problem? Well, there are a couple of ways. The first is shorter TTLs, or Times To Live. This is the amount of time that a page can remain in the cache before it's booted out. But unfortunately, this also applies on a sort of per-user basis once that content gets delivered to the user.
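As an aside on the Purge mechanics just mentioned: a PURGE is simply an ordinary HTTP request with a nonstandard method, aimed at the proxy rather than the origin. A minimal sketch of the raw request a purger might emit (the path and host are placeholders, and your proxy has to be configured to accept PURGE at all):

```python
def build_purge_request(path, site_host):
    # Assemble the raw bytes of an HTTP/1.1 PURGE request. Sending this
    # to the proxy's listening port asks it to drop that one cached URL.
    return (f"PURGE {path} HTTP/1.1\r\n"
            f"Host: {site_host}\r\n"     # which site's cache entry to drop
            f"Connection: close\r\n"
            f"\r\n")

print(build_purge_request("/events/42", "www.example.edu"))
```

Note that this purges one URL at a time, which is precisely why it alone can't solve the embedded-content problem.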
So even if an important change is needed, site editors and owners can't override the setting once a user's browser receives the page. If we set the TTL to 15 minutes, that means it could take up to 15 minutes for that particular page to be updated in the reverse proxy cache — or in the user's browser, if the TTL was propagated to the browser after the user received the page from the reverse proxy cache. So, does 15 minutes work? And Libby would respond: no. Then there's mass invalidation, right? This is just sort of wiping everything out. But this is really not performant, because if you have a large site with a huge cache, you're going to have considerable performance issues trying to warm that cache back up again. So that doesn't work either. Well, here's one that I sort of like, because I don't have to deal with it: essentially, manual search and handpicked pages. So if you really know your content, you can go through the pages and maybe have a link to purge each of those pages — a little block that allows you to do that. Unfortunately, this doesn't really work when you're dealing with a huge site with multiple editors, because it takes considerable time, and it's still sort of impossible to ensure that you've got all the affected pages under control. Then there's another one: you could hard-code the rules into the reverse proxy cache, or you could add some additional logic in your module to manipulate page headers in Drupal. But this is really difficult to manage. And it's equally difficult to explain to stakeholders and content editors what actually triggers something to be updated in the reverse proxy cache, because they don't tend to be very technical all the time. And the other thing is that if you're onboarding new team members often, it presents a significant learning curve to them. So my role in this endeavor is that I'm the one who gets all the complaints.
And nobody wants to know how we do it or why we do it — they just want it fixed. So I come to Christian and I say: this is nice, all these things sound great, but they're not enterprise-level solutions. So what do we do? There's got to be something that can automatically update content; it just can't be that hard. What is it? So that's when I went to d.o and also took a look at what the W3C was doing, and what other big CDNs like Akamai were doing. Some of the things other people were doing were things like cache channels, which sort of allow you to have a grouping and then just clear the grouping, and some other invalidation techniques. And one that came to mind pretty quickly was this idea of tagged cache invalidation. So we started to just tag things. Let me explain that a little bit more. First, we add a tag — in this case we'll call it a cache tag, but it's a tag for that piece of content. We still have the same sort of setup here, with the reverse proxy cache server in the middle. Now an anonymous visitor requests a page, and let's just say for this instance that the reverse proxy cache server has never dealt with this page before. Since it doesn't have it in its cache, it goes to Drupal to request it and actually process that request. So now what happens is that Drupal adds a tag to it. At the most basic level, if it's an event page, it's going to at least add an event tag to it. That page is then served to the anonymous user. The reverse proxy cache stores that tag — in our case we actually use X-Invalidates and X-Invalidated-By as our HTTP tag headers — stores that information, and sends the response back to the visitor. So how does that all work, right? Well, let's take a look at the scenario where an editor updates an event again.
So that POST gets sent all the way through — it passes through the reverse proxy cache server and goes directly to Drupal. And that update, however you'd like to do it, triggers the cache server to expire and invalidate the previous version of the event page, and any page tagged with that basic tag. So now when an anonymous visitor comes along and requests the event page, that's routed to the reverse proxy cache server. And since it was updated by an editor just moments before, the cache server knows it had either purged or banned that page from its cache. So it goes to Drupal to process the request, Drupal responds and adds a header tag, and the reverse proxy cache server stores it and sends the response back to the visitor. But here's how this sort of fails, right? If we're only adding one tag for that main piece of content, it still doesn't work for embedded content. If a visitor requests a different page that just happens to embed the updated event, or an attribute of some kind, then it's still routed to the reverse proxy cache server. Except since it's a different page — with possibly a different tag on it, or a different URL — the cache server still doesn't know to expire that different page, since it wasn't tagged as having included that embedded content. And it still serves an old version of it. So that sort of still fails. We had to figure out how to deal with this issue. And when we looked at a couple of other modules, like Cache Tags, and looked at how the Expire module works, we saw that we couldn't necessarily use anything that worked exactly with core's caching API. There are a couple of reasons for that. We don't always cache everything in core. There are also a lot of things that aren't really cacheable, such as elements of which theme you're using. So we had to figure out another way. That's how we basically built HTTP Cache Tags, an experimental Drupal 7 module that makes reverse proxy cache tags, if you will, work a hell of a lot better, and solves these complex cache invalidation issues. It builds on some of the best innovations out there from our Drupal community. And it can automatically tag and invalidate pages based on rendered entities, the current theme, blocks, taxonomy terms, contexts, and views — it can even do menu items, though that can get a little hairy sometimes when you have a lot that you're actually rendering. So let's take a look at how that works. First, it adds all the relevant but missing tags. And this is through a combination of any hooks that you desire — whichever hooks you actually want to use to add tags. We have a few that we hook into right away, but you can add more as well, because we've added a hook that allows you to alter these tags, and also another function that allows you to add tags on the fly in your own hooks. So this time, an anonymous visitor requests a page, and it still gets routed through the reverse proxy cache server. But this time, Drupal adds a bunch of different tags. In this case, it still adds that event tag, because the visitor requested some type of event page. In our case, it also tagged the page as having a person embedded — relating to a professor, Professor Doe in this case — and a term: maybe that event was a conference related to tax law. And a bunch of other things: maybe there was a block that rendered a news item on the page, so we include that, because that news item might change. And maybe it also rendered a custom block of some kind with some arbitrary HTML, and any other additional tags — maybe the context that it was in. So now Drupal serves those tags, the reverse proxy cache server stores them, and the response is sent back to the anonymous visitor. So let's see just how that all works now that we've added those tags.
So we still have the same setup, with the reverse proxy cache server tier in the middle. An editor updates an event; that's routed through and passes through the reverse proxy cache server. This time, however you'd like to do it — whether you use Purge, or some other module, or maybe a queue, because you don't necessarily want those invalidation requests to hold up the actual flow of requests to the reverse proxy cache — maybe you put them in a queue, but it's really up to you. At that point, all pages tagged with that event tag are invalidated or expired, however you set that up. So now when a visitor requests that event page, it goes to the reverse proxy cache server. And since we had invalidated or expired that page from the reverse proxy cache server by using that tag, the cache server requests the updated event page from Drupal. Drupal processes it and sends along multiple tags, including, specifically, the event tag. The cache server stores them and sends the response back up to the visitor. So let's run the litmus test. Those first two scenarios worked before in every setup. But the real problem was the embedded content, or that attribute that really wasn't cacheable in any other kind of way. So let's see if it still holds up. The visitor requests, let's say, a different page that happens to embed that piece of content. At this point, it still holds up, because that request is routed, again, to the reverse proxy cache server. And the cache server requests the page from Drupal, because that page had been tagged with that event tag, even though it wasn't an event node page.
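In miniature, the tag-based flow we just walked through can be sketched like this (tag names and pages are invented for illustration; in our module the tags actually travel in the X-Invalidates and X-Invalidated-By headers):

```python
# Toy tag-aware cache: each stored page remembers the tags the backend
# sent with it, and invalidation happens by tag rather than by URL.
cache = {}  # URL -> (html, frozenset of tags)

def store(url, html, tags):
    cache[url] = (html, frozenset(tags))

def invalidate_tag(tag):
    # Ban every cached page carrying `tag`, whatever its URL.
    for url in [u for u, (_, tags) in cache.items() if tag in tags]:
        del cache[url]

# The event's own page and an unrelated page embedding it both carry
# the event's tag, so a single invalidation catches both of them.
store("/events/42", "<h1>Conference</h1>", {"node:42", "theme:law"})
store("/", "<div>Conference teaser</div>", {"node:1", "node:42", "theme:law"})
store("/about", "<h1>About</h1>", {"node:7", "theme:law"})

invalidate_tag("node:42")  # editor saved the event node
```

After the invalidation, the next request for either `/` or `/events/42` is a cache miss, so the proxy fetches fresh copies — with fresh tags — from Drupal, while `/about` stays cached.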
And Drupal goes ahead and sends all the requisite tags up again — and there may be another tag included this time, because maybe a news story changed. The cache server stores those tags, stores the response from Drupal, and then serves it back up to the visitor. So it still works in this case. This experimental module also provides a hook, like I said, and a function that lets you very easily add an X-Invalidates or X-Invalidated-By header — that's really just to help our module, because you could do that very easily yourself. It adds a sample reverse proxy cache configuration rule for Varnish, so that you can see how to set that up. And we'll have Rules and Context support. It's available for you to take a look at, conceptually, at law.columbia.edu, forward slash open source, and also directly at our sandbox URL. So, some things you might want to take a look at. This doesn't directly implement the cache API, as we said, because not everything is necessarily cacheable, nor do we want to make it cacheable in core. But one thing you should certainly take a look at is the Cache Tags module, which in Drupal 7 changes core's cache API. It does support Varnish now, but it's still limited to the items that are actually cacheable by core, or made cacheable through contrib modules. You should also definitely take a look at the cache API in Drupal 8 — here are just a couple of nodes that you should look at — with cache tags being included in invalidation, and the new, cleaner API. In addition, you might want to take a look at tagged cache invalidation in general, because there is a W3C — well, it seems like there is going to be a spec proposed to the W3C. Because essentially right now, we use X-Invalidates and X-Invalidated-By, and some of you may know that the W3C has essentially said that custom X- HTTP headers are deprecated. So we don't ever send those headers directly to a browser.
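For reference, here's a rough sketch of what the Varnish side of this could look like — Varnish 3-era VCL, using our header names, but simplified and illustrative rather than a copy of the module's shipped sample configuration:

```vcl
acl invalidators { "127.0.0.1"; }   # only app servers may invalidate

sub vcl_recv {
  if (req.request == "BAN") {
    # An app server sends: BAN / with an X-Invalidates: <tag> header.
    if (!client.ip ~ invalidators) { error 405 "Not allowed."; }
    # Ban every cached object whose stored tag header matches the tag.
    ban("obj.http.X-Invalidated-By ~ " + req.http.X-Invalidates);
    error 200 "Banned.";
  }
}

sub vcl_deliver {
  # Strip the custom tag headers so the client never sees them.
  unset resp.http.X-Invalidated-By;
  unset resp.http.X-Invalidates;
}
```

The ban matches against the tag list Varnish stored with each object, and the `vcl_deliver` step is where the headers get removed before the response goes out.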
Once they get to Varnish, we remove them, so that the browser — the client — never sees them; they're stripped. So hopefully what will happen is that a spec and a proposal will be drafted that will extend the current Cache-Control extension mechanism for HTTP. That would be the best thing that could happen at this point. So now we're happy to take questions, about how we implement things, or why we did something a different way. So the question was whether we could actually tag a view, and specifically the view's content. In that case, what I would suggest is that you use Views Content Cache, I believe it's called, which is a module that provides a cache back end for views. And in that case, you would probably use the Cache Tags module — you would have to patch core if you're using Drupal 7 — and then I would change the Varnish back end to send over the X-Invalidates header instead of the X-Cache-Tags header. Thank you for the presentation. Thank you. I'm very interested in a practical example. Can you open a page and explain where and which tags you add, in which hooks you add them, and in which cases you invalidate them? Right. I think it's a little hairy to do that right here, so I would take a look at the code, and I can certainly answer any questions via email as well. But for instance, if we go through a sample page that has, let's say, a node with an event, and maybe a block as well with two news stories: what would happen is that we would implement hook_entity_load(), and that would pretty much cover both of those concerns at that point. You could use hook_entity_view(). The problem with hook_entity_view() is that it's not always going to catch everything, because that's really only rendered entities, and so you may not get the tag if you're just loading something up. Now the thing is, you still have to figure out whether what you're actually showing to the user is loaded or just viewed.
So in many cases, you may want to change the actual hook that you're using. But if you use load, you're probably never going to get it wrong — except that you're probably going to invalidate a lot more pages than you would with view. Thank you. Thank you. Nice work. I had two questions. The first is: are entity reference fields also correctly supported? It sounds like they are, because of your twist of using hook_entity_load() instead of hook_entity_view(). They are. They are. We don't use Entity Reference right now, so that wasn't something that we directly needed to build into it. But because it's using hook_entity_load(), I think it would work. Right. But I haven't tested it. But as you said, it would catch too many things. It could catch too many, right. So that's why you really have to know your site, so that you're using the right hooks in that case. Right. So, second question. I've also been working on the cache tag support in Drupal 8 core, and you mentioned this as well. But it sounds like everything that is problematic in that regard — hook_entity_view() versus hook_entity_load() — is going to be solved in Drupal 8, because there things are being tagged correctly. As of, I think, two or three days ago, we have entity render caching in Drupal core based on cache tags. Yes. Thanks to amateescu — I'm not sure if he's here, but great job, amateescu. That's good. Exactly. So the question is: do you plan to work on a Drupal 8 version of this as well? Because if you start working on that soonish, you're still in a good position to make changes to Drupal core, if necessary. And then you'd have a perfect solution, and you won't have to do any sort of hacks, like using hook_entity_load() instead of view. Instead of view, right. So, if you are up to it. Yeah, I've been trying to keep up pretty well with those issues. So I would say that yes, I would hope to do that. Yeah. Awesome, that would be great.
I mean, this is more of just a concept than anything else. So yeah, I think that would be great. Yeah, absolutely. Yeah, then we would have a reliable internal page cache in Drupal 8 — as in, without Varnish. Right. Your solution would then make sure that it also works perfectly with Varnish. Or external caches, right. Which would be awesome, if all of that just worked without any pains, headaches, and frustrations. Right. Drupal 8. Thanks. Thank you for your questions and comments. Hi, thank you for the presentation. Thank you. I'm still new to Drupal, so how would this work if your page is heavy on AJAX? You have a lot of blocks loading through AJAX, and new stories are always being loaded. How would the cache work in that case? Would it serve new content if there are any changes? Right. Now let me ask you: in your Varnish setup, do you cache your AJAX responses? No. Yeah, see, we don't either — we also don't use AJAX too terribly often, so in our context, we don't. However, if you're still loading entities, that tagging is still going to happen. So if you haven't changed your Varnish setup to not cache those responses, then they would still have those HTTP header tags in them. So if you left everything set correctly, then in theory you would still invalidate them at the correct points, because there's nothing different in that response — you're just serving up a different HTTP content type. Instead of HTML, you're serving up JSON. So in theory, it should work exactly the same. But we don't happen to cache our AJAX responses right now. Thank you. Thank you. Do you have any problems where a tag might affect every page accidentally, and you might end up invalidating your entire site? I remember that first time. I think you might be lying. Yes. What strategies do you have around that?
Are there any ways of doing, say, certain things on different time schedules, or other things? Right. Just to hear your thoughts. We haven't dealt with that terribly much, but we know that it's certainly a concern. In some cases, we want that to happen. So when we add the current theme as a tag, it's really quite nice, because we do have a couple of themes that we use throughout the site, and we don't necessarily want to invalidate the entire cache — but we're still invalidating a huge portion of the pages that are stored in the cache. I think what has to be done is probably decent logging of what cache tags are actually being sent, in some type of aggregated way. I wouldn't save that to the database, though; I would probably save it to Memcache or some system that could quickly update those counts. You would just need to see the aggregate and make sure that mass invalidation is not happening everywhere. But you could also, when you're running your site at peak times, take a look at varnishtop and see what's actually happening. You mentioned using queues. Maybe there's a way of saying: well, actually, you can do this one much later. So expire this tag now, but expire these in 15 minutes. I think that makes a lot of sense. Yeah, that's a great idea. It was just a thought. Yeah, no, I think that's awesome. Definitely. Thank you. Thank you. Next question: what about blocks that are specific to the user, even for anonymous users — like online shelves with recent new products, or recently compared products, that kind of thing? So the question is still about anonymous users, right? You may have a block that is specific to them in some regard. Well, in our case, if they have any cookie that's specific to the user, then we actually bypass the entire cache. We haven't needed that kind of capability, because essentially, with almost any kind of interactive experience we have at Columbia Law School,
we create an account for that user, and so then they're authenticated. So we're probably not the best at answering those sorts of questions when it comes to product and commerce sites. Yeah, but that's fine. Well, you could do it through ESI — Edge Side Includes — as well, right? Which, I mean, it seems like it's going to be a heck of a lot easier to do in Drupal 8 as well, or for different roles, right? Right. Yeah. I mean, this whole concept could be extended to an authenticated cache too, right, if you could figure out the pages that really changed. Because you could cache a page with a user tag or a role tag, in which case it would only invalidate based on that specific scenario. Thank you. It was pretty much the same question, but to do with HTTPS — I take it your anonymous users don't visit the site through HTTPS, since Varnish... Yes. Well, that's sort of different, because you can use HTTPS with Varnish; it seems like it's much easier with Varnish 3. And so we've actually decided to probably move in that direction. What's even better with Varnish 3 is that it finally supports gzipping, so you can get better performance on the user side as well. So there are a lot of pluses in moving in the Varnish 3 direction. Thank you. Well, thank you so much for all of your time. It's been a pleasure. And I hope that if you have any suggestions, anything that you find that we may be able to do better, any questions, or ideas you want to share, you'll contact us. This is my email address, cstock@law.columbia.edu, and also just the Prague URL. And Libby's as well — particularly if you have concerns about what the client expects and what the client needs when it comes to updated content. And one thing that we'd love for you to do is tell us what you think and take the survey. You can do that at prague2013.drupal.org, forward slash schedule.
And then click on the Take the Survey link. So thank you so much for your time. And I hope this has been helpful. And have a great time with the rest of the conference. Thank you.